The current dataset API looks something like this:
type URL_ = string;
type MimeType_ = string;
type Cost = number;
type DataValue<T> = [Cost, T];
type Dataset<T> = Map<MimeType_, DataValue<T>>;
type Datasets<T> = Map<URL_, Dataset<T>>;
The core idea is that the MimeType_ key encodes the data type of the value expressed in T. This idea was inspired by Jupyter's MIME-type-based output system. However, in using this API we have found that expressing realistic data types quickly becomes painful. In practice, we find that a dataset "type" has three dimensions:
The abstract data type (tabular, relational database, image, tensor, text, and collections thereof).
The serialization format (CSV, JSON, PNG, JPEG, etc.).
The storage medium (in memory, filesystem, S3, URL, API, etc.).
Given this background, we found ourselves inventing awkward multipart MIME types that encoded all of this information in a single informal string format. We would like to propose a new design for the data types that separates the abstract data type, serialization format, and storage medium into distinct fields. In addition to these three fields, there would still be a unique identifier (URL/URI) and the generic T, which contains the actual data or a pointer to it.
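To make the proposal concrete, here is a minimal sketch of what such a record might look like. Everything in it is an assumption for discussion, not a settled design: the names `DatasetEntry`, `AbstractDataType`, `SerializationFormat`, `StorageMedium`, and `select`, the specific union members (taken from the examples listed above and not exhaustive), and the example URLs and values are all hypothetical.

```typescript
// Hypothetical sketch only: none of these names exist in the current API.
type URL_ = string;
type Cost = number;

// The three dimensions, each as its own field rather than packed into one
// MIME-type string. The members mirror the examples above; not exhaustive.
type AbstractDataType =
  | "tabular" | "relational-database" | "image" | "tensor" | "text" | "collection";
type SerializationFormat = "csv" | "json" | "png" | "jpeg";
type StorageMedium = "memory" | "filesystem" | "s3" | "url" | "api";

interface DatasetEntry<T> {
  url: URL_;                     // unique identifier (URL/URI)
  abstractType: AbstractDataType;
  format: SerializationFormat;
  storage: StorageMedium;
  cost: Cost;                    // carried over from the current DataValue tuple
  data: T;                       // the actual data, or a pointer to it
}

// Return the entries that match every field given in `criteria`.
function select<T>(
  entries: DatasetEntry<T>[],
  criteria: Partial<Pick<DatasetEntry<T>, "abstractType" | "format" | "storage">>
): DatasetEntry<T>[] {
  return entries.filter(
    (e) =>
      (criteria.abstractType === undefined || e.abstractType === criteria.abstractType) &&
      (criteria.format === undefined || e.format === criteria.format) &&
      (criteria.storage === undefined || e.storage === criteria.storage)
  );
}

// Made-up example data: one dataset in two representations, plus an image.
const entries: DatasetEntry<string>[] = [
  { url: "example://sales", abstractType: "tabular", format: "csv", storage: "s3",
    cost: 10, data: "s3://hypothetical-bucket/sales.csv" },
  { url: "example://sales", abstractType: "tabular", format: "json", storage: "memory",
    cost: 1, data: '[{"region":"east","total":3}]' },
  { url: "example://logo", abstractType: "image", format: "png", storage: "filesystem",
    cost: 2, data: "/hypothetical/path/logo.png" },
];

const tabular = select(entries, { abstractType: "tabular" });
const inMemoryJson = select(entries, { format: "json", storage: "memory" });
```

With separate fields, a consumer can filter on any one dimension (for example, "anything tabular, regardless of format or storage") without parsing an informal string. Whether closed unions or open strings are the better fit for each field is left open here.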