jupyterlab / jupyterlab-data-explorer

First class datasets in JupyterLab
BSD 3-Clause "New" or "Revised" License

Core dataset API #146

Closed ellisonbg closed 1 year ago

ellisonbg commented 3 years ago

Description

The current dataset API looks something like this:

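// Unique identifier of a dataset.
type URL_ = string;
// MIME type used to encode the data type expressed in T.
type MimeType_ = string;
// Numeric cost attached to a value.
type Cost = number;
// A value paired with its cost.
type DataValue<T> = [Cost, T];
// One dataset: its available representations, keyed by MIME type.
type Dataset<T> = Map<MimeType_, DataValue<T>>;
// All known datasets, keyed by URL.
type Datasets<T> = Map<URL_, Dataset<T>>;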
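To make the shape of these types concrete, here is a minimal sketch of how a registry could be populated and queried. The URL, the MIME types, the cost values, and the `cheapest` helper are all hypothetical and are not part of the project's API. The sketch also assumes that a lower cost is preferable, which the types above do not state.

```typescript
type URL_ = string;
type MimeType_ = string;
type Cost = number;
type DataValue<T> = [Cost, T];
type Dataset<T> = Map<MimeType_, DataValue<T>>;
type Datasets<T> = Map<URL_, Dataset<T>>;

// Hypothetical dataset exposing the same data under two MIME types.
const exampleDataset: Dataset<string> = new Map();
exampleDataset.set("text/csv", [0, "a,b\n1,2\n"]);
exampleDataset.set("application/json", [1, '[{"a":1,"b":2}]']);

// Hypothetical registry keyed by URL.
const exampleDatasets: Datasets<string> = new Map();
exampleDatasets.set("file:///data/example.csv", exampleDataset);

// Hypothetical helper: return the MIME type and value with the lowest cost,
// assuming (for illustration only) that lower cost is better.
function cheapest<T>(dataset: Dataset<T>): [MimeType_, T] | undefined {
  let best: [MimeType_, Cost, T] | undefined;
  for (const [mimeType, [cost, value]] of Array.from(dataset.entries())) {
    if (best === undefined || cost < best[1]) {
      best = [mimeType, cost, value];
    }
  }
  return best === undefined ? undefined : [best[0], best[2]];
}
```

Note that the MIME type string is the only place where the kind of data held in T can be described, which is the limitation discussed next.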

The core idea is that the MIME type encodes the data type expressed in T. This idea was inspired by Jupyter's MIME-type-based output system. However, as we have used this API, we have found that it quickly becomes painful to express realistic data types. In practice, we find that there are three dimensions of a dataset "type": the abstract data type, the serialization format, and the storage medium.

Given this background, we found that we had to invent awkward multipart MIME types that encoded all of this information into a single informal string format. We would like to propose a new API for data types that separates the abstract data type, serialization format, and storage medium into distinct fields. In addition to these three fields, there would still be a unique identifier (URL/URI) and the generic T, which contains the actual data or a pointer to it.
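As a rough illustration of the proposal, the separation into distinct fields might look something like the sketch below. The interface name, the field names, and the example values ("tabular", "csv", "memory") are assumptions made for this sketch only; the issue text does not specify them.

```typescript
// Hypothetical shape: one field per dimension of the dataset "type",
// plus the unique identifier and the generic payload.
interface DatasetSketch<T> {
  url: string;                 // unique identifier (URL/URI)
  abstractDataType: string;    // what the data is, independent of encoding
  serializationFormat: string; // how the data is encoded
  storageMedium: string;       // where the data lives
  data: T;                     // the actual data or a pointer to it
}

// Illustrative values only.
const tableSketch: DatasetSketch<string> = {
  url: "file:///data/example.csv",
  abstractDataType: "tabular",
  serializationFormat: "csv",
  storageMedium: "memory",
  data: "a,b\n1,2\n",
};

// Hypothetical helper: each dimension can be matched on its own field,
// with no need to parse a combined string.
function hasFormat<T>(dataset: DatasetSketch<T>, format: string): boolean {
  return dataset.serializationFormat === format;
}
```

With this layout, a consumer interested only in the serialization format can ignore the other two dimensions, which a single combined MIME-type string does not allow without parsing.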