Use JupyterLab tokens to strongly type dataset interfaces

ellisonbg commented 3 years ago

Description

In the current API, there is an informal (not strongly typed) relationship between the MIME type and the data value T. In a separate new issue, #146, I have introduced some ideas for replacing the MIME type with a more structured tuple (abstract data type, serialization format, storage medium), but that doesn't solve the issue of the informal relationship between this tuple and the data type T. Put another way, how do extensions and users of these APIs know what T will be for a given (abstract data type, serialization format, storage medium) combination? For example, if the tuple is (tabular, csv, in memory), T would be a string containing the contents of the CSV file.

To formalize this and better enable strong typing, I propose that we use the "token" trick of JupyterLab, which provides a runtime manifestation of an interface. Other extensions could then register combinations of (abstract data type, serialization format, storage medium) along with the interface (as a runtime token) that they point to. Something like this:


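import { Token } from '@lumino/coreutils';

// Runtime token that stands in for the IInMemoryCSV interface.
export const IInMemoryCSV = new Token<IInMemoryCSV>(
  '@jupyterlab/dataregistry:IInMemoryCSV'
);

export interface IInMemoryCSV {
  value: string;
}

// Proposed API: tie the tuple to the interface via its token.
dataRegistry.registerDataType({
  abstractDataType: "tabular",
  serialization: "csv",
  storage: "inmemory",
  interface: IInMemoryCSV
});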

There are more details to work out about how the token/interface would be threaded through all the APIs, but hopefully this captures the main idea.
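As a rough illustration of how the token could be threaded through a lookup, here is a minimal sketch. It is not the actual data registry API: the `Token` class is a local stand-in for the one in `@lumino/coreutils` so the snippet is self-contained, and `DataRegistry`, `IDataTypeOptions`, and `narrow` are hypothetical names invented for this example.

```typescript
// Local stand-in for Token from '@lumino/coreutils' (assumption, for self-containment).
class Token<T> {
  readonly name: string;
  // Phantom field: carries T for the type checker only, never set at runtime.
  readonly _tokenType?: T;
  constructor(name: string) {
    this.name = name;
  }
}

interface IInMemoryCSV {
  value: string;
}

const IInMemoryCSV = new Token<IInMemoryCSV>(
  '@jupyterlab/dataregistry:IInMemoryCSV'
);

// Hypothetical shape of one registration.
interface IDataTypeOptions<T> {
  abstractDataType: string;
  serialization: string;
  storage: string;
  interface: Token<T>;
}

// Hypothetical registry: maps each tuple to the token registered for it.
class DataRegistry {
  private _types = new Map<string, Token<unknown>>();

  registerDataType<T>(options: IDataTypeOptions<T>): void {
    this._types.set(DataRegistry.key(options), options.interface);
  }

  // Returns `data` typed as T only if the tuple was registered with `token`.
  narrow<T>(
    tuple: Omit<IDataTypeOptions<T>, 'interface'>,
    token: Token<T>,
    data: unknown
  ): T | undefined {
    return this._types.get(DataRegistry.key(tuple)) === token
      ? (data as T)
      : undefined;
  }

  private static key(t: {
    abstractDataType: string;
    serialization: string;
    storage: string;
  }): string {
    return [t.abstractDataType, t.serialization, t.storage].join('|');
  }
}

const dataRegistry = new DataRegistry();
const csvTuple = {
  abstractDataType: 'tabular',
  serialization: 'csv',
  storage: 'inmemory'
};
dataRegistry.registerDataType({ ...csvTuple, interface: IInMemoryCSV });

// The static type of `csv` is IInMemoryCSV | undefined, inferred from the token.
const csv = dataRegistry.narrow(csvTuple, IInMemoryCSV, { value: 'a,b\n1,2' });
```

The point of the sketch is that the caller never writes `T` by hand: passing the token is what tells the compiler which interface the tuple maps to, while the runtime identity check guards against a mismatched registration.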

3coins commented 3 years ago

Is value a common field that all dataset interfaces will share? Is the expectation correct that the value field will always hold the final in-memory representation of the dataset? For example, for a CSV, would this always be a string regardless of the storage medium? Translating this to the S3 + CSV use case:

export interface IS3Props {
    uri: string
}

export interface ICSVProps {
    delimiter: string,
    lineDelimiter: string,
    header: boolean,
    quotes: boolean
}

export const IS3CSV = new Token<IS3CSV>(
  '@jupyterlab/dataregistry:IS3CSV'
);

export interface IS3CSV {
  value: string,
  s3Props: IS3Props,
  csvProps: ICSVProps
}

ellisonbg commented 3 years ago

Great question. My initial thinking is that the interface would be unique for each unique (abstract data type, serialization format, storage medium) tuple. At the same time, if there are data interfaces that make sense for multiple combinations, we could introduce a data type registration approach that allows those combinations to be registered in a single shot. I don't have a great use case at the moment, but from a syntax perspective, something like:

dataRegistry.registerDataType({
    abstractDataType: ["tabular", "text", "object"],
    serialization: "text",
    storage: "URL",
    interface: IText
});
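One way such a single-shot registration might be interpreted is as the cross product of the listed values, so that each concrete tuple still maps to exactly one interface. The sketch below only illustrates that expansion; `IMultiRegistration`, `toArray`, and `expandTuples` are hypothetical names, and whether all three fields should accept lists is an assumption.

```typescript
// Assumption: each field accepts either one value or a list of values.
type OneOrMany = string | string[];

interface IMultiRegistration {
  abstractDataType: OneOrMany;
  serialization: OneOrMany;
  storage: OneOrMany;
}

// [abstract data type, serialization format, storage medium]
type DataTypeTuple = [string, string, string];

function toArray(v: OneOrMany): string[] {
  return Array.isArray(v) ? v : [v];
}

// Expand a single-shot registration into every concrete tuple it covers.
function expandTuples(reg: IMultiRegistration): DataTypeTuple[] {
  const tuples: DataTypeTuple[] = [];
  for (const a of toArray(reg.abstractDataType)) {
    for (const s of toArray(reg.serialization)) {
      for (const m of toArray(reg.storage)) {
        tuples.push([a, s, m]);
      }
    }
  }
  return tuples;
}

const tuples = expandTuples({
  abstractDataType: ['tabular', 'text', 'object'],
  serialization: 'text',
  storage: 'URL'
});
// tuples holds three entries, one per abstract data type.
```

Expanding up front keeps the lookup side unchanged: consumers would still ask about one concrete tuple at a time.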

With that said, this approach does allow the implementers of different data interfaces to extend each other. So maybe:

export interface IInMemoryText {
    value: string
}

export interface IInMemoryCSV extends IInMemoryText {
    delimiter: string
}
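To show what extending one data interface from another would buy consumers, here is a small sketch. The interfaces mirror the ones above (using the name `IInMemoryCSV` and the spelling `delimiter`); `countLines` and `headerCells` are hypothetical consumers written only for this example.

```typescript
interface IInMemoryText {
  value: string;
}

interface IInMemoryCSV extends IInMemoryText {
  delimiter: string;
}

// A consumer written against the base interface only.
function countLines(text: IInMemoryText): number {
  return text.value.split('\n').length;
}

// A consumer that needs the CSV-specific field.
function headerCells(csv: IInMemoryCSV): string[] {
  return csv.value.split('\n')[0].split(csv.delimiter);
}

const csvData: IInMemoryCSV = { value: 'a;b\n1;2', delimiter: ';' };

// Accepted by the compiler: every IInMemoryCSV is also an IInMemoryText.
const lineCount = countLines(csvData);
const cells = headerCells(csvData);
```

A text viewer could therefore handle in-memory CSV data without knowing anything about CSV, while a CSV-aware extension still gets the extra field.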

jupyterlab / jupyterlab-data-explorer

Use JupyterLab tokens to strongly type dataset interfaces #147
