@saulshanabrook I have a question that looks related to your post: when using jupyterlab-data-explorer, how can I convert a pandas DataFrame to JSON in the notebook, so that I can use the different graphic options of nteract's data-explorer? As an example, see the example jupyterlab-data-explorer notebook here, where you have the different graphical options in the red rectangle. On the other hand, these graphical options are not available for plain DataFrames (example in my personal notebook). How can I make them available?
@Nestak2 You have to set pandas.set_option('display.html.table_schema', True) so that it outputs the JSON, like in these examples.
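In a notebook cell it looks like this (the DataFrame contents here are just a toy example):

```python
import pandas as pd

# Ask pandas to emit the Table Schema MIME type
# (application/vnd.dataresource+json) alongside its normal HTML repr.
pd.set_option('display.html.table_schema', True)

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
df  # displaying the DataFrame now also publishes the JSON table schema
```

Any cell that displays a DataFrame after that will include the table schema output, which is what the data explorer picks up.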
@saulshanabrook Thank you very much! I didn't know the purpose of this line; now the graphical features are there!
Great! Glad it's working for you. I have added this to the usage docs to hopefully make it more clear in the future: https://github.com/jupyterlab/jupyterlab-data-explorer/pull/135
FWIW,
Odo migrates data using a network of small data conversion functions between type pairs. That network is below:

[odo conversions graph]
Each node is a container type (like pandas.DataFrame or sqlalchemy.Table) and each directed edge is a function that transforms or appends one container into or onto another. We annotate these functions/edges with relative costs.
This network approach allows odo to select the shortest path between any two types (thank you networkx). For performance reasons these functions often leverage non-Pythonic systems like NumPy arrays or native CSV->SQL loading functions. Odo does not depend solely on Python iterators.
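For a sense of the API, a minimal sketch (the file and database names below are made up for illustration):

```python
import pandas as pd
from odo import odo

# Load a CSV into a DataFrame; odo finds the cheapest conversion
# path through its network (CSV -> DataFrame is a single edge).
df = odo('accounts.csv', pd.DataFrame)

# Append the same data onto a SQLite table, addressed by URI.
odo(df, 'sqlite:///accounts.db::accounts')
```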
BlazingSQL is a GPU-accelerated SQL engine built on top of the RAPIDS ecosystem. RAPIDS is based on the Apache Arrow columnar memory format, and cuDF is a GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data.
BlazingSQL is a SQL interface for cuDF, with various features to support large scale data science workflows and enterprise datasets.
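Roughly, the interface looks like this (a sketch; the file and table names are invented for illustration):

```python
import cudf
from blazingsql import BlazingContext

bc = BlazingContext()

# Read a CSV into a GPU DataFrame and expose it as a SQL table.
gdf = cudf.read_csv('taxi.csv')
bc.create_table('taxi', gdf)

# Queries run on the GPU and return cuDF DataFrames.
result = bc.sql(
    'SELECT passenger_count, AVG(fare_amount) AS avg_fare '
    'FROM taxi GROUP BY passenger_count'
)
print(result)
```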
CSVW would be ideal for tabular data (with Linked Data metadata about the dataset and each column). More about this here: "Linked Data formats, tools, challenges, opportunities; CSVW, schema.org/Dataset, schema.org/ScholarlyArticle" https://discuss.ossdata.org/t/linked-data-formats-tools-challenges-opportunities-csvw-schema-org-dataset-schema-org-scholarlyarticle/160
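To make CSVW concrete, here is a minimal metadata document (written as a Python dict; the table and columns are invented for illustration):

```python
import json

# Minimal CSVW (CSV on the Web) metadata: a JSON-LD document that
# annotates a CSV file and each of its columns. Names are illustrative.
metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "countries.csv",
    "tableSchema": {
        "columns": [
            {"name": "country", "titles": "Country", "datatype": "string"},
            {"name": "population", "titles": "Population", "datatype": "integer"},
        ],
        "primaryKey": "country",
    },
}

# Conventionally saved next to the CSV as countries.csv-metadata.json.
with open("countries.csv-metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```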
@ellisonbg mentioned that it would be good to support some default tabular data formats and to convert between them.
- CSV string (done)
- JSON table schema (done)

For each of these, we should define a data type, and define converters between them. Then we should make sure they work on some test datasets (a converter sketch follows below).
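As a hedged sketch of what one converter pair could look like, using pandas' Table Schema support as the bridge (the function names are hypothetical, not from this repo):

```python
import io
import pandas as pd

def csv_to_table_schema(csv_string: str) -> str:
    """Convert a CSV string to a Table Schema data-resource JSON string."""
    df = pd.read_csv(io.StringIO(csv_string))
    # orient='table' emits {"schema": {...}, "data": [...]}.
    return df.to_json(orient='table')

def table_schema_to_csv(json_string: str) -> str:
    """Convert a Table Schema JSON string back to a CSV string."""
    df = pd.read_json(io.StringIO(json_string), orient='table')
    return df.to_csv(index=False)

# Round-trip on a tiny test dataset:
csv = "a,b\n1,x\n2,y\n"
assert table_schema_to_csv(csv_to_table_schema(csv)) == csv
```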
Some pipelines that should work after this:
- If a Vega spec's data URL points to file:///notebooks/Table.ipynb#/cells/4/outputs/0/data/application/vnd.dataresource+json, then this should use the pandas output from that cell in the notebook as an input to the Vega spec. Depends on https://github.com/jupyterlab/jupyterlab-data-explorer/issues/20