huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets

Workflow for Tabular data #4456

Open lhoestq opened 2 years ago

lhoestq commented 2 years ago

Tabular data is treated very differently from data for NLP, audio, vision, etc., and therefore the workflow for tabular data in datasets is not ideal.

For example, for tabular data it is common to use pandas/spark/dask to process the data, then load it into X and y (X is an array of features and y an array of labels), then call train_test_split, and finally feed the data to a machine learning model.
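As a rough sketch of that conventional workflow (the file name, target column, and model choice are placeholders, not from this issue):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# load and preprocess with pandas; "data.csv" and "target" are placeholders
df = pd.read_csv("data.csv")
X = df.drop(columns=["target"])  # feature matrix
y = df["target"]                 # label vector

# split, then feed the arrays to a model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = RandomForestClassifier().fit(X_train, y_train)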

In datasets the workflow is different: we use load_dataset, then map, then train_test_split (if we only have a train split), and we end up with columnar dataset splits, not data formatted as X and y.
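For comparison, a minimal sketch of that current datasets workflow (the dataset name and column names are placeholders):

from datasets import load_dataset

ds = load_dataset("user/tabular-dataset", split="train")
# per-row processing with map; "a" and "b" are placeholder columns
ds = ds.map(lambda row: {"a_plus_b": row["a"] + row["b"]})
# split into a DatasetDict with "train" and "test" keys
splits = ds.train_test_split(test_size=0.2)
train_ds, test_ds = splits["train"], splits["test"]
# the result is columnar dataset splits, not X / y arrays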

Right now it is already possible to convert a dataset to and from pandas, but there are still many things that could improve the workflow for tabular data.
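For reference, the existing round-trip looks roughly like this (the dataset name is a placeholder):

from datasets import Dataset, load_dataset

ds = load_dataset("user/tabular-dataset", split="train")
df = ds.to_pandas()            # Dataset -> pandas.DataFrame
# ... process with pandas ...
ds2 = Dataset.from_pandas(df)  # pandas.DataFrame -> Dataset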

cc @adrinjalali @merveenoyan feel free to complete/correct this :)

Feel free to also share ideas for APIs that would be super intuitive in your opinion!

merveenoyan commented 2 years ago

I use the code below to load a dataset:

import datasets
import pandas as pd

dataset = datasets.load_dataset("scikit-learn/auto-mpg")
df = pd.DataFrame(dataset["train"])

TBH, as said, tabular folks split their own datasets; they sometimes have two splits, sometimes three. Maybe avoiding the forced split for tabular datasets would be good later on (it's just a UX improvement).

wconnell commented 1 year ago

Is very slow batch access of a dataset (tabular, CSV) with many columns to be expected?

lhoestq commented 1 year ago

Define "many" ? x)

wconnell commented 1 year ago

~20k! I was surprised that batch loading with as few as 32 samples was really slow. I was speculating that the columnar format was the cause -- or do you see good performance with tabular data of approximately this size?

lhoestq commented 1 year ago

20k can be a lot for a columnar format but maybe we can optimize a few things.

It would be cool to profile the code to see if there's an unoptimized part that slows everything down.

(It's also possible to kill the job while it's accessing the batch; that often gives you a traceback at the location where the code was running.)
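As a rough sketch of how one could profile this (the 1,000 x 20,000 shape is arbitrary, just to mimic the scenario above):

import cProfile
import numpy as np
from datasets import Dataset

# build a synthetic wide dataset: 1,000 rows x 20,000 columns
data = {f"col_{i}": np.random.rand(1_000).tolist() for i in range(20_000)}
ds = Dataset.from_dict(data)

# profile access to a batch of 32 rows to see where the time is spent
cProfile.run("ds[0:32]", sort="cumtime")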

adrinjalali commented 1 year ago

FWIW I've worked with tabular data with 540k columns.

wconnell commented 1 year ago

That's awesome, what's your secret? Would love to see an example!

adrinjalali commented 1 year ago

@wconnell I'm not sure what you mean by my secret; I load them into a numpy array 😁

An example dataset is here, a dataset of DNA methylation reads with about 950 rows and 450k columns.
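A generic sketch of that pattern (the CSV file name is a placeholder for the dataset linked above):

import pandas as pd

# ~950 rows x ~450k columns; "methylation.csv" is a placeholder file name
df = pd.read_csv("methylation.csv")
X = df.to_numpy()  # dense numpy array of shape (n_rows, n_columns)

# alternatively, with 🤗 Datasets, columns can be formatted as numpy:
# from datasets import load_dataset
# ds = load_dataset("csv", data_files="methylation.csv", split="train").with_format("numpy")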