Re-write DataModule from scratch enabling support for Spark DataFrames, Polars, and larger than memory dataframes

manujosephv / pytorch_tabular

A standard framework for modelling Deep Learning Models for tabular data

https://pytorch-tabular.readthedocs.io/

MIT License

1.4k stars, 141 forks

Re-write DataModule from scratch enabling support for Spark DataFrames, Polars, and larger than memory dataframes #402

Open manujosephv opened 10 months ago

manujosephv commented 10 months ago

Is your feature request related to a problem? Please describe.

When the data is very large, it often does not fit in RAM, so we need a way to work with larger-than-memory data. Using an engine like Polars would also speed things up a lot.

Describe the solution you'd like

Re-write the DataModule to be more performant. Out-of-core processing with Spark DataFrames or Polars, combined with NVTabular, might be a good solution.

Describe alternatives you've considered

Currently it's impossible to load larger-than-memory datasets.

saankhya-mondal commented 10 months ago

Thank you for creating the issue. Hoping for a quick resolution and the addition of support for Spark DataFrames.

huylenguyen commented 7 months ago

Hi @manujosephv! I am currently working on a replacement for TabularDataModule for my own use case, which involves loading larger-than-memory datasets from external sources, and I have a question related to this issue.

What exactly is the purpose of the cache_data functionality? The other parameters are well documented, but the use of the cache is a bit unclear to me. Is it meant to avoid repeating the data transformations during training? If so, are there any benchmark results showing the performance cost of applying each data transformation to every batch during training?

manujosephv commented 7 months ago

That's awesome. I hope you can contribute it back in here when you have it working...

cache_data is a parameter I added very recently. By default, the datamodule holds on to the raw data as attributes, and when we save the model we also save the datamodule. For very large datasets, that poses a problem.

cache_data was added to let the user choose where to save the data (in memory, on disk, or not at all).

In your case, I think we can ignore that param and functionality: if the dataset is too large to fit in memory, this whole functionality isn't needed anymore.

huylenguyen commented 7 months ago

I will get back to you if I figure it out :)

There are a few tricky parts, like the transforms, which need access to the data in order to be fitted. I haven't looked at all the available transforms yet, but a naive option for very large datasets is to sample from the external data source to build an approximation dataset, fit the transforms on that, and then call .transform() on each batch as the DataLoader pulls it from the data stream during training. Depending on the sampling, the approximation dataset may not be representative, but that is up to the user to decide.
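To make the sampling idea concrete, here is a rough sketch using only the standard library. It is not pytorch_tabular code: stream_rows, reservoir_sample, SampleFittedScaler and batches are all hypothetical names made up for illustration, and a single numeric column stands in for a real table. The idea is to draw a fixed-size sample in one pass over the source, fit a scaler on that sample, and then apply transform() lazily to each batch.

```python
import random
import statistics
from typing import Iterable, Iterator, List


def stream_rows(n_rows: int, seed: int = 0) -> Iterator[float]:
    """Hypothetical stand-in for an external source too large to load at once."""
    rng = random.Random(seed)
    for _ in range(n_rows):
        yield rng.gauss(50.0, 10.0)


def reservoir_sample(rows: Iterable[float], k: int, seed: int = 0) -> List[float]:
    """Draw k rows uniformly at random in a single pass (reservoir sampling)."""
    rng = random.Random(seed)
    sample: List[float] = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


class SampleFittedScaler:
    """Standard scaler whose statistics come from a sample, not the full data."""

    def fit(self, sample: List[float]) -> "SampleFittedScaler":
        self.mean_ = statistics.fmean(sample)
        self.std_ = statistics.pstdev(sample) or 1.0  # guard against zero spread
        return self

    def transform(self, batch: List[float]) -> List[float]:
        return [(x - self.mean_) / self.std_ for x in batch]


def batches(rows: Iterable[float], batch_size: int) -> Iterator[List[float]]:
    """Group a row stream into lists of at most batch_size rows."""
    batch: List[float] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


# Pass 1: fit on a sample only. Pass 2: transform batch by batch, lazily.
scaler = SampleFittedScaler().fit(reservoir_sample(stream_rows(100_000), k=1_000))
transformed_batches = (scaler.transform(b) for b in batches(stream_rows(100_000), 256))
first = next(transformed_batches)
```

The sample size k controls the trade-off mentioned above: a small k is cheap but may miss rare values, which matters more for categorical encoders than for a simple scaler like this one.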

There are more comprehensive options, such as letting the user provide already-fitted data transformation objects. However, I am not sure how well this fits with the project's philosophy of least friction.
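As a rough illustration of the "already-fitted objects" option, the sketch below assumes the user has computed the transform parameters offline and hands them over per column. FittedMinMax and apply_fitted are hypothetical names, not part of pytorch_tabular, and rows are plain dicts for simplicity.

```python
from typing import Dict, Iterable, Iterator

Row = Dict[str, float]


class FittedMinMax:
    """Hypothetical transform whose parameters were computed offline by the user."""

    def __init__(self, minimum: float, maximum: float) -> None:
        if maximum <= minimum:
            raise ValueError("maximum must be greater than minimum")
        self.minimum = minimum
        self.maximum = maximum

    def transform(self, value: float) -> float:
        return (value - self.minimum) / (self.maximum - self.minimum)


def apply_fitted(rows: Iterable[Row], transforms: Dict[str, FittedMinMax]) -> Iterator[Row]:
    """Apply user-supplied fitted transforms row by row; never calls fit()."""
    for row in rows:
        yield {
            col: transforms[col].transform(val) if col in transforms else val
            for col, val in row.items()
        }


# The user supplies the fitted object; columns without one pass through unchanged.
fitted = {"age": FittedMinMax(0.0, 100.0)}
rows = iter([{"age": 25.0, "income": 4.0}, {"age": 50.0, "income": 7.0}])
out = list(apply_fitted(rows, fitted))
```

This pushes the fitting work onto the user, which is the friction concern raised above, but it means the data module never needs a full pass over the data.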