Open manujosephv opened 8 months ago
Thank you for creating the issue. Hoping for a quick resolution and the addition of support for Spark DataFrames.
Hi @manujosephv! I am currently working on a replacement of `TabularDataModule` for my own use case, loading larger-than-memory datasets from outside sources, and I have a question related to this issue.
What exactly is the use of the `cache_data` functionality? The other parameters here are well documented, but the use of the cache is a bit unclear to me. Is it to avoid performing the data transformations repeatedly during learning? If so, are there any benchmark results comparing the performance drawbacks of performing each data transformation for each batch during learning?
That's awesome. I hope you can contribute it back in here when you have it working...
And `cache_data` is a parameter I very recently added. By default the datamodule holds on to the raw data as attributes, and when saving the model we also save the datamodule. For very large datasets, that poses a problem.
`cache_data` was added to let the user choose where to save the data (in memory, on disk, or not at all).
In your case, I think we can ignore that param and functionality because if the dataset is considered out of memory, then this whole functionality isn't needed anymore.
I will get back to you if I figure it out :)
There are a few tricky parts, like the transforms that require access to the data. I haven't looked at all the available transforms yet, but a naive option for very large datasets is to sample from the external data source to build an approximation dataset, fit the transforms on that, and then call `.transform()` on the data as it is streamed in by the `DataLoader` during training. Depending on the sampling, the approximation dataset may not be representative, but that is up to the user to decide.
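To make the sample-then-transform idea concrete, here is a rough sketch using NumPy and scikit-learn. It is not code from this project: `stream_batches` is a made-up stand-in for the external source, `sample_rows` and `transformed_batches` are hypothetical helpers, and `StandardScaler` is just one example of a transform that needs to see data before it can be applied.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def stream_batches(n_batches=50, batch_size=256, n_features=4, seed=0):
    """Hypothetical stand-in for an external source that yields one batch at a time."""
    rng = np.random.default_rng(seed)
    for _ in range(n_batches):
        yield rng.normal(loc=5.0, scale=2.0, size=(batch_size, n_features))


def sample_rows(batches, n_rows, seed=0):
    """Reservoir-sample n_rows rows from the stream without holding all of it in memory."""
    rng = np.random.default_rng(seed)
    reservoir, seen = [], 0
    for batch in batches:
        for row in batch:
            seen += 1
            if len(reservoir) < n_rows:
                reservoir.append(row)
            else:
                # Keep the new row with probability n_rows / seen.
                j = rng.integers(0, seen)
                if j < n_rows:
                    reservoir[j] = row
    return np.asarray(reservoir)


def transformed_batches(batches, transformer):
    """Apply an already-fitted transformer to each batch as it arrives."""
    for batch in batches:
        yield transformer.transform(batch)


# Pass 1: fit the transform on the approximation dataset only.
scaler = StandardScaler().fit(sample_rows(stream_batches(), n_rows=1000))

# Pass 2: stream the data again and transform it batch by batch.
first = next(transformed_batches(stream_batches(), scaler))
```

In a real setup the second pass would presumably live inside something like a `torch.utils.data.IterableDataset`, so that the `DataLoader` pulls transformed batches directly. How representative the fitted statistics are depends entirely on the sample size and on how the source is sampled.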
There are more comprehensive options, such as letting the user provide already fitted data transformation objects. However, I am not sure how well this fits with the project's philosophy of least friction.
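For the pre-fitted option, one low-friction shape might be an optional argument that falls back to fitting on a sample when nothing is supplied. The sketch below is only an illustration under that assumption: `resolve_transformer` and its parameters are hypothetical names, not part of this project, and `StandardScaler` again stands in for whatever transform is actually used.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import StandardScaler
from sklearn.utils.validation import check_is_fitted


def resolve_transformer(user_transformer=None, fit_sample=None):
    """Hypothetical helper: prefer a user-supplied fitted transformer, else fit on a sample."""
    if user_transformer is not None:
        # Raises NotFittedError if the object was never fitted.
        check_is_fitted(user_transformer)
        return user_transformer
    if fit_sample is None:
        raise ValueError("Provide either a fitted transformer or a sample to fit on.")
    return StandardScaler().fit(fit_sample)


# A user who already has fitted statistics passes the object straight through.
prefit = StandardScaler().fit(np.array([[0.0], [2.0]]))
transformer = resolve_transformer(user_transformer=prefit)
```

Users who do not care keep the default behaviour, and users with very large data can fit wherever they like and hand the result over.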
Is your feature request related to a problem? Please describe.
When the data is large, we often need to work with larger-than-RAM datasets. Also, using an engine like Polars will speed things up a lot.
Describe the solution you'd like
Rewrite the Datamodule to be more performant. Out-of-core processing with Spark DataFrames or Polars, combined with NVTabular, might be a good solution.
Describe alternatives you've considered
Currently it's impossible to load larger-than-memory datasets.