manujosephv / pytorch_tabular

A standard framework for modelling Deep Learning Models for tabular data
https://pytorch-tabular.readthedocs.io/
MIT License

Working with huge datasets #456

Open santurini opened 5 months ago

santurini commented 5 months ago

Hello, this is more of an advice request than a feature request.

I have to train a tabular model on a huge dataset (more than 10 million rows) and I am not able to fit it entirely into memory as a DataFrame. I would like to use the entire dataset for train/val/test without falling back to a subset, and I wanted to know how you would suggest operating in this case.

An alternative I've considered is a custom dataloader that loads into memory only the requested batch given a list of ids, but I don't know where to start or what I should actually modify or implement.

Some help would really be appreciated, thank you!

manujosephv commented 5 months ago

In the current way of doing things, this isn't supported.

There are a few things which stop this use case:

  1. Currently the data processing happens with pandas and numpy, which means the categorical encoding, normalization, etc. all take place in memory.
  2. Even if we managed to do all that processing lazily and turned off PyTorch Tabular's own handling of these items, the dataset and dataloader would still need to change.

For this to be done, we first need to move the data processing to something like polars or spark, which support out-of-memory processing, and then change the dataloaders to load only the batch we need into memory.

Different options available are:

  1. Adopt nvtabular from Merlin (NVIDIA) as the core data processing unit. (Pros: easy-to-use framework, built-in GPU data processing, ready-to-use dataloaders, etc. Cons: the documentation isn't that great and the library is not very mature.)
  2. Adopt polars as the data processing library. (Pros: supports out-of-memory processing, uses all cores, blazing fast, etc. Cons: will struggle with truly huge data, 100M rows and upwards.) A rough sketch of this option is shown after this list.
  3. Adopt spark. (Pros: truly distributed. Cons: so much overhead that it is overkill for smaller use cases.)
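
As an illustration of option 2, out-of-memory preprocessing with polars could look roughly like the sketch below. This is not part of PyTorch Tabular today: the file paths and column names (`num_a`, `num_b`, `cat_a`) are made up, and depending on the polars version some expressions may force the non-streaming engine.

```python
import polars as pl

# Scan lazily: nothing is read into memory yet
lazy_df = pl.scan_parquet("data/raw/*.parquet")

# Express the preprocessing lazily: z-score normalization for numeric
# columns and a simple integer encoding for a categorical column.
processed = lazy_df.with_columns(
    ((pl.col("num_a") - pl.col("num_a").mean()) / pl.col("num_a").std()).alias("num_a"),
    ((pl.col("num_b") - pl.col("num_b").mean()) / pl.col("num_b").std()).alias("num_b"),
    # physical representation of a Categorical column = integer codes
    pl.col("cat_a").cast(pl.Categorical).to_physical().alias("cat_a"),
)

# Execute with the streaming engine and write back to disk, so the full
# dataset never has to sit in memory at once.
processed.sink_parquet("data/processed.parquet")
```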

Once the TabularDataModule is re-written, making a dataloader which loads lazily is a simple task.
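
For example, a lazy dataloader over such a pre-processed parquet file could be sketched like this. Again, this is only a hypothetical illustration, not the current PyTorch Tabular dataloader; the column names and batch dict keys are assumptions.

```python
import pyarrow.parquet as pq
import torch
from torch.utils.data import DataLoader, IterableDataset


class LazyParquetDataset(IterableDataset):
    """Streams batches from a parquet file without loading it all into memory."""

    def __init__(self, path, continuous_cols, categorical_cols, target_col, batch_size=1024):
        self.path = path
        self.continuous_cols = continuous_cols
        self.categorical_cols = categorical_cols
        self.target_col = target_col
        self.batch_size = batch_size

    def __iter__(self):
        pf = pq.ParquetFile(self.path)
        # iter_batches streams record batches, so only one batch is in memory at a time
        for batch in pf.iter_batches(batch_size=self.batch_size):
            cols = batch.to_pydict()
            yield {
                "continuous": torch.tensor(
                    [cols[c] for c in self.continuous_cols], dtype=torch.float32
                ).T,
                "categorical": torch.tensor(
                    [cols[c] for c in self.categorical_cols], dtype=torch.long
                ).T,
                "target": torch.tensor(cols[self.target_col], dtype=torch.float32),
            }


# batch_size=None because the dataset already yields whole batches
loader = DataLoader(
    LazyParquetDataset("data/processed.parquet", ["num_a", "num_b"], ["cat_a"], "target"),
    batch_size=None,
)
```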

I personally don't have enough time on my hands to take on such a large undertaking, but I would gladly guide anyone who wants to pick it up.

santurini commented 5 months ago

Hello, I have thought about it these past few days. Is there a possibility to resume training from a checkpoint? That way I could split the dataset into smaller chunks and resume training whenever I load a different chunk.

Sorry to bother you again!

manujosephv commented 5 months ago

Of course. There are save_model and load_model if you want to save the entire model, including the datamodule etc., and there are save_weights and load_weights if you just want to save checkpoints. The documentation has more details.
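
As an illustration, the chunk-wise idea with the methods mentioned above could look roughly like the following. The configs and the `chunk_0_df` / `chunk_1_df` / `val_df` DataFrames are assumed to exist already, and whether a second fit call continues from the restored weights may depend on the version, so please double-check against the documentation.

```python
from pytorch_tabular import TabularModel

# First chunk: fit as usual, then persist everything (model + datamodule + config)
tabular_model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
)
tabular_model.fit(train=chunk_0_df, validation=val_df)
tabular_model.save_model("checkpoints/chunk_0")

# Later chunks: restore the saved model and continue training on new data
tabular_model = TabularModel.load_model("checkpoints/chunk_0")
tabular_model.fit(train=chunk_1_df, validation=val_df)
```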

santurini commented 5 months ago

I read the documentation, but I am having some trouble loading only the model and predicting on a new row: I get different results because I do not know how to correctly process the data before passing it to the model.
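
A hedged sketch of one way around this, assuming the model was saved with save_model so that the fitted datamodule (with its encoders and scalers) is restored alongside the weights; `new_row_df` is a hypothetical one-row DataFrame with the raw, unprocessed columns:

```python
from pytorch_tabular import TabularModel

# Restore model + datamodule together, rather than only the weights
restored = TabularModel.load_model("checkpoints/chunk_0")

# predict applies the same preprocessing that was fitted during training
preds = restored.predict(new_row_df)
```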