manujosephv / pytorch_tabular

A standard framework for modelling Deep Learning Models for tabular data
https://pytorch-tabular.readthedocs.io/
MIT License

Working with huge datasets #456

Open santurini opened 3 weeks ago

santurini commented 3 weeks ago

Hello, this is more of an advice request than a feature request.

I have to train a tabular model on a huge dataset (more than 10 million rows) and I cannot fit it entirely into memory as a DataFrame. I would like to use the entire dataset for train/val/test rather than a subset, and I wanted to know how you would suggest proceeding in this case.

An alternative I've considered is a custom dataloader that loads into memory only the requested batch, given a list of ids, but I don't know where to start or what I should actually modify or implement.
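To make the idea concrete, here is a minimal sketch of such a dataset, assuming the table has been exported to a Parquet file and is read with pyarrow (the file path and column names are just placeholders):

```python
import pyarrow.parquet as pq
import torch
from torch.utils.data import DataLoader, Dataset


class LazyParquetDataset(Dataset):
    """Loads one Parquet row group at a time instead of the whole table."""

    def __init__(self, path, feature_cols, target_col):
        self.file = pq.ParquetFile(path)
        self.feature_cols = feature_cols
        self.target_col = target_col
        # Cumulative row counts map a flat index to (row group, offset).
        self.offsets = []
        total = 0
        for i in range(self.file.num_row_groups):
            total += self.file.metadata.row_group(i).num_rows
            self.offsets.append(total)

    def __len__(self):
        return self.offsets[-1]

    def __getitem__(self, idx):
        # Locate the row group that contains idx and read only that group.
        group = next(g for g, end in enumerate(self.offsets) if idx < end)
        start = self.offsets[group - 1] if group > 0 else 0
        table = self.file.read_row_group(
            group, columns=self.feature_cols + [self.target_col]
        )
        row = table.slice(idx - start, 1).to_pydict()
        x = torch.tensor([row[c][0] for c in self.feature_cols], dtype=torch.float32)
        y = torch.tensor(row[self.target_col][0], dtype=torch.float32)
        return x, y


# loader = DataLoader(
#     LazyParquetDataset("train.parquet", ["feat_1", "feat_2"], "target"),
#     batch_size=1024,
#     num_workers=4,
# )
```

In practice one would cache the current row group (or use an IterableDataset that yields whole batches), but this is the general shape of loading only what the current batch needs.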

Some help would really be appreciated, thank you!

manujosephv commented 3 weeks ago

In the current way of doing things, this isn't supported.

There are a few things that stop this use case:

  1. Currently the data processing happens with pandas and numpy, which means the categorical encoding, normalization, etc. all take place in memory.
  2. Even if we managed to do all that processing lazily and turned off PyTorch Tabular's own handling of these items, the dataset and dataloader would still need to change.

For this to be done, we first need to move data processing to something like polars or spark, which support out-of-memory processing, and then change the dataloaders to load only the batch we need into memory.

The options available are:

  1. Adopt NVTabular from Merlin (NVIDIA) as the core data processing unit. (Pros: easy-to-use framework, built-in GPU data processing, ready-to-use dataloaders, etc. Cons: the documentation isn't that great and the library is not very mature.)
  2. Adopt polars as the data processing library; a rough sketch follows this list. (Pros: supports out-of-memory processing, uses all cores, blazing fast, etc. Cons: will be difficult to work with on truly huge data, 100M rows and upwards.)
  3. Adopt spark. (Pros: truly distributed. Cons: so much overhead that it is overkill for smaller use cases.)
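To illustrate what option 2 could look like, here is a rough sketch of doing the preprocessing with polars' lazy API instead of pandas (the file names, column names, and the specific transformation are only placeholders):

```python
import polars as pl

# Scan the file lazily; nothing is loaded into memory yet.
lazy_df = pl.scan_parquet("train.parquet")

# Fit the normalization statistics with a lazy aggregation
# (the analogue of what the pandas-based preprocessing does eagerly).
stats = lazy_df.select(
    pl.col("amount").mean().alias("mean"),
    pl.col("amount").std().alias("std"),
).collect()
mean, std = stats["mean"][0], stats["std"][0]

# Apply the element-wise transformation lazily and stream the result to disk
# instead of materializing the full processed table in memory.
processed = lazy_df.with_columns(
    ((pl.col("amount") - mean) / std).alias("amount_norm"),
)
processed.sink_parquet("train_processed.parquet")
```

Categorical encoding could be handled the same way (fit the mapping with a lazy aggregation, then apply it element-wise), and the resulting file could then feed a lazy dataloader like the one sketched above.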

Once the TabularDataModule is rewritten, making a dataloader that loads lazily is a simple task.

Personally, I don't have enough time on my hands to take on such a large undertaking, but I would gladly guide anyone who wants to pick it up.

santurini commented 2 weeks ago

Hello, I've thought about it these past few days. Is there a possibility to resume training from a checkpoint? That way I could split the dataset into smaller chunks and resume the training each time I load a different chunk.

Sorry to bother you again!

manujosephv commented 2 weeks ago

Of course. There are save_model and load_model if you want to save the entire model, including the datamodule etc., and there are save_weights and load_weights if you just want to save checkpoints. The documentation has more details.
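For the chunked-training idea above, a rough sketch of the loop (the chunk files and config objects are placeholders, and whether a repeated fit() call keeps previously loaded weights should be verified against the documentation for the installed version):

```python
import pandas as pd
from pytorch_tabular import TabularModel

# data_config, model_config, optimizer_config, trainer_config are assumed to be
# set up exactly as for a normal in-memory run.
tabular_model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
)

chunk_paths = ["chunk_0.parquet", "chunk_1.parquet", "chunk_2.parquet"]  # hypothetical chunks

for i, path in enumerate(chunk_paths):
    chunk = pd.read_parquet(path)  # only the current chunk lives in memory
    if i > 0:
        # Restore the weights checkpointed after the previous chunk.
        tabular_model.load_weights("checkpoint.pt")
    tabular_model.fit(train=chunk)
    tabular_model.save_weights("checkpoint.pt")
```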

santurini commented 2 weeks ago

I read the documentation, but I have some trouble loading only the model and predicting on a new row: I get different results because I don't know how to correctly process the data that is passed to the model.
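For reference, the save_model / load_model route mentioned above also stores the datamodule, so a reloaded model should apply the same preprocessing to new rows; a minimal sketch (the directory name and the new-rows dataframe are placeholders):

```python
from pytorch_tabular import TabularModel

# After training: save the model together with its datamodule
# (the fitted categorical encoders, normalizers, etc.).
tabular_model.save_model("saved_tabular_model")

# Later / in another process: reload everything and predict on raw rows.
# The restored datamodule applies the training-time preprocessing before inference.
loaded_model = TabularModel.load_model("saved_tabular_model")
preds = loaded_model.predict(new_rows_df)  # new_rows_df: raw, unprocessed feature columns
```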