Open santurini opened 5 months ago
With the way things currently work, this isn't supported.
A few things stand in the way of this use case.
For this to work, we first need to move data processing to something like polars or spark, which support out-of-core (larger-than-memory) processing, and then change the dataloaders to load only the batch we need into memory.
Different options available are:
nvtabular from Merlin (NVIDIA) as the core data processing unit. (Pros: easy-to-use framework, built-in GPU data processing, ready-to-use dataloaders, etc. Cons: the documentation isn't that great and the library is not very mature.)
Once the TabularDataModule is rewritten, making a dataloader that loads lazily is a simple task.
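To make the lazy-loading idea concrete, here is a minimal sketch using only the Python standard library. It is not the TabularDataModule API or an NVTabular dataloader; `iter_batches` and the demo file are hypothetical names chosen for illustration, and the point is only that one batch of rows is held in memory at a time.

```python
import csv
import itertools
import os
import tempfile
from typing import Dict, Iterator, List


def iter_batches(path: str, batch_size: int) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of rows from a CSV file, one batch at a time."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)  # rows are read lazily from the file handle
        while True:
            batch = list(itertools.islice(reader, batch_size))
            if not batch:
                return
            yield batch


# Small demo file standing in for a dataset too large to load at once.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(demo_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "feature", "target"])
    for i in range(10):
        writer.writerow([i, i * 0.5, i % 2])

batch_sizes = [len(batch) for batch in iter_batches(demo_path, batch_size=4)]
print(batch_sizes)  # [4, 4, 2]
```

A real implementation would also need to convert the string values to tensors and apply the same preprocessing to every batch, which is where most of the rewrite effort would go.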
Personally, I don't have enough time on my hands to take on such a large undertaking, but I would gladly guide anyone who wants to pick it up.
Hello, I've been thinking about this over the past few days. Is it possible to resume training from a checkpoint? That way I could split the dataset into smaller chunks and resume training each time I load a different chunk.
Sorry to bother you again.
Of course. There are save_model and load_model if you want to save the entire model, including datamodules etc., and there are save_weights and load_weights if you just want to save checkpoints. The documentation has more details.
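The chunk-and-resume workflow discussed here can be sketched roughly as follows. This is a toy illustration in plain Python, not the library's API: `load_checkpoint`, `save_checkpoint` and `train_on_chunk` are hypothetical stand-ins for the real save/load calls and training step, and the "model" is a one-variable linear fit so the example stays self-contained.

```python
import json
import os
import tempfile
from typing import Dict, List, Tuple

Chunk = List[Tuple[float, float]]  # (x, y) pairs for the toy model


def load_checkpoint(path: str) -> Dict[str, float]:
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"w": 0.0, "b": 0.0}


def save_checkpoint(state: Dict[str, float], path: str) -> None:
    """Persist the model state so a later session can resume from it."""
    with open(path, "w") as f:
        json.dump(state, f)


def train_on_chunk(state: Dict[str, float], chunk: Chunk, lr: float = 0.05) -> Dict[str, float]:
    """One SGD pass over a chunk for the toy model y = w * x + b."""
    w, b = state["w"], state["b"]
    for x, y in chunk:
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err
    return {"w": w, "b": b}


# Toy data following y = 2x + 1, split into three chunks.
data = [(i / 10, 2 * (i / 10) + 1) for i in range(30)]
chunks = [data[0:10], data[10:20], data[20:30]]
ckpt_path = os.path.join(tempfile.mkdtemp(), "state.json")

for epoch in range(50):
    for chunk in chunks:
        state = load_checkpoint(ckpt_path)    # resume from the last saved state
        state = train_on_chunk(state, chunk)  # only this chunk is in memory
        save_checkpoint(state, ckpt_path)

final_state = load_checkpoint(ckpt_path)
print(final_state)
```

One caveat with this pattern: training on one chunk at a time means each pass sees the data in chunk order, so it helps to shuffle rows across chunks beforehand and to cycle through all chunks several times rather than training to convergence on each one.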
I read the documentation, but I'm having trouble loading only the model and predicting on a new row: I get different results because I don't know how to correctly process the data before passing it to the model.
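A common cause of this kind of mismatch is that the preprocessing statistics fitted on the training data (scaling parameters, category encodings) are not reused at prediction time. The sketch below illustrates the general idea in plain Python; `fit_preprocessor` and `transform_row` are hypothetical helpers, not part of the library, and the actual transformations the model expects may differ.

```python
import json
import os
import statistics
import tempfile
from typing import Any, Dict, List


def fit_preprocessor(rows: List[Dict[str, Any]], numeric_cols: List[str],
                     categorical_cols: List[str]) -> Dict[str, Any]:
    """Compute scaling stats and category codes from the training rows only."""
    params: Dict[str, Any] = {"numeric": {}, "categorical": {}}
    for col in numeric_cols:
        values = [row[col] for row in rows]
        std = statistics.pstdev(values) or 1.0  # avoid dividing by zero
        params["numeric"][col] = {"mean": statistics.fmean(values), "std": std}
    for col in categorical_cols:
        categories = sorted({row[col] for row in rows})
        # Code 0 is reserved for categories never seen during training.
        params["categorical"][col] = {cat: i + 1 for i, cat in enumerate(categories)}
    return params


def transform_row(row: Dict[str, Any], params: Dict[str, Any]) -> Dict[str, Any]:
    """Apply the saved training-time parameters to a single new row."""
    out: Dict[str, Any] = {}
    for col, stats in params["numeric"].items():
        out[col] = (row[col] - stats["mean"]) / stats["std"]
    for col, mapping in params["categorical"].items():
        out[col] = mapping.get(row[col], 0)
    return out


train_rows = [
    {"age": 20, "city": "rome"},
    {"age": 40, "city": "milan"},
    {"age": 30, "city": "rome"},
]
params = fit_preprocessor(train_rows, ["age"], ["city"])

# Save the parameters next to the model weights, then reload them later.
params_path = os.path.join(tempfile.mkdtemp(), "preprocessor.json")
with open(params_path, "w") as f:
    json.dump(params, f)
with open(params_path) as f:
    loaded_params = json.load(f)

new_row = {"age": 30, "city": "turin"}
transformed = transform_row(new_row, loaded_params)
print(transformed)  # {'age': 0.0, 'city': 0}
```

Since the reply above says that save_model stores the entire model including datamodules, loading the full model rather than only the weights may avoid having to reproduce this step by hand.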
Hello, this is more a request for advice than a feature request.
I have to train a tabular model on a huge dataset (more than 10 million rows), and I can't fit it entirely into memory as a DataFrame. I would like to use the entire dataset for train/test/val rather than a subset, and I wanted to know how you would suggest handling this case.
An alternative I've considered is a custom dataloader that loads into memory only the requested batch, given a list of ids, but I don't know where to start or what I should actually modify or implement.
Some help would be really appreciated, thank you!
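As a starting point for the by-id idea, here is a minimal sketch that indexes the byte offset of each row once and then seeks directly to the requested rows. It uses only the standard library and assumes a simple CSV with no quoted commas or embedded newlines; `build_offset_index` and `load_batch` are hypothetical names, not part of any existing dataloader.

```python
import os
import tempfile
from typing import Dict, List, Tuple


def build_offset_index(path: str) -> Tuple[List[str], List[int]]:
    """Scan the file once and record the byte offset where each data row starts."""
    offsets: List[int] = []
    with open(path, "rb") as f:
        header = f.readline().decode("utf-8").rstrip("\r\n").split(",")
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break
            offsets.append(pos)
    return header, offsets


def load_batch(path: str, header: List[str], offsets: List[int],
               ids: List[int]) -> List[Dict[str, str]]:
    """Read only the rows whose positions are listed in ids."""
    rows: List[Dict[str, str]] = []
    with open(path, "rb") as f:
        for row_id in ids:
            f.seek(offsets[row_id])  # jump straight to the requested row
            values = f.readline().decode("utf-8").rstrip("\r\n").split(",")
            rows.append(dict(zip(header, values)))
    return rows


# Small demo file standing in for the full dataset on disk.
data_path = os.path.join(tempfile.mkdtemp(), "data.csv")
with open(data_path, "w", newline="") as f:
    f.write("id,feature\n")
    for i in range(10):
        f.write(f"{i},{i * i}\n")

header, offsets = build_offset_index(data_path)
batch = load_batch(data_path, header, offsets, ids=[7, 2, 5])
print(batch)
```

The index costs one integer per row, which stays small even for 10 million rows, and the ids can be shuffled freely each epoch. A columnar format with random access would likely be faster than CSV for this, but the structure of the loader would be the same.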