google / yggdrasil-decision-forests

A library to train, evaluate, interpret, and productionize decision forest models such as Random Forest and Gradient Boosted Decision Trees.
https://ydf.readthedocs.io/
Apache License 2.0

[Question] Training data too big to fit in memory #108

Open chatnord opened 1 week ago

chatnord commented 1 week ago

Hi,

thanks for your work, guys! I am trying to explore using your implementation for our use case, but I am a bit stuck on how to deal with cases where the training set is too big to fit in memory. With plain TensorFlow, we usually use a `from_generator` implementation and only read one batch at a time. I was reading through your documentation today, and I am not sure how to proceed here.

Can someone please point me to any relevant information?

Thanks a lot!

achoum commented 1 week ago

Hi Chatnord,

If the training dataset is too large to fit in memory, there are essentially two families of solutions.

1. The first solution is to approximate the training.

The simplest solution is to train on less data. Training on less data can lead to worse models, but this is not always the case and can be evaluated empirically. For example, suppose your dataset contains 1B examples but you can only fit 10M examples in memory: try training one model with 9M examples and another with 10M examples. If both models perform the same, this is a good indication that having more data won't significantly improve the model quality.

Another solution is to train multiple models on different subsets of the data and then ensemble them. This is generally easy to do by hand. The subsets can be sampled over the examples (e.g., each model sees a random sample of examples) or over the features (each model sees all the examples but only a subset of features), as sketched below.
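
For illustration, here is a minimal sketch of both ideas with the Python API, assuming the dataset is already split into several CSV shards (the file names, label column, and shard counts are placeholders):

```python
import glob

import numpy as np
import pandas as pd
import ydf  # pip install ydf

LABEL = "my_label"  # placeholder label column
shards = sorted(glob.glob("dataset/train-*.csv"))  # placeholder shard pattern

# Option A: train a single model on only as many shards as fit in memory.
subset = pd.concat(pd.read_csv(path) for path in shards[:10])
model = ydf.GradientBoostedTreesLearner(label=LABEL).train(subset)

# Option B: train several models on disjoint subsets and ensemble them.
models = []
for chunk in np.array_split(shards, 5):
    part = pd.concat(pd.read_csv(path) for path in chunk)
    models.append(ydf.GradientBoostedTreesLearner(label=LABEL).train(part))

def ensemble_predict(examples: pd.DataFrame) -> np.ndarray:
    """Averages the predictions of the individual models."""
    return np.mean([m.predict(examples) for m in models], axis=0)
```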

2. If you need to train on more data, you can use distributed training: https://ydf.readthedocs.io/en/latest/tutorial/distributed_training/

We should publish an example of YDF distributed training on Google Cloud soon.

YDF distributed training distributes both the data and the computation. So, if one machine can store 10M examples, you can train on 100M examples using 10 machines at the same speed, and on 1B examples using 100 machines.
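
Roughly, a distributed training setup looks like the sketch below; treat the tutorial linked above as the reference for the exact class names, arguments, and worker setup. The addresses, paths, and label name here are placeholders.

```python
import ydf

# On each worker machine, start a YDF training worker, for example:
#   python -c "import ydf; ydf.start_worker(2001)"
# (see the tutorial above for the exact worker setup)

# Addresses, paths, and the label name below are placeholders.
learner = ydf.DistributedGradientBoostedTreesLearner(
    label="my_label",
    working_dir="/shared/work_dir",  # must be accessible by all workers
    workers=["10.0.0.1:2001", "10.0.0.2:2001"],
)

# Train from files on disk; the dataset never has to fit on a single machine.
model = learner.train("csv:/shared/dataset/train-*.csv")
```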

3. Currently, the most memory-efficient way to feed examples for in-process training (i.e., non-distributed training) is to use NumPy arrays. Using a TensorFlow data generator works, but it is not as efficient.
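
For example, a minimal in-process training call feeding a dictionary of NumPy arrays (the feature and label columns below are made up):

```python
import numpy as np
import ydf

# Made-up columns; replace with your own features and label.
n = 1_000_000
train_ds = {
    "f1": np.random.rand(n).astype(np.float32),
    "f2": np.random.randint(0, 10, size=n),
    "label": np.random.choice(["yes", "no"], size=n),
}

model = ydf.GradientBoostedTreesLearner(label="label").train(train_ds)
predictions = model.predict(train_ds)
```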

How large is your dataset (number of examples and number of features)?

Hope this helps. M.

chatnord commented 1 week ago

Hey M, thanks for answering (I think both here and on the forum :)), you have been of HUGE help. Ultimately, I understand that you cannot use TFRecords or any other way to stream the input dataset (makes sense), so I think I will try solution #1 in the meantime. Thanks a lot!

TonyCongqianWang commented 5 days ago

I also have a problem with memory, but my data does fit into memory. I supply both a training and a validation set (around 40 GB in total). When I run the training, the process uses way too much memory (400 GB) until it gets killed. Initially I ran more than one training in parallel, but I reduced it to one after the process got killed the first time.