sb-ai-lab / Py-Boost

Python based GBDT implementation on GPU. Efficient multioutput (multiclass/multilabel/multitask) training
Apache License 2.0
157 stars · 13 forks

Unexpected out of memory error #8

Open adkinsty opened 1 year ago

adkinsty commented 1 year ago

I am getting an out of memory error when trying to train a sketchboost model on a somewhat large dataset.

The shape of the training set is (1348244, 320) and the shape of the test set is (674022, 320).

The total data size is < 3 GB and I have 24 GB of GPU memory, yet when training starts, cupy tries to allocate > 20 GB of memory and runs out.

I am using cupy-cuda11x 11.6.0 and py-boost 0.4.1.

I am using the following model configuration:

model = SketchBoost(
    loss='crossentropy',
    verbose=1,
    ntrees=10000,
    es=100,
    lr=0.13,
    max_depth=10,
    min_gain_to_split=0.21
)

Here's the error I'm getting:

... line 110, in training
    model.fit(x_train, y_train, eval_sets=[{'X': x_test, 'y': y_test}])
  File "/opt/conda/lib/python3.8/site-packages/py_boost/gpu/boosting.py", line 260, in fit
    builder, build_info = self._create_build_info(mempool, X, X_enc, y, sample_weight,
  File "/opt/conda/lib/python3.8/site-packages/py_boost/gpu/boosting.py", line 322, in _create_build_info
    val_ens = [cp.empty((x.shape[0], self.base_score.shape[0]), order='C') for x in y_val]
  File "/opt/conda/lib/python3.8/site-packages/py_boost/gpu/boosting.py", line 322, in <listcomp>
    val_ens = [cp.empty((x.shape[0], self.base_score.shape[0]), order='C') for x in y_val]
  File "/opt/conda/lib/python3.8/site-packages/cupy/_creation/basic.py", line 22, in empty
    return cupy.ndarray(shape, dtype, order=order)
  File "cupy/_core/core.pyx", line 136, in cupy._core.core.ndarray.__new__
  File "cupy/_core/core.pyx", line 224, in cupy._core.core._ndarray_base._init
  File "cupy/cuda/memory.pyx", line 742, in cupy.cuda.memory.alloc
  File "cupy/cuda/memory.pyx", line 1419, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1440, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1120, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1141, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
  File "cupy/cuda/memory.pyx", line 1379, in cupy.cuda.memory.SingleDeviceMemoryPool._try_malloc
cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 21,191,251,968 bytes (allocated so far: 21,838,152,704 bytes).

Do you know why this might be happening or what I could do to fix this? For example, could I limit the amount of memory that cupy tries to allocate? Or could I use multiple GPUs for training?

btbpanda commented 1 year ago

@adkinsty From what you have provided, it is not obvious why the OOM occurs. I assume you are training a multiclass task, since the loss is 'crossentropy'. When you fit a GBDT, the feature matrix is quantized and stored as a uint8 array, so memory usage is reduced compared with the initial CPU arrays; your train and valid matrices together should allocate less than 1 GB of GPU memory. Here are some thoughts about what may have happened:

1) The problem is about the targets. Py-Boost will allocate on the GPU 4 fp32 arrays of size n_samples * n_classes, so if your n_classes is extremely large, OOM is possible. Could you provide how many classes are in the target array, and what its shape is?

2) n_classes is low, but the class labels are incorrect. Classes in the target array should be labeled with ints from 0 to n_class - 1. So, for example, if your 3 classes are labeled [0, 100, 10000], Py-Boost decides that you have 10001 classes.
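For reference, a quick way to check and fix this with plain numpy (the toy target here is just for illustration) - np.unique with return_inverse=True re-encodes labels to a contiguous 0..n_class-1 range:

```python
import numpy as np

# hypothetical target with non-contiguous labels
y = np.array([0, 100, 10000, 100, 0])

# return_inverse=True gives, for each element, its index in the sorted
# unique array - i.e. a contiguous 0..n_class-1 encoding
classes, y_enc = np.unique(y, return_inverse=True)
print(len(classes))  # 3 real classes: 0, 100, 10000
print(y_enc)         # [0 1 2 1 0] -> safe to pass to Py-Boost
```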

3) Some of your GPU memory is already in use by another application. Please shut down the session and check the nvidia-smi output.

In general, here is a rule to estimate how much GPU memory you need to train: approximately n_samples * n_features bytes + n_samples * n_outputs * 16 bytes + eps. Eps depends on the setup, but it is probably about 1-2 GB. So, if according to this rule you should fit, we will investigate what goes wrong. Otherwise, your only option to train is to perform a downsampling.
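That rule can be written down as a quick estimator (a sketch; the function name and the 2 GB default for eps are just illustrative):

```python
def estimate_gpu_bytes(n_samples, n_features, n_outputs, eps_bytes=2 * 1024**3):
    """Rough estimate from the rule above: 1 byte per quantized feature
    value, plus 4 fp32 work arrays per output (4 * 4 = 16 bytes), plus
    a fixed overhead term eps."""
    return n_samples * n_features + n_samples * n_outputs * 16 + eps_bytes

# numbers from this issue: ~2e6 rows, 320 features, ~5000 classes
print(estimate_gpu_bytes(2_000_000, 320, 5000) / 1024**3)  # well over 100 GB
```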

Any kind of distributed training, such as multi-GPU, is unfortunately unavailable now. We plan to add this feature, of course, but do not expect it to be released very soon.

adkinsty commented 1 year ago

Thank you for your detailed and helpful response.

This is a multi-class classification problem and the number of classes is extremely large (~5k). The target array is 1D with values ranging from 0 to n_classes-1.

The number of features is about 315 and the total number of samples is around 2e6. According to your formula, I think I would need > 100 GB of memory (assuming n_classes == n_outputs). So it would seem that I need to downsample the training set if I use SketchBoost.

btbpanda commented 1 year ago

@adkinsty There are some options to train in your case: splitting this big model into smaller ones, or truncating the number of classes.

Once I trained a multitask regression with 20k outputs. I split the big model into 10 models with 2k outputs each, and that worked reasonably well. But the better trick in that case was to reduce the output dimensionality to about 500 via TruncatedSVD before training, then train SketchBoost, and then apply the inverse transform to the predictions. I believe this trick is very task specific, though, while splitting the model is the more universal trick.

Your case is more complex because of multiclass - the outputs are not independent. So here is what I think you should try:

1) Check the target distribution. I believe you will find a lot of classes with just 5-10 examples or even fewer. If so, that is too small an amount for building the model, and the rare classes could be joined together into a single class - let's call it Other.
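That regrouping is easy to sketch with numpy (the min_count threshold and the helper name here are arbitrary choices for illustration):

```python
import numpy as np

def merge_rare_classes(y, min_count=10):
    """Relabel classes that have fewer than min_count examples into a single
    'Other' class, then re-encode all labels to a contiguous 0..K-1 range."""
    classes, counts = np.unique(y, return_counts=True)
    rare = classes[counts < min_count]
    other = classes.max() + 1                  # temporary label for 'Other'
    y_merged = np.where(np.isin(y, rare), other, y)
    _, y_enc = np.unique(y_merged, return_inverse=True)
    return y_enc
```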

2) If that is not an option for you, you can make the wrong but possibly useful assumption that the outputs are independent: transform your problem from multiclass to multilabel by creating a one-hot encoded target matrix and switching the loss to BCE, then split the outputs randomly into partitions that fit into GPU memory (I think 5-6 partitions will fit one by one) and train multiple models. After prediction you will just need to postprocess the outputs to sum to 1 - maybe just divide the values by the row sum.
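A minimal sketch of that multiclass-to-multilabel transform (the helper name, seed, and partition count are illustrative; the actual Py-Boost training calls are left out):

```python
import numpy as np

def one_hot_partitions(y, n_parts, seed=42):
    """One-hot encode multiclass labels, then split the label columns into
    n_parts random groups, each small enough to train as a multilabel model."""
    n_classes = int(y.max()) + 1
    onehot = np.zeros((y.shape[0], n_classes), dtype=np.float32)
    onehot[np.arange(y.shape[0]), y] = 1.0
    cols = np.random.default_rng(seed).permutation(n_classes)
    # each element is (column indices, target block) for one model
    return [(idx, onehot[:, idx]) for idx in np.array_split(cols, n_parts)]
```

Each target block would be trained as a separate BCE model; at inference, the per-model predictions are concatenated back in column order and divided by their row sums so they again sum to 1.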

3) Sometimes classes have a hierarchy. If that is the case for you - great. You can train two levels of models: the first model predicts the upper-level class, and a set of second-level models refines it.

4) If nothing at all works for you and downsampling is your only choice, you can do ensembling: split your data into N partitions by rows, train independent multiclass models, and for inference just average the predictions - that is definitely better than just losing part of the data.
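Sketched in numpy (assuming each trained model exposes a scikit-style predict method returning class probabilities; the helper names are illustrative):

```python
import numpy as np

def row_partitions(n_samples, n_parts, seed=0):
    """Split row indices into n_parts disjoint random chunks, one per model."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), n_parts)

def ensemble_predict(models, X):
    """Average class-probability predictions from independently trained models."""
    return np.mean([m.predict(X) for m in models], axis=0)
```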

If you are going to try some of these approaches, I would appreciate feedback - what works in your case, and how it compares with other baselines and pure downsampling.