sb-ai-lab / Py-Boost

Python based GBDT implementation on GPU. Efficient multioutput (multiclass/multilabel/multitask) training
Apache License 2.0

How does your code perform on sparse data? #16

Open Sandy4321 opened 1 year ago

Sandy4321 commented 1 year ago

It seems that GBMs are bad on sparse data for classification. How does your code perform on sparse data? NLP one-hot data is very sparse, say 98% of the values are zeros.

btbpanda commented 1 year ago

Hi @Sandy4321

Thanks for your question. You are right, all GBMs are, by design, probably not the best choice for dealing with sparse data. Even though some SotA implementations such as LightGBM or XGBoost support the sparse format and implement specific features for this data type, performance may still be worse than neural networks or even linear models. But it actually depends on the task: each problem is individual, and only an experiment will show you what works best.

Unfortunately, py-boost has no built-in support for sparse arrays. To handle them, you should manually convert to a dense array. We have a plan to support sparsity for both features and targets, but don't expect it to be released soon. However, some optimizations can be made here. All of them save memory and help prevent overfitting, and sometimes that is enough to fit into GPU memory if the dataset is not too large (see the sketch after this list):

1) limit `max_bin` to 8-16 or even 4
2) limit `colsample` to 0.1-0.2
3) limit `max_depth` to 3-4
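A minimal sketch of that workflow, assuming the `GradientBoosting` class from `py_boost` accepts the `max_bin`, `colsample`, and `max_depth` parameters named above (the `"bce"` loss alias, `ntrees`, and the toy data are my own assumptions and may differ between versions):

```python
import numpy as np
from scipy import sparse
from py_boost import GradientBoosting  # assumes py-boost is installed

# Toy sparse matrix (~98% zeros) standing in for a BoW / one-hot feature set
rng = np.random.default_rng(0)
X_sparse = sparse.random(10_000, 500, density=0.02, format="csr", random_state=0)
y = rng.integers(0, 2, size=10_000)

# py-boost has no built-in sparse support, so densify manually first
X_dense = X_sparse.toarray().astype(np.float32)

# Conservative settings from the list above to save GPU memory and limit overfitting
model = GradientBoosting(
    "bce",           # binary classification loss (alias is an assumption)
    ntrees=1000,
    max_bin=16,      # 1) few histogram bins
    colsample=0.1,   # 2) small feature subsample per tree
    max_depth=4,     # 3) shallow trees
)
model.fit(X_dense, y)
```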

But in general, a common approach to training a GBM over a sparse representation is dimensionality reduction (via SVD, for example) before training, or using a representation other than BoW/tf-idf. I would typically expect better performance from both approaches regardless of the GBM implementation, especially for NLP tasks, where we have a lot of pretrained language models.
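As a rough illustration of the SVD route, here is a sketch using scikit-learn's `TfidfVectorizer` and `TruncatedSVD` (the texts and component count are placeholders; the resulting dense matrix can then be fed to py-boost or any other GBM):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

texts = ["example document one", "another example document", "sparse text data"]

# Sparse BoW/tf-idf representation (mostly zeros)
X_tfidf = TfidfVectorizer().fit_transform(texts)

# Compress to a small dense matrix before GBM training
svd = TruncatedSVD(n_components=2, random_state=0)
X_dense = svd.fit_transform(X_tfidf)  # shape: (n_docs, n_components)
```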

Sandy4321 commented 1 year ago

"You are right, all GBMs by design are probably not the best choice for dealing with the sparse data." can you share some link i glad you understand it, may support your opinion by evidence , since as rule people do not aware about such an issue

I cannot find any serious web link to persuade these people that GBM is bad on sparse data.