Please clarify whether categorical values can be used: the paper mentions only numeric features

Sandy4321 commented 1 year ago

From the paper (https://arxiv.org/pdf/2211.12858.pdf):

"... remove unnecessary for our purposes features presented in modern gradient boosting toolkits (for example, categorical data handling). The complete list of these limitations is the following. Py-Boost supports: (a) computations only on GPU, (b) only the depth-wise tree growth policy, (c) only numeric features (with possibly NaN values), and (d) only histogram algorithm for split search (maximum number of bins for each feature is limited to 256)"

btbpanda commented 1 year ago

Hi @Sandy4321

Thanks for your question. Yes, as mentioned in the paper, categorical data types are not supported, and we have no plans to implement them in the near future. Here are some thoughts on why we don't need that. Methods for handling categories can be split into two types:

1) Target-independent encoders. These methods don't use the target when encoding categories. Examples are Label Encoder, Frequency Encoder, One Hot Encoder, and embeddings provided by another model. All of them (except One Hot Encoder) are good candidates to try with GBMs. But it doesn't really matter whether you apply them manually before training or they are hidden inside the toolbox, so, given our limited resources, we left this work to the user and focused on more challenging tasks.

2) Target encoders. This type includes all the methods that encode categories with target or gradient statistics. Since py-boost is a toolbox mainly used for research on multioutput training, it is not obvious how to implement, for example, encoding a category with the average gradient value. Imagine you have a 100-class classification task: your encoder will produce 100 features for a single categorical feature. That is definitely inefficient and probably leads to overfitting. So first we need to find a way to do it, and that is an area for further research which is out of our scope for now.
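To make the feature blow-up in 2) concrete, here is a minimal sketch in plain Python. It is not py-boost code: the helper `multiclass_target_encode` and the toy data are hypothetical, and it uses per-class target frequencies as a stand-in for gradient statistics. It also encodes on the same rows it was fitted on, which leaks the target, so treat it as an illustration only.

```python
from collections import defaultdict

def multiclass_target_encode(categories, labels, n_classes):
    """Hypothetical helper: replace each category with the per-class
    frequency of the target among rows sharing that category."""
    counts = defaultdict(lambda: [0] * n_classes)  # category -> count per class
    totals = defaultdict(int)                      # category -> number of rows
    for cat, y in zip(categories, labels):
        counts[cat][y] += 1
        totals[cat] += 1
    table = {cat: [k / totals[cat] for k in counts[cat]] for cat in counts}
    # One output row per input row, n_classes numbers per row.
    return [table[cat] for cat in categories]

# Toy data: a single categorical column and a 3-class target.
cities = ["a", "b", "a", "c", "b", "a"]
labels = [0, 2, 1, 2, 2, 0]

encoded = multiclass_target_encode(cities, labels, n_classes=3)
print(len(encoded[0]))  # one categorical column became 3 numeric columns
```

With 100 classes the same single column would expand into 100 numeric columns, which is the inefficiency described above.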

To sum up: if you are fitting a simple binary classification or regression task, consider using a SotA implementation; for example, the CatBoost authors have done a lot of research on category encoding. If you are training a multioutput task, one of the methods mentioned in 1) together with py-boost is probably a better choice for you.
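As a rough illustration of the target-independent route from 1), the sketch below applies label and frequency encoding by hand in plain Python before training. The helper names (`fit_label_encoder`, `fit_frequency_encoder`, `transform`) and the toy column are made up for this example and are not part of py-boost. Mapping unseen categories to NaN is an assumption that relies on the paper's statement that numeric features may contain NaN values.

```python
from collections import Counter

def fit_label_encoder(values):
    """Map each distinct category to an integer code, in order of first appearance."""
    codes = {}
    for v in values:
        if v not in codes:
            codes[v] = float(len(codes))
    return codes

def fit_frequency_encoder(values):
    """Map each distinct category to its relative frequency in the training column."""
    counts = Counter(values)
    n = len(values)
    return {cat: cnt / n for cat, cnt in counts.items()}

def transform(values, mapping):
    """Apply a fitted mapping; categories not seen during fitting become NaN."""
    return [mapping.get(v, float("nan")) for v in values]

train_colors = ["red", "blue", "red", "green", "red", "blue"]

label_map = fit_label_encoder(train_colors)
freq_map = fit_frequency_encoder(train_colors)

label_col = transform(train_colors, label_map)  # numeric column, one value per row
freq_col = transform(train_colors, freq_map)    # numeric column, one value per row
unseen = transform(["purple"], label_map)       # category absent from training data
```

Each encoder yields a single numeric column regardless of the number of outputs, so the resulting columns can be placed in the numeric feature matrix alongside the other features.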

Sandy4321 commented 1 year ago

I see, thanks.

sb-ai-lab / Py-Boost
