scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License
2.39k stars 393 forks

catboost encoder get different result with catboost #436

Closed ccylance closed 4 months ago

ccylance commented 4 months ago


Hi, I noticed CatBoostEncoder supports time-aware encoding. However, when I tested it, CatBoostEncoder returned the same result for different rows of the same category, while catboost did not. For example, suppose we have a feature named color with the values blue, red, green and blue. CatBoostEncoder may return 0.2, 0.4, 0.4, 0.2: both blue rows get the same value. But catboost would return something like 0.2, 0.4, 0.4, 0.25, where the value for blue is affected by the row order.

bmreiniger commented 4 months ago

Can you provide code with your example?

Note that fit_transform should produce different values per category, whereas transform should not. (Edit: transform with y=None should not, but if y is provided, then it should behave the same as fit_transform.)

ccylance commented 4 months ago

Here is my code: [code posted as a screenshot]. As the code shows, different categories produce different results, but in CatBoost the same category at different positions also produces different results.

bmreiniger commented 4 months ago

Thanks for the example! (In the future, providing code as formatted text is more helpful: people can copy and paste to quickly retry what you're showing.)

This demonstrates what I was alluding to in my second paragraph: this is expected behavior. If you print the results of the penultimate line (fit_transform), you'll see different values within each category. You should get the same output if you change the last line to cbe_encoder.transform(df1['f1'], df1['label']). Using fit_transform, or transform with y specified, tells the package that you're transforming the training dataset, and so it takes the sliding transformation that CatBoost is known for. On the other hand, when you transform with y=None, the package takes that as meaning you're transforming the test set, and so fixed values per category are used (roughly, the mean target from the entire training set). See the NOTE at the end of the docstring.

ccylance commented 4 months ago

Very clear! Thanks for your reply!

ccylance commented 3 months ago

@bmreiniger I revisited the code and noticed that in the source code of CatBoost, the value is calculated as (countInClass + prior) / (counts + priorDenominator), while in category_encoders the formula is (countInClass - y + mean*a) / (counts + a). Why would there be a difference in this part?

bmreiniger commented 3 months ago

@bmreiniger I revisited the code and noticed that in the source code of CatBoost, the value is calculated as (countInClass + prior) / (counts + priorDenominator), while in category_encoders the formula is (countInClass - y + mean*a) / (counts + a). Why would there be a difference in this part?

I'm not entirely sure, in particular what the CatBoost source's definitions of those terms are. But they seem to be just two (probably different but similar) ways to regularize/smooth the raw mean-target-so-far. The mean here is the global mean, a sensible default for a prior. And note that the - y part is just to remove the row's own contribution from pandas's cumulative sum.
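Concretely, the training-time formula can be reproduced with plain pandas (a hand-rolled sketch of (cumsum - y + mean*a) / (cumcount + a) with a = 1; the color/label data is invented):

```python
import pandas as pd

X = pd.Series(["blue", "red", "green", "blue"], name="f1")
y = pd.Series([0, 1, 1, 1], name="label")

a = 1             # smoothing strength
prior = y.mean()  # global target mean, used as the prior (0.75 here)

grp = y.groupby(X)
cumsum = grp.cumsum()      # running target sum, *including* the current row
cumcount = grp.cumcount()  # number of *previous* rows in the same category

# Subtracting y removes the current row's own contribution from cumsum,
# so each row is encoded only from the rows that came before it.
encoded = (cumsum - y + prior * a) / (cumcount + a)
print(encoded.tolist())  # [0.75, 0.75, 0.75, 0.375]
```

Note how the first occurrence of each category falls back to the prior, and the second "blue" row is pulled toward the target seen for the first one.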

ccylance commented 3 months ago

@bmreiniger
When running a regression task, I observed that CatBoost buckets the labels, which category_encoders does not do here. Is that bucketing step necessary in this context?

bmreiniger commented 3 months ago

I don't see why it would be, but maybe bucketing (depending on how you assign the label then) acts as another source of regularization? Or maybe it's just faster? Can you link to their source that performs bucketing? (Maybe better to ask over there.)