scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License

Catboost fit_transform method is broken. #351

Closed PraveshKoirala closed 1 year ago

PraveshKoirala commented 2 years ago

TLDR

When fit_transform is called, the output looks shuffled and is not in sync with the input. This does not happen when transform() is called after fitting. This behaviour should either be explained in the documentation or fixed so that it gives the expected (i.e. ordered) results, meaning encoded values that follow the same order as the categories.

Example

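import pandas as pd
import category_encoders as ce

cats = pd.DataFrame({'X':['a', 'a', 'd', 'd', 'd', 'f', 'f'], 'id': ['a', 'a', 'd', 'd', 'd', 'f', 'f'], 'y' : [0, 0, 1, 0, 1, 1, 1]})
y = cats['y']  # assumption: the target is the 'y' column; y was not defined in the original snippet
encoder = ce.CatBoostEncoder(cols=['X'])
encoder.fit(cats, y, return_df=True)
encoder.transform(cats)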

gives

image

cats = pd.DataFrame({'X':['a', 'a', 'd', 'd', 'd', 'f', 'f'], 
                     'id': ['a', 'a', 'd', 'd', 'd', 'f', 'f'], 
                     'y' : [0, 0, 1, 0, 1, 1, 1]})
y = cats['y']  # assumption: the target is the 'y' column; y was not defined in the original snippet
encoder = ce.CatBoostEncoder(cols=['X'])
encoder.fit_transform(cats, y, return_df=True)

gives (NOTICE THE DIFFERENT VALUES FOR THE TWO a's):

image

PaulWestenthanner commented 2 years ago

Hi @PraveshKoirala

This is not a bug. fit_transform calls transform(X, y), i.e. with the target information. As stated in the documentation of the CatBoostEncoder transform method:

y : array-like, shape = [n_samples] when transform by leave one out
    None, when transform without target information (such as transform test set)

When y is passed, the encoding of each row always leaves out that row's own target value. Hence we expect to see some differences. Indeed,

cats = pd.DataFrame({'X':['a', 'a', 'd', 'd', 'd', 'f', 'f'], 'id': ['a', 'a', 'd', 'd', 'd', 'f', 'f'], 'y' : [0, 0, 1, 0, 1, 1, 1]})
y = cats['y']  # assumption: the target is the 'y' column; y was not defined in the original snippet
encoder = ce.CatBoostEncoder(cols=['X'])
encoder.fit(cats, y, return_df=True)
encoder.transform(cats, y)

gives the same result as fit_transform. Does this make sense to you?
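To make the difference between the two modes concrete, here is a minimal pure-Python sketch. It is an illustration under assumptions, not the library's code: the function names `encode_with_target` and `encode_without_target`, the smoothing weight `a=1.0`, and the use of the global target mean as the prior are all choices made for this example.

```python
from collections import defaultdict

def encode_with_target(cats, ys, a=1.0):
    """Training-style encoding: each row sees only earlier rows of its own category."""
    prior = sum(ys) / len(ys)  # global target mean, used as the prior
    sums, counts = defaultdict(float), defaultdict(int)
    out = []
    for c, y in zip(cats, ys):
        # The current row's own target is not included yet.
        out.append((sums[c] + prior * a) / (counts[c] + a))
        sums[c] += y
        counts[c] += 1
    return out

def encode_without_target(cats, train_cats, train_ys, a=1.0):
    """Test-style encoding: every row of a category gets the same full-data statistic."""
    prior = sum(train_ys) / len(train_ys)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(train_cats, train_ys):
        sums[c] += y
        counts[c] += 1
    # .get() keeps unseen categories from being added; they fall back to the prior.
    return [(sums.get(c, 0.0) + prior * a) / (counts.get(c, 0) + a) for c in cats]

X = ['a', 'a', 'd', 'd', 'd', 'f', 'f']
y = [0, 0, 1, 0, 1, 1, 1]

with_y = encode_with_target(X, y)            # analogous to transform(X, y)
without_y = encode_without_target(X, X, y)   # analogous to transform(X)
```

In this sketch the two 'a' rows receive different values when the target is used and identical values when it is not, which mirrors the behaviour reported above.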

PaulWestenthanner commented 2 years ago

I'm not quite sure, though, why our implementation uses cumsum and cumcount here. With them, the output depends on the ordering of the input. I'm not very deep into the CatBoost algorithm, but I know that our implementation differs in some points from the CatBoost paper (and the "official" Yandex implementation). Feel free to dig into it if you have time. These should be the relevant lines: https://github.com/scikit-learn-contrib/category_encoders/blob/12e20486f4422a56c802a0e04163a896271d4107/category_encoders/cat_boost.py#L269-L280

glevv commented 2 years ago

@PaulWestenthanner your question is connected to #337. cumsum and cumcount introduce a dependence on row order. That is why the existing category_encoders implementation of CatBoostEncoder is a time-aware implementation, so the data should be sorted by a datetime column. I guess this fact should be mentioned in the docs. If the data has no time column or is not time-sorted, it should still work fine (as written in the comments in the code). The earlier implementation of CatBoostEncoder used a LOO scheme to permute the data, so it did not depend on sorting (time-unaware, or has_time=False in CatBoost).
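The order dependence can be sketched in plain Python. This is a hedged illustration, not the code of either implementation: `ordered_encode` imitates a running cumsum/cumcount statistic, `leave_one_out_encode` is a generic leave-one-out statistic (not necessarily what the earlier encoder did), and the helper `encode_in_order`, the weight `a=1.0`, and the global-mean prior are assumptions made for the example.

```python
from collections import defaultdict

def ordered_encode(cats, ys, a=1.0):
    """Running statistic (cumsum / cumcount style): uses only rows seen so far."""
    prior = sum(ys) / len(ys)
    sums, counts = defaultdict(float), defaultdict(int)
    out = []
    for c, y in zip(cats, ys):
        out.append((sums[c] + prior * a) / (counts[c] + a))
        sums[c] += y
        counts[c] += 1
    return out

def leave_one_out_encode(cats, ys, a=1.0):
    """Leave-one-out statistic: uses all other rows of the same category."""
    prior = sum(ys) / len(ys)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(cats, ys):
        sums[c] += y
        counts[c] += 1
    return [(sums[c] - y + prior * a) / (counts[c] - 1 + a) for c, y in zip(cats, ys)]

def encode_in_order(encode, cats, ys, order):
    """Process rows in `order`, then report each value at its original row position."""
    enc = encode([cats[i] for i in order], [ys[i] for i in order])
    result = [None] * len(cats)
    for pos, i in enumerate(order):
        result[i] = enc[pos]
    return result

X = ['a', 'a', 'd', 'd', 'd', 'f', 'f']
y = [0, 0, 1, 0, 1, 1, 1]
forward = list(range(len(X)))
backward = forward[::-1]

# True if processing the same rows in reverse order changes any row's encoding.
ordered_changes = (encode_in_order(ordered_encode, X, y, forward)
                   != encode_in_order(ordered_encode, X, y, backward))
loo_changes = (encode_in_order(leave_one_out_encode, X, y, forward)
               != encode_in_order(leave_one_out_encode, X, y, backward))
```

With this data the running statistic gives different encodings when the rows are processed in reverse, while the leave-one-out statistic does not, which is why sorting matters for a time-aware scheme.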

glevv commented 1 year ago

I think we should update the CatBoostEncoder documentation with this and #337 taken into account.