scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License
2.4k stars 393 forks source link

Intercept in Contrast Coding Schemes #370

Open PaulWestenthanner opened 1 year ago

PaulWestenthanner commented 1 year ago

Expected Behavior

The constant (all values 1) intercept column should not be added when applying contrast coding schemes (i.e. backward difference, sum, polynomial and helmert coding)

I don't think this intercept column is needed. If you fit a supervised learning model it is probably gonna help to remove the intercept column. I think it is there because when fitting linear models with statsmodels you have to add the intercept.
However I don't like that the output of an encoder would then depend on whether the intercept column is already there or not, e.g. if I first apply encoder A on column A and then encoder B on column B the intercept column of B overwrite A's intercept column hence not adding a new column. Also if I have (for some reason) a column called intercept that is not constant it would get overwritten.

Any opinion? Am I missing something? Is the intercept necessary?

Actual Behavior

A constant column with all values 1 is added

Steps to Reproduce the Problem

Run transform on any fitted contrast coding encoder, e.g.

        train = ['A', 'B', 'C']
        encoder = encoders.BackwardDifferenceEncoder(handle_unknown='value', handle_missing='value')
        encoder.fit_transform(train)
glevv commented 1 year ago

Can intercept be added as a class parameter? If so, then this is the way to go, imo. Then these classes could be tested with different values of intercept to catch errors and bugs.

PaulWestenthanner commented 1 year ago

Yes we could I think, that should be rather straight forward as well. Would you set with_intercept=True as a default for backwards compatibility or not (which might be more correct)?

glevv commented 1 year ago

Yes we could I think, that should be rather straight forward as well. Would you set with_intercept=True as a default for backwards compatibility or not (which might be more correct)?

Yep, I would set it to True to keep the default behavior intact

glevv commented 1 year ago

Is this one still relevant?

PaulWestenthanner commented 1 year ago

I think it is, I'm slightly in favor of setting with_intercept to False as I don't think many people will need the intercept.
However I'd like to keep this issue open for a little while and hope for more voices from the community, maybe from actual users of the contrast coding schemes. I feel a little uneasy to take this decision by myself without further discussion

jacekkrawiec commented 1 year ago

user's perspective: imo encoders should do what their name says they do - encode existing columns. If there's a need to add an intercept, it has to be done consciously by the user. Especially for someone using statsmodels, where the default way is to use sm.add_constant. If you're in favor of keeping this feature as a parameter I'd suggest to make it False by default, otherwise I'd remove it completely.

PaulWestenthanner commented 1 year ago

Ok this seems to be a consensus, let's remove it - at least by default