scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License

[DOC] Gaussian noise regularization: multiplicative or additive? #333

Open cmougan opened 2 years ago

cmougan commented 2 years ago

In the documentation of the Gaussian noise regularization it says:

adds normal (Gaussian) distribution noise into training data in order to decrease overfitting (testing data are untouched). Sigma gives the standard deviation (spread or "width") of the normal distribution. The optimal value is commonly between 0.05 and 0.6. The default is to not add noise, but that leads to significantly suboptimal results.

But then the operation is a multiplication:

if self.sigma is not None and y is not None:
    X[col] = X[col] * random_state_.normal(1., self.sigma, X[col].shape[0])

I am not sure if it should be like this. With multiplicative noise, the perturbation becomes less relevant as the encoded value gets close to 0. It also produces negative encoded values very quickly.

Would this be better?

X[col] = X[col] + random_state_.normal(0, self.sigma, X[col].shape[0])

If addition is better, then the sigma parameter needs to change for regression tasks and should not be limited to 1.
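To make the difference concrete, here is a minimal sketch comparing the two variants on constant encoded values. The helper names `multiplicative_noise` and `additive_noise` are hypothetical, not part of category_encoders, and the sketch assumes NumPy's `default_rng` in place of the encoder's `random_state_`. It only illustrates how the spread of the noisy values depends on the size of the encoded value.

```python
import numpy as np


def multiplicative_noise(values, sigma, rng):
    # Hypothetical helper: scale each value by a factor drawn from N(1, sigma),
    # mirroring the multiplication in the snippet quoted above.
    return values * rng.normal(1.0, sigma, values.shape[0])


def additive_noise(values, sigma, rng):
    # Hypothetical helper: shift each value by an offset drawn from N(0, sigma),
    # mirroring the proposed "+" alternative.
    return values + rng.normal(0.0, sigma, values.shape[0])


if __name__ == "__main__":
    rng = np.random.default_rng(0)  # assumption: any seeded generator would do
    sigma = 0.3
    n = 100_000

    for encoded in (0.0, 0.05, 0.5, 10.0):
        values = np.full(n, encoded)
        mult = multiplicative_noise(values, sigma, rng)
        add = additive_noise(values, sigma, rng)
        # The multiplicative spread grows with |encoded|; the additive spread does not.
        print(f"encoded={encoded}: std(mult)={mult.std():.3f}, std(add)={add.std():.3f}")
```

In this sketch the multiplicative variant has a standard deviation of roughly `sigma * |encoded value|`, so it vanishes at 0 and grows with the scale of the target, while the additive variant has a standard deviation of roughly `sigma` regardless of the encoded value.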

PaulWestenthanner commented 2 years ago

Hi Carlos, I'm not exactly sure what your point of concern is here. Is it

  1. a wording issue: since the documentation says "adds noise", does the noise have to be additive? I'm not a native English speaker, but to me "adding noise" can also mean applying noise in a multiplicative fashion.
  2. a technical issue: additive noise makes more sense than multiplicative noise.

In case of 1, I think it might be good enough to clarify in the documentation.
In case of 2: LOO and target encoding are for regression and binary classification problems. The noise is added to the encoded values (i.e. some variation of category means). For binary classification I fully agree that additive noise makes more sense, precisely because of the issue around 0 that you're pointing out. For regression problems I can imagine scenarios where, depending on the nature of the problem, either multiplicative or additive noise makes sense. The regressor used afterwards might also change whether you want additive or multiplicative noise. I don't think there is a one-size-fits-all solution.

cmougan commented 2 years ago

Hi Paul!

Apparently, some researchers will expect "+" when you say noise is added. But overall I believe the current way is better: there is no one-size-fits-all solution, and it does not make sense to add one more hyperparameter for this. Actually, "recent work" has provided some empirical evidence that smoothing might be better in general.

If there is nicer wording, it might be better to clarify the documentation. Some suggestions: "includes", "incorporates".