scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License
2.41k stars 396 forks

Target Encoder Giving Nan values #340

Closed shauryauppal closed 2 years ago

shauryauppal commented 2 years ago

Expected Behavior

Target Encoder gives NaN values for a few inputs.

Same Issue:

PaulWestenthanner commented 2 years ago

Hi @shauryauppal, could you please also provide a dataset or, even better, a self-contained minimal reproducible example? The dataset is mentioned in neither the Stack Overflow post nor the Kaggle post (except for a reference to the Kaggle housing prices competition, which I can't seem to find).

glevv commented 2 years ago

Maybe the sigmoid function is numerically unstable in some cases. Using something like scipy.special.expit((stats['count'] - self.min_samples_leaf) / self.smoothing) here https://github.com/scikit-learn-contrib/category_encoders/blob/02a20aa96c5f1f234ec89a0f781980622e3b193a/category_encoders/target_encoder.py#L170 could be beneficial in terms of both speed and stability. It would introduce a dependency on scipy, though.

But without a minimal reproducible example, it's like looking for a needle in a haystack.
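To make the expit suggestion concrete, here is a minimal sketch comparing a hand-written sigmoid with scipy.special.expit. The helper name naive_sigmoid and the test inputs are illustrative assumptions, not code from category_encoders. It suggests that expit avoids the overflow inside np.exp for extreme inputs, but that it cannot repair an argument that is already NaN (for example, one produced by 0 / 0).

```python
import numpy as np
from scipy.special import expit


def naive_sigmoid(x):
    """Hand-written sigmoid using np.exp (illustrative, not the library's code)."""
    return 1 / (1 + np.exp(-x))


extreme = np.array([-1000.0, 0.0, 1000.0])

# np.exp(1000) overflows to inf; the final result is still finite,
# so the overflow warning is silenced for this comparison.
with np.errstate(over="ignore"):
    naive = naive_sigmoid(extreme)

# expit evaluates the same function without the intermediate overflow.
stable = expit(extreme)

# A 0 / 0 argument is NaN before the sigmoid is ever applied,
# and expit simply propagates it.
with np.errstate(invalid="ignore"):
    undefined = np.float64(0.0) / np.float64(0.0)
nan_through_expit = expit(undefined)

print(naive, stable, nan_through_expit)
```

Under these assumptions both versions agree on the finite inputs, so a switch to expit would mainly remove warnings rather than change values.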

MR0205 commented 2 years ago

@GLevV

NumPy evaluates the division of 0 by 0 as np.nan:

np.divide(0,0)

/tmp/ipykernel_2187/955440422.py:1: RuntimeWarning: invalid value encountered in true_divide
  np.divide(0,0)
nan

Thus if two conditions hold:

  1. self.min_samples_leaf == stats['count'] and
  2. self.smoothing == 0,

then both variables, 'smoove' and 'smoothing', in the formulas:

smoove = 1 / (1 + np.exp(-(stats['count'] - self.min_samples_leaf) / self.smoothing))
smoothing = prior * (1 - smoove) + stats['mean'] * smoove

would be equal to np.nan, giving the value np.nan for the category.

Note: We also need to take into account that the current implementation has a bizarre line, smoothing[stats['count'] == 1] = prior, which prevents a NaN value from appearing for a category that occurs in only a single row.
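The arithmetic above can be traced with plain NumPy. The sketch below restates the two quoted formulas in a standalone function; the name smoothed_mean and the sample counts and means are assumptions made for illustration, and the count == 1 override is deliberately left out so that the raw formula is visible.

```python
import numpy as np


def smoothed_mean(count, mean, prior, min_samples_leaf, smoothing):
    """Standalone restatement of the two quoted formulas (not the library's code)."""
    count = np.asarray(count, dtype=float)
    mean = np.asarray(mean, dtype=float)
    # Silence the warnings so the resulting values can be inspected directly.
    with np.errstate(divide="ignore", invalid="ignore", over="ignore"):
        smoove = 1 / (1 + np.exp(-(count - min_samples_leaf) / smoothing))
    return prior * (1 - smoove) + mean * smoove


counts = np.array([1.0, 2.0, 3.0])  # below, equal to, above min_samples_leaf
means = np.array([1.0, 0.5, 0.0])   # per-category target means

with_zero = smoothed_mean(counts, means, prior=0.5, min_samples_leaf=2, smoothing=0.0)
with_positive = smoothed_mean(counts, means, prior=0.5, min_samples_leaf=2, smoothing=1.0)

print(with_zero)
print(with_positive)
```

With smoothing set to 0, only the category whose count equals min_samples_leaf becomes NaN: a smaller count collapses to the prior and a larger count to the category mean, because the exponent becomes +inf or -inf rather than 0 / 0. Any strictly positive smoothing keeps all three values finite.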

Example:

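import pandas as pd
from category_encoders.target_encoder import TargetEncoder

# One category 'a' that appears exactly min_samples_leaf (= 2) times.
X = pd.DataFrame({'A': ['a', 'a']})
y = pd.Series([0, 1])
# smoothing=0 makes the exponent 0 / 0, so the encoded value is NaN.
TargetEncoder(smoothing=0, min_samples_leaf=2).fit_transform(X, y)
#   A
#0  NaN
#1  NaN

PaulWestenthanner commented 2 years ago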

Neither stats["count"] nor self.smoothing should be 0. The former cannot even be 0, while for the latter the documentation clearly states: "The value must be strictly bigger than 0." Without a reproducible example from @shauryauppal, we cannot do anything here.
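For readers who want to enforce the documented constraint in their own code before constructing the encoder, a guard along these lines is one option. The function check_smoothing is hypothetical and is not part of category_encoders.

```python
def check_smoothing(smoothing):
    """Hypothetical guard: reject a smoothing value that is not strictly positive."""
    value = float(smoothing)
    # "not value > 0" also rejects NaN, which fails every comparison.
    if not value > 0:
        raise ValueError(
            f"smoothing must be strictly bigger than 0, got {smoothing!r}"
        )
    return value


accepted = check_smoothing(1.0)

try:
    check_smoothing(0)
    rejected = False
except ValueError:
    rejected = True

print(accepted, rejected)
```

Failing early with a clear message turns a silent NaN in the encoded column into an explicit error at the call site.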