Closed shauryauppal closed 2 years ago
Hi @shauryauppal, could you please also provide a dataset or, even better, a self-contained minimal reproducible example? Neither the Stack Overflow post nor the Kaggle post mentions the dataset (except for a reference to the Kaggle housing prices competition, which I can't seem to find).
Maybe the sigmoid function is numerically unstable in some cases, and using something like
scipy.special.expit((stats['count'] - self.min_samples_leaf) / self.smoothing)
in here
https://github.com/scikit-learn-contrib/category_encoders/blob/02a20aa96c5f1f234ec89a0f781980622e3b193a/category_encoders/target_encoder.py#L170
could be beneficial in terms of both speed and stability. It would introduce a dependency on scipy, though.
But without a minimal reproducible example, it's like looking for a needle in a haystack.
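To make the stability point concrete, here is a minimal sketch comparing a hand-written sigmoid with scipy.special.expit on extreme inputs. The helper name naive_sigmoid and the test values are made up for illustration; this is not the library's code, just the same 1 / (1 + exp(-x)) shape.

```python
import warnings

import numpy as np
from scipy.special import expit


def naive_sigmoid(x):
    # Same shape as the expression in target_encoder.py: 1 / (1 + exp(-x)).
    return 1.0 / (1.0 + np.exp(-x))


# Illustrative inputs only; -1000 makes np.exp(1000) overflow to inf.
x = np.array([-1000.0, -10.0, 0.0, 10.0, 1000.0])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    naive = naive_sigmoid(x)

# expit evaluates the same function without the intermediate overflow.
stable = expit(x)

overflowed = any(issubclass(w.category, RuntimeWarning) for w in caught)
print("naive version raised an overflow warning:", overflowed)
print("results agree:", np.allclose(naive, stable))
```

On these inputs both versions should agree numerically; the difference is the overflow warning. Note that expit(nan) is still nan, so switching to expit alone would not help if the argument itself is 0/0.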
@GLevV
NumPy evaluates division of 0 by 0 as np.nan:
np.divide(0,0)
/tmp/ipykernel_2187/955440422.py:1: RuntimeWarning: invalid value encountered in true_divide
np.divide(0,0)
nan
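Thus, if two conditions hold (self.smoothing is 0 and a category's stats['count'] equals self.min_samples_leaf, as in the example below), the two variables 'smoove' and 'smoothing' in the formulas: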
smoove = 1 / (1 + np.exp(-(stats['count'] - self.min_samples_leaf) / self.smoothing))
smoothing = prior * (1 - smoove) + stats['mean'] * smoove
would be equal to np.nan, giving the value np.nan for the category.
Note: we also need to take into account that the current implementation has a bizarre line:
smoothing[stats['count'] == 1] = prior
that prevents a NaN value from appearing for a category that appears in only a single row.
Example:
import pandas as pd
from category_encoders.target_encoder import TargetEncoder
X = pd.DataFrame({'A': ['a', 'a']})
y = pd.Series([0, 1])
TargetEncoder(smoothing=0, min_samples_leaf=2).fit_transform(X, y)
# A
#0 NaN
#1 NaN
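The arithmetic behind that result can be traced with plain NumPy, without category_encoders installed. This is only a sketch of the two formula lines quoted above; the arrays and the names count, mean, prior and smoothing_param are made-up stand-ins, and the count == 1 override is deliberately left out.

```python
import numpy as np

# Hypothetical per-category statistics (not taken from any real dataset).
count = np.array([1.0, 2.0, 3.0])
mean = np.array([0.0, 0.5, 1.0])
prior = 0.4
min_samples_leaf = 2
smoothing_param = 0.0  # the problematic setting

with np.errstate(divide="ignore", invalid="ignore", over="ignore"):
    # count < min_samples_leaf  -> +inf, count > min_samples_leaf -> -inf,
    # count == min_samples_leaf -> 0/0 = nan
    exponent = -(count - min_samples_leaf) / smoothing_param
    smoove = 1 / (1 + np.exp(exponent))
    encoded = prior * (1 - smoove) + mean * smoove

print(encoded)
```

Under these assumptions only the middle category, whose count equals min_samples_leaf, ends up as nan; the other two fall back to the prior and to the category mean respectively.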
Neither stats["count"] nor self.smoothing should be 0. The former cannot even be 0, while for the latter the documentation clearly states "The value must be strictly bigger than 0". Without a reproducible example from @shauryauppal, we cannot do anything here.
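For what it's worth, the documented constraint on smoothing could be enforced up front with a small guard. The sketch below is purely hypothetical: check_smoothing is not a function in category_encoders, just one possible way to turn a silent NaN into an explicit error.

```python
def check_smoothing(smoothing):
    """Hypothetical helper: reject smoothing values that are not strictly positive."""
    value = float(smoothing)
    # "not value > 0" also rejects nan, since any comparison with nan is False.
    if not value > 0:
        raise ValueError(
            f"smoothing must be strictly greater than 0, got {smoothing!r}"
        )
    return value


print(check_smoothing(1.0))
try:
    check_smoothing(0)
except ValueError as err:
    print("rejected:", err)
```

Whether the encoder should raise or merely warn in this case is a design decision for the maintainers; the sketch only shows the check itself.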
Expected Behavior
Target Encoder gives NaN values for a few inputs
Same Issue:
https://www.kaggle.com/questions-and-answers/204970