scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License
2.41k stars, 396 forks

Fix BaseNEncoder number of output columns #296

Closed. kerrickstaley closed this 3 years ago

kerrickstaley commented 3 years ago

BaseNEncoder used an incorrect formula for calculating the number of digits (bits, when base is 2) required in the output. If there are nvals distinct values and we reserve one encoding to represent "missing or unknown", then the correct number of digits is ceil(log(nvals + 1, base)). However, the code was previously using the formula ceil(log(nvals, base)) + 1, which yields one more output column than necessary whenever nvals is not an exact power of base.
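To make the difference between the two formulas concrete, here is a minimal sketch that compares them. It is not the library's code: the helper names `ceil_log`, `old_width`, and `new_width` are hypothetical, and the logarithm is computed with integer arithmetic as an assumption to avoid floating-point rounding in `math.log`.

```python
def ceil_log(n: int, base: int) -> int:
    """Return the smallest k >= 0 such that base**k >= n (integer-only)."""
    k, power = 0, 1
    while power < n:
        power *= base
        k += 1
    return k


def old_width(nvals: int, base: int) -> int:
    """Previous formula from the PR description: ceil(log(nvals, base)) + 1."""
    return ceil_log(nvals, base) + 1


def new_width(nvals: int, base: int) -> int:
    """Corrected formula: ceil(log(nvals + 1, base)).

    The +1 inside the log accounts for the reserved "missing or unknown" code.
    """
    return ceil_log(nvals + 1, base)


if __name__ == "__main__":
    # Print both widths side by side for a few category counts in base 2.
    for nvals in (3, 4, 7, 8):
        print(nvals, old_width(nvals, 2), new_width(nvals, 2))
```

For example, with base 2 and nvals = 7 the old formula asks for 4 columns while the corrected one asks for 3; the two agree only when nvals is an exact power of the base (such as 4 or 8 in base 2).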

Fixes https://github.com/scikit-learn-contrib/category_encoders/issues/264

kerrickstaley commented 3 years ago

@janmotl @wdm0006 this is ready for review, could you take a look?

wdm0006 commented 3 years ago

Looking back at old PRs, this looks good to me, but it looks like the test suite failed. The logs aren't available anymore; could you pull in any recent changes from master and push again to re-run the tests? Thanks

wdm0006 commented 3 years ago

Looks like the basen encoding tests are passing fine, but there's an issue in master with the GLM encoding that is unrelated to this PR. Going to go ahead and merge this, thanks for your patience.