Fix BaseNEncoder number of output columns

kerrickstaley commented 3 years ago

BaseNEncoder used an incorrect formula to calculate the number of digits (output columns) required. If there are nvals distinct values and we reserve one encoding to represent "missing or unknown", then the correct number of digits is ceil(log(nvals + 1, base)). However, the code was previously using the formula ceil(log(nvals, base)) + 1.
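
For illustration, here is a minimal sketch comparing the two formulas. The function names `columns_old` and `columns_new` are hypothetical and are not taken from the category_encoders source; they only transcribe the two expressions quoted above.

```python
import math

def columns_old(nvals, base):
    # Formula described as the previous behaviour: ceil(log(nvals, base)) + 1
    return math.ceil(math.log(nvals, base)) + 1

def columns_new(nvals, base):
    # Formula proposed in this PR: ceil(log(nvals + 1, base)),
    # where the "+ 1" is the code reserved for missing/unknown values
    return math.ceil(math.log(nvals + 1, base))

# Print nvals, old column count, new column count for base 2
for nvals in (5, 10, 100):
    print(nvals, columns_old(nvals, 2), columns_new(nvals, 2))
```

With 5 distinct values in base 2, the old formula gives 4 columns while the new one gives 3. Three binary digits can represent 8 codes, which is enough for the 5 values plus the reserved one.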

Fixes https://github.com/scikit-learn-contrib/category_encoders/issues/264

Proposed Changes

Change the formula to ceil(log(nvals + 1, base)).
Switch the formula to use integer math so we don't have to worry about floating point rounding errors.
Add a test.
Fix a non-deterministic test.
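
As a rough sketch of what the integer-math approach could look like (this is an assumed implementation, not the code merged in this PR, and `num_digits` is a hypothetical name): count how many times the capacity must be multiplied by base before it covers nvals + 1 codes. This avoids a float-based logarithm, which can land slightly above or below an exact integer when nvals + 1 is an exact power of base.

```python
def num_digits(nvals, base):
    """Smallest d such that base**d >= nvals + 1, using integer arithmetic only."""
    needed = nvals + 1  # one extra code reserved for missing/unknown
    digits = 0
    capacity = 1  # invariant: capacity == base**digits
    while capacity < needed:
        capacity *= base
        digits += 1
    return digits

print(num_digits(5, 2))    # 6 codes needed, fits in 3 binary digits
print(num_digits(124, 5))  # 125 codes needed, exactly 5**3
```

The loop runs once per output digit, so it is cheap even for large cardinalities, and the result is exact at power-of-base boundaries.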

kerrickstaley commented 3 years ago

@janmotl @wdm0006 this is ready for review, could you take a look?

wdm0006 commented 3 years ago

Looking back at old PRs, this looks good to me, but it looks like the test suite failed. The logs aren't available anymore; could you pull in any recent changes from master and push again to re-run? Thanks

wdm0006 commented 3 years ago

Looks like the basen encoding tests are passing fine, but there's an issue in master for the GLM encoding that's unrelated to this PR. Going to go ahead and merge this, thanks for the patience.


Fix BaseNEncoder number of output columns #296
