The reason for this: due to sampling, categorical columns with weird distributions can be seen as binary by type_infer. This isn't really a bug, it's the nature of using samples for type inference. This happened recently during internal testing with the OpenML "micro mass" dataset.
Anyway, if this edge case triggers, our current implementation of the binary encoder will fail. This PR adds a slight fix so that it doesn't. The new (unseen) classes won't really map out to anything, so the user is still nudged towards overriding and using a multiclass encoder, but at least the predictor is able to finish training successfully.
The reason for this: due to sampling, categorical columns with weird distributions can be seen as binary by
type_infer
. This isn't really a bug, it's the nature of using samples for type inference. This happened recently during internal testing with the OpenML "micro mass" dataset.Anyway, if this edge case triggers, our current implementation of the binary encoder will fail. This PR adds a slight fix so that it doesn't. The new (unseen) classes won't really map out to anything, so the user is still nudged towards overriding and using a multiclass encoder, but at least the predictor is able to finish training successfully.