mindsdb / lightwood

Lightwood is Legos for Machine Learning.
GNU General Public License v3.0

[ENH] Handle n_classes > 2 in binary encoder as unknowns #1151

Closed paxcema closed 1 year ago

paxcema commented 1 year ago

Motivation: type_infer works on a sample of the data, so a categorical column with an unusual class distribution can be inferred as binary when only two of its classes show up in the sample. This isn't really a bug, it's inherent to sample-based type inference. We hit this recently during internal testing with the OpenML "micro mass" dataset.
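To illustrate how a sample can hide a class, here is a minimal sketch. It does not use type_infer itself: the column, the `distinct_in_sample` helper, and the "first N rows" sampling are all hypothetical stand-ins chosen to keep the example deterministic.

```python
# Hypothetical column: two common classes plus one rare class at the end.
column = ["a", "b"] * 495 + ["c"] * 10  # 1000 rows, 3 distinct classes


def distinct_in_sample(values, sample_size):
    """Count distinct values in a sample (here simply the first rows)."""
    return len(set(values[:sample_size]))


sampled_classes = distinct_in_sample(column, 100)  # only "a" and "b" appear
full_classes = len(set(column))                    # "a", "b" and "c"

# A sample-based check sees two classes and would call the column binary,
# even though the full column is multiclass.
looks_binary = sampled_classes == 2
print(sampled_classes, full_classes, looks_binary)
```

Any sampling scheme can miss a sufficiently rare class, which is why the inferred type can disagree with the full data.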

If this edge case triggers, our current implementation of the binary encoder will fail. This PR adds a small fix so that it no longer does. The new (unseen) classes are not mapped to either of the two known classes, so the user is still nudged towards overriding the encoder with a multiclass one, but at least the predictor is able to finish training successfully.
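The behaviour described above could be sketched roughly as follows. This is not lightwood's actual binary encoder: `BinaryEncoderSketch`, its method names, and the choice of an all-zero row for unknown classes are assumptions made for illustration only.

```python
from typing import Dict, Hashable, List, Sequence


class BinaryEncoderSketch:
    """Hypothetical two-class encoder; not lightwood's real implementation."""

    def __init__(self) -> None:
        self.index: Dict[Hashable, int] = {}

    def prepare(self, values: Sequence[Hashable]) -> None:
        # Learn the two classes seen in the (possibly sampled) training data.
        classes = sorted(set(values), key=str)
        if len(classes) != 2:
            raise ValueError(f"expected exactly 2 classes, got {len(classes)}")
        self.index = {cls: i for i, cls in enumerate(classes)}

    def encode(self, values: Sequence[Hashable]) -> List[List[float]]:
        encoded = []
        for value in values:
            row = [0.0, 0.0]
            pos = self.index.get(value)  # None for classes unseen in prepare()
            if pos is not None:
                row[pos] = 1.0
            # Assumption: unseen classes keep the all-zero row instead of raising.
            encoded.append(row)
        return encoded


encoder = BinaryEncoderSketch()
encoder.prepare(["a", "b", "a", "b"])
rows = encoder.encode(["a", "b", "c"])  # "c" was never seen during prepare()
print(rows)
```

In this sketch the unseen class `"c"` carries no signal for either known class, which matches the intent of the fix: training completes, but accurate predictions for extra classes still require a multiclass encoder.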