scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License

[Question; need help; support request] Possible to join multiple CountEncoders after parallel (multiprocessing) fitting? #440

Open HWiese1980 opened 1 month ago

HWiese1980 commented 1 month ago

We have a large data frame on which we want to fit a CountEncoder. We would like to make use of the multiple cores of our machine by splitting the DF into multiple chunks and fitting (among other things) a CountEncoder on each chunk.

After that, the individual CountEncoder objects have to be joined into one big CountEncoder, as if it had been fitted on the whole data frame.

Can this be done? If yes, how can we do that?

PaulWestenthanner commented 1 month ago

This is not supported out of the box. Are you planning to use the CountEncoder with normalize=True? Would it be possible to fit on a random subset only? I'd expect the results to be similar to those for the whole dataset. If you want to use the full dataset, you need to implement something yourself. If you fit multiple CountEncoders, make sure they all use the same OrdinalEncoder (the CountEncoder first fits an OrdinalEncoder to encode e.g. "foo", "bar" to 1, 2 and hence standardize the input). You'd want to pass that fitted OrdinalEncoder in the init rather than fit it in the fit function. Writing a combine function that adds up the counts should be rather straightforward then.
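
To illustrate the "add up the counts" idea, here is a minimal sketch that deliberately does not touch category_encoders internals: it counts categories per chunk with the standard library's `collections.Counter` and merges the per-chunk tables by addition. The function names (`count_chunk`, `combine_counts`, `fit_counts`, `encode`) and the `map_fn` parameter are hypothetical, chosen for this example only. The sketch also does not reproduce CountEncoder's handling of missing or unknown values, so treat it as a starting point rather than a drop-in replacement.

```python
from collections import Counter
from functools import partial, reduce


def count_chunk(chunk, cols):
    """Count category occurrences per column in one chunk.

    `chunk` can be anything indexable by column name whose columns
    yield their values when iterated (a dict of lists, or a DataFrame).
    """
    return {col: Counter(chunk[col]) for col in cols}


def combine_counts(left, right):
    """Merge two per-column count tables by adding the counts."""
    return {col: left[col] + right[col] for col in left}


def fit_counts(chunks, cols, map_fn=map):
    """Count every chunk via `map_fn`, then reduce to a single table.

    `map_fn` is an assumption of this sketch: the built-in `map` runs
    serially; a pool's `map` method could be passed in instead.
    """
    per_chunk = map_fn(partial(count_chunk, cols=cols), chunks)
    empty = {col: Counter() for col in cols}
    return reduce(combine_counts, per_chunk, empty)


def encode(values, counts, default=0):
    """Replace each value by its combined count (unseen values get `default`)."""
    return [counts.get(v, default) for v in values]


# Toy data standing in for two chunks of a larger data frame.
chunks = [
    {"color": ["red", "blue", "red"]},
    {"color": ["blue", "green"]},
]
counts = fit_counts(chunks, ["color"])
encoded = encode(["red", "green", "pink"], counts["color"])
```

Because the counting function is a module-level function wrapped in `functools.partial`, it should be picklable, so something like `concurrent.futures.ProcessPoolExecutor().map` could be passed as `map_fn` to spread the chunks over several cores. For a normalized encoding, the combined counts of a column can be divided by their sum, which gives the same result as normalizing over the whole data set.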