scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License
2.41k stars, 395 forks

Multi-hot encoding for ambiguous input #161

Open fullflu opened 5 years ago

fullflu commented 5 years ago

I propose implementing a simple multi-hot encoding that allows ambiguous input and outputs non-negative values.

Let x_j be a realization of a student's department. Usually, we assume that x_j is known without ambiguity, e.g. mathematics or physics. In real-world datasets, however, we sometimes know only an ambiguous value, such as sciences. I want to encode such ambiguous categorical features.

I implemented fit and transform functions and tested them on several cases (WIP). I focused on the case where an ambiguous value is represented with a delimiter (e.g. 'mathematics|physics', which means x_j is mathematics or physics).
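To make the idea concrete, here is a minimal sketch of delimiter-based multi-hot encoding in plain Python. It is only an illustration under stated assumptions: the class name SimpleMultiHotEncoder and its attributes are hypothetical, it is not the category_encoders API, and it does not reproduce the work-in-progress implementation mentioned above. Each column corresponds to one atomic category, and an ambiguous value sets a 1 in every column it could refer to.

```python
class SimpleMultiHotEncoder:
    """Hypothetical illustration only; not part of category_encoders."""

    def __init__(self, delimiter="|"):
        # Assumption: the delimiter has 'or' semantics ("a|b" means a or b).
        self.delimiter = delimiter
        self.categories_ = []

    def fit(self, values):
        # Collect every atomic category seen in the training values.
        seen = set()
        for value in values:
            seen.update(value.split(self.delimiter))
        self.categories_ = sorted(seen)
        return self

    def transform(self, values):
        # One row per value, one column per category.
        # A 1 marks each category the value could refer to.
        rows = []
        for value in values:
            candidates = set(value.split(self.delimiter))
            rows.append([1 if c in candidates else 0 for c in self.categories_])
        return rows


departments = ["mathematics", "physics", "mathematics|physics", "chemistry"]
encoder = SimpleMultiHotEncoder().fit(departments)
encoded = encoder.transform(departments)

print(encoder.categories_)
for department, row in zip(departments, encoded):
    print(department, row)
```

In this sketch, categories that were not seen during fit are silently encoded as all zeros; a real encoder would need an explicit policy for unknown values.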

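I would like to discuss the potential and usefulness of this kind of implementation. I wonder whether at least the following updates are necessary:

implement inverse_transform function

handle impute parameter

implement not only 'or' type delimiter, but also 'and' type delimiter

reflect latest research in the field of data mining.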

The difference between multi-hot encoder and the related issues may be as follows:

The merits of multi-hot encoding are its simplicity and efficiency.

Soon I will send a pull request. I look forward to hearing from you.

janmotl commented 5 years ago

I wonder if, at least, following updates are necessary: implement inverse_transform function

Inverse transform is not necessary.

handle impute parameter

The functionality of this argument is currently being overhauled, so it can be ignored for now.

implement not only 'or' type delimiter, but also 'and' type delimiter

This is up to you.

reflect latest research in the field of data mining.

It would be nice if the code included a reference to some article or a blog post that would illustrate on a trivial example how the encoder works.

fullflu commented 5 years ago

Thank you for your response.

It would be nice if the code included a reference to some article or a blog post that would illustrate on a trivial example how the encoder works.

OK, I will write a blog post that uses this encoder and reflects the discussion in the pull request.