Multi-hot encoding for ambiguous input

fullflu commented 5 years ago

I propose implementing a simple multi-hot encoder that accepts ambiguous input and outputs non-negative values.

Let x_j be the realized value of a student's department. Usually we assume that x_j is defined without ambiguity: mathematics, physics, and so on. In real-world datasets, however, we sometimes know only an ambiguous value, such as sciences. I want to encode such ambiguous categorical features.

I have implemented fit and transform functions and tested them on several cases (WIP). I focused on the case in which an ambiguous value is written with a delimiter (for example 'mathematics|physics', which means x_j is mathematics or physics).
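To make the delimiter convention concrete, here is a minimal sketch of what such a fit/transform pair could look like. It is not the code from the pull request: the names `fit_categories` and `transform` are hypothetical, the input is a plain list rather than a DataFrame, and writing 1 for every candidate category (instead of some weighted non-negative value) is an assumption.

```python
from typing import Dict, List

DELIMITER = "|"  # assumed 'or' delimiter, as in 'mathematics|physics'


def fit_categories(values: List[str], delimiter: str = DELIMITER) -> Dict[str, int]:
    """Collect the unambiguous categories and give each one a column index."""
    categories = sorted({part for value in values for part in value.split(delimiter)})
    return {category: index for index, category in enumerate(categories)}


def transform(values: List[str], columns: Dict[str, int],
              delimiter: str = DELIMITER) -> List[List[int]]:
    """Encode each value as a row with a 1 in every column it may belong to."""
    rows = []
    for value in values:
        row = [0] * len(columns)
        for part in value.split(delimiter):
            if part in columns:  # categories unseen during fit stay all-zero here
                row[columns[part]] = 1
        rows.append(row)
    return rows


departments = ["mathematics", "physics", "mathematics|physics", "chemistry"]
columns = fit_categories(departments)   # chemistry=0, mathematics=1, physics=2
encoded = transform(departments, columns)
for department, row in zip(departments, encoded):
    print(department, row)
```

An unambiguous value gives an ordinary one-hot row, while 'mathematics|physics' sets both the mathematics and the physics column.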

I would like to discuss the potential and usefulness of this kind of implementation. I wonder whether, at a minimum, the following updates are necessary:

implement inverse_transform function
handle impute parameter
implement not only 'or' type delimiter, but also 'and' type delimiter
reflect latest research in the field of data mining.

The differences between the multi-hot encoder and the related issues are as follows:

77: I simply encode an ambiguous|dirty feature in the same way as one-hot encoding. I do not consider similarity.
136: Target encoding may be useful for an ambiguous|dirty feature, but I focus on simple multi-hot encoding.

The merits of multi-hot encoding are its simplicity and efficiency.

Simplicity: We only need to insert a delimiter string into the dirty categorical feature.
Efficiency: If we prepared a mapping from every ambiguous|dirty category to the unambiguous categories it covers, the j-th feature would need a lot of memory (O(2^C_j), where C_j is the cardinality of the j-th feature). Multi-hot encoding needs only O(C_j) memory.
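The memory argument can be checked with a small sketch. Here `explicit_mapping` is a hypothetical helper that enumerates every possible ambiguous value (every non-empty subset of the categories), which is the worst case an explicit lookup table would have to store. Multi-hot encoding only needs one column per unambiguous category.

```python
from itertools import combinations
from typing import Dict, List


def explicit_mapping(categories: List[str], delimiter: str = "|") -> Dict[str, List[str]]:
    """Map every possible ambiguous value to the categories it covers."""
    mapping = {}
    for size in range(1, len(categories) + 1):
        for subset in combinations(categories, size):
            mapping[delimiter.join(subset)] = list(subset)
    return mapping


categories = ["chemistry", "mathematics", "physics"]
mapping = explicit_mapping(categories)

# Multi-hot encoding stores only one column index per unambiguous category.
multi_hot_columns = {category: index for index, category in enumerate(categories)}

print(len(mapping))            # 2**3 - 1 = 7 entries
print(len(multi_hot_columns))  # 3 entries
```

With C_j = 3 the table already has 7 entries against 3 columns, and the gap doubles with every added category.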

I will send a pull request soon. I look forward to hearing from you.

janmotl commented 5 years ago

I wonder if, at least, following updates are necessary: implement inverse_transform function

Inverse transform is not necessary.

handle impute parameter

The functionality of this argument is being overhauled, so it can be ignored for now.

implement not only 'or' type delimiter, but also 'and' type delimiter

This is up to you.

reflect latest research in the field of data mining.

It would be nice if the code included a reference to some article or a blog post that would illustrate on a trivial example how the encoder works.

fullflu commented 5 years ago

Thank you for your response.

It would be nice if the code included a reference to some article or a blog post that would illustrate on a trivial example how the encoder works.

OK, I will write a blog post that uses this encoder and reflects the discussion in the pull request.

scikit-learn-contrib / category_encoders

Multi-hot encoding for ambiguous input #161
