Open fullflu opened 5 years ago
I wonder if, at least, the following updates are necessary: implement the inverse_transform function
Inverse transform is not necessary.
handle the impute parameter
The functionality of this argument is being overhauled, so it can be ignored for now.
implement not only an 'or'-type delimiter but also an 'and'-type delimiter
This is up to you.
reflect the latest research in the field of data mining.
It would be nice if the code included a reference to some article or a blog post that would illustrate on a trivial example how the encoder works.
Thank you for your response.
It would be nice if the code included a reference to some article or a blog post that would illustrate on a trivial example how the encoder works.
OK, I will write a blog post that uses this encoder and reflects the discussion in the pull request.
I propose implementing a simple multi-hot encoding that accepts ambiguous input and outputs non-negative values.
Let x_j be a realization of a student's department. Usually, we assume that x_j is defined without ambiguity, such as mathematics or physics. In real-world datasets, however, we sometimes know only an ambiguous value, such as sciences. I want to encode such ambiguous categorical features.
I implemented the fit and transform functions and tested them on several cases (WIP). I focused on the case in which an ambiguous value is represented with a delimiter (for example, 'mathematics|physics', which means x_j is mathematics or physics).
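To make the 'or'-type delimiter case concrete, here is a minimal sketch of how fit and transform could behave for a single column. It is not the code from the pull request: the class name MultiHotSketch, the categories_ attribute, and the choice to ignore categories unseen during fit are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


class MultiHotSketch:
    """Hypothetical single-column multi-hot encoder (illustration only)."""

    def __init__(self, delimiter: str = "|") -> None:
        self.delimiter = delimiter
        self.categories_: List[str] = []
        self._index: Dict[str, int] = {}

    def fit(self, values: Sequence[str]) -> "MultiHotSketch":
        # Collect every unambiguous category that appears, alone or
        # inside a delimited (ambiguous) value.
        seen = set()
        for value in values:
            seen.update(value.split(self.delimiter))
        self.categories_ = sorted(seen)
        self._index = {cat: i for i, cat in enumerate(self.categories_)}
        return self

    def transform(self, values: Sequence[str]) -> List[List[int]]:
        rows = []
        for value in values:
            row = [0] * len(self.categories_)
            for part in value.split(self.delimiter):
                idx = self._index.get(part)
                if idx is not None:  # assumption: unseen categories are ignored
                    row[idx] = 1
            rows.append(row)
        return rows


encoder = MultiHotSketch().fit(["mathematics", "physics", "mathematics|physics"])
encoded = encoder.transform(["mathematics", "mathematics|physics", "chemistry"])
print(encoder.categories_)  # ['mathematics', 'physics']
print(encoded)              # [[1, 0], [1, 1], [0, 0]]
```

An unambiguous value produces an ordinary one-hot row, while 'mathematics|physics' sets both columns to 1, so every output stays non-negative.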
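I hope you will discuss the potential and usefulness of this type of implementation. I wonder if, at least, the following updates are necessary:
implement the inverse_transform function
handle the impute parameter
implement not only an 'or'-type delimiter but also an 'and'-type delimiter
reflect the latest research in the field of data mining.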
The differences between the multi-hot encoder and the related issues may be as follows:
77 : I simply encode ambiguous|dirty features in the same way as one-hot encoding. I did not consider similarity.
136 : Target encoding may be useful for ambiguous|dirty features, but I focus on simple multi-hot encoding.
The merit of multi-hot encoding is its simplicity and efficiency.
Soon I will send a pull request. I look forward to hearing from you.