rhiever / datacleaner

A Python tool that automatically cleans data sets and readies them for analysis.
MIT License

Integrate more encoding options for object columns #3

Closed wdm0006 closed 8 years ago

wdm0006 commented 8 years ago

It would be nice to be able to pass in an encoding type, to use something more than the default label encoding. I have a library, category encoders, which does that, and it can easily be added with one extra flag (suggested: -en for encoder).

I have a not-yet-tested implementation of this at:

https://github.com/wdm0006/datacleaner

It just carries over the available encoders:

A deeper look into the differences between these can be found here and here.

Let me know if you think this fits into your project, or if there are any changes I should make to my implementation or the library; I can work on those and send a PR.

rhiever commented 8 years ago

Looks very useful! Have you considered merging your functions (that don't overlap) into sklearn.preprocessing? I think sklearn would benefit from more preprocessors.

wrt merging into datacleaner: I have some concerns. For example, what would it do with continuous variables? Would it ignore them entirely?

I also have concerns with whether feature preprocessing is within the scope of datacleaner. I see data cleaning as a step before feature preprocessing, as shown below. Data cleaning entails encoding the data properly (usually: numerically), removing or imputing missing data, removing quirks from the data, and so on. Feature preprocessing is definitely an important step, but I see it more as part of the modeling step.

[Figure: pipeline diagram showing data cleaning as a step before feature preprocessing and modeling]

I've been developing TPOT to automate the parts that follow the data cleaning step, and would love to add more feature preprocessing operators if they add value beyond the ones already implemented in TPOT/sklearn.preprocessing. I've already "seen the light" wrt the power of using the right feature preprocessor.

BTW: Do you know about my sklearn benchmark project? I've evaluated about 30 million sklearn models so far, and would like to look into evaluating feature preprocessors on my ~180 data set benchmark as well.

wdm0006 commented 8 years ago

I have considered a PR into sklearn.preprocessing, and will probably try one in the coming weeks. I think I need to improve the documentation and write automated tests before that, though; for now I am dogfooding it.

Continuous variables would be ignored. The behavior is identical to your usage of label encoding: any column of dtype object would be encoded as categorical, while floats and ints would be assumed continuous (for good or for ill), so I don't think it makes any new assumptions (again, for good or for ill).
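As a minimal sketch of that dispatch (the column names and data are made up, and datacleaner's actual loop differs in its details):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'color': ['red', 'blue', 'red'],  # object dtype -> treated as categorical
                   'price': [1.5, 2.0, 3.25]})       # float dtype  -> assumed continuous

for column in df.columns:
    if str(df[column].values.dtype) == 'object':
        df[column] = LabelEncoder().fit_transform(df[column].values)

# 'color' becomes [1, 0, 1]; 'price' passes through untouched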

As far as applicability goes, it is definitely a grey area; here are a few points as to why I think it makes sense:

I'll look at TPOT. I've seen it but haven't used it for anything yet; it looks like an interesting project (the sklearn benchmark as well).

rhiever commented 8 years ago

I'm liking the sound of this a little more. The goal of datacleaner is to automatically put the data into a good state for analysis, but not necessarily to make major feature encoding decisions that would make it difficult for the practitioner to encode the data in their own way later. That's why I don't mind doing a direct string label --> numerical label encoding (numerical encodings are necessary for sklearn etc.), but I would avoid transforming the data into a one-hot feature representation.

Encoders like hashing can greatly reduce dataset size, which lets datacleaner produce a clean and (more) portable version of the dataset.

I'm very intrigued by this -- do you have a demo? Are there other things like this that we could do without affecting the basic feature representation?

wdm0006 commented 8 years ago

Sure, check out the tables in this post: beyond one-hot. They basically try to find high-scoring encoders with low dimensionality (fewer columns). In those cases, every column was categorical (strings), so all had to be encoded as numbers somehow.

The hashing encoder is not in those tables. Unlike the others, it encodes multiple columns at once and allows a configurable output dimensionality, so if you have 128 categorical input columns, you could encode them as 3 (or 10 or 20 or whatever) columns with the hashing trick. It might not be perfect, but it's smaller. Here you can see the performance degrading with really low-dimension outputs (hashing_2 and hashing_4 vs. 16+).
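To make that concrete, here is a minimal sketch using category_encoders' HashingEncoder (the data and the n_components value are invented for illustration):

import pandas as pd
from category_encoders import HashingEncoder

df = pd.DataFrame({'city': ['nyc', 'sf', 'la', 'sf'],
                   'color': ['red', 'blue', 'red', 'green']})

# Both categorical columns are collapsed into 4 hashed numeric columns,
# regardless of how many distinct values the inputs have
encoded = HashingEncoder(n_components=4).fit_transform(df)
print(encoded.head())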

I agree that the default for datacleaner should be bare-minimum encoding (so just ordinal), and that one-hot is risky for very high-dimension data (you could end up with a huge number of columns), but I think the scikit-learn philosophy of good options with sensible defaults (multiple options, defaulting to ordinal) makes sense here. If not that, then do no encoding at all at this stage.

rhiever commented 8 years ago

Hmmm... I think you've convinced me, at the very least, that the encoder should be configurable.

So lines such as this one can be replaced with an arbitrary encoder (LabelEncoder, OneHotEncoder, etc. -- anything that follows the sklearn interface), with LabelEncoder as the default. It won't even be necessary to import any other libraries into datacleaner itself, because those encoders would be passed into the function and imported by the user.

It might even be nice to allow a list of encoders to be passed, but that may complicate things too much and step too far into the feature preprocessing stage.

wdm0006 commented 8 years ago

Sounds good!

I'll put together a pull request. One implementation detail that I'm not sure of is how to pass the number of output columns to the hashing encoder (if at all). All of the other encoders need no input parameters, but the hashing encoder takes that one.

Maybe just have one flag for encoder, with a hyphen and number for hashing, so:

for default:

datacleaner my_data.csv -o my_clean.data.csv -is , -os ,

for binary encoding:

datacleaner my_data.csv -o my_clean.data.csv -is , -os , -en BinaryEncoder

for hashing encoder with 32 output dims:

datacleaner my_data.csv -o my_clean.data.csv -is , -os , -en HashingEncoder-32

for hashing encoder with default params:

datacleaner my_data.csv -o my_clean.data.csv -is , -os , -en HashingEncoder

That seem alright?

rhiever commented 8 years ago

This might have to be a feature that's limited to the script version, because if we want to add CLI support, we'd have to parse out every possible encoder from the CLI. That will be way too complicated, add several dependencies, and bloat the code in the long run.

I was thinking the function would look something like, e.g.,

from sklearn.preprocessing import LabelEncoder

def autoclean(input_dataframe, drop_nans=False, copy=False, encoder=LabelEncoder):
    """Performs a series of automated data cleaning transformations on the provided data set
<snip>
        # Encode all strings with numerical equivalents
        if str(input_dataframe[column].values.dtype) == 'object':
            input_dataframe[column] = encoder().fit_transform(input_dataframe[column].values)
<snip>

That of course limits us to encoders that take no input parameters, but I think I'm okay with that.
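A hypothetical usage sketch against that signature (my_data.csv is illustrative, and passing LabelEncoder explicitly is equivalent to the default):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('my_data.csv')
clean_df = autoclean(df, encoder=LabelEncoder)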

rhiever commented 8 years ago

A workaround that the user could implement to pass an encoder with parameters could be to write a wrapper function for the encoder, e.g.,

def HashingEncoder_32():
    # Zero-argument factory: autoclean calls encoder() with no parameters
    return HashingEncoder(n_components=32)

We could document that for advanced users.
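An equivalent approach, assuming category_encoders' HashingEncoder and the zero-argument encoder() call sketched above, is functools.partial:

from functools import partial
from category_encoders import HashingEncoder

# HashingEncoder_32() now returns HashingEncoder(n_components=32)
HashingEncoder_32 = partial(HashingEncoder, n_components=32)
clean_df = autoclean(df, encoder=HashingEncoder_32)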

wdm0006 commented 8 years ago

That may be. With the exception of the parameter for hashing, this actually implements all of the encoders in just a few lines:

https://github.com/wdm0006/datacleaner/blob/master/datacleaner/datacleaner.py#L87

I think I could parse out the parameter ahead of time without too much hassle, too.
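A rough sketch of that parsing (the function name is hypothetical, not the merged code):

def parse_encoder_flag(flag):
    """Split e.g. 'HashingEncoder-32' into ('HashingEncoder', 32)."""
    name, _, param = flag.partition('-')
    return name, int(param) if param else None

parse_encoder_flag('HashingEncoder-32')  # ('HashingEncoder', 32)
parse_encoder_flag('BinaryEncoder')      # ('BinaryEncoder', None)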

rhiever commented 8 years ago

The issue I have with that implementation is that it adds another dependency. I really want to minimize dependencies wherever possible.

rhiever commented 8 years ago

Alrighty, it's merged! Thank you for coding that up -- I think it will add some useful flexibility to datacleaner.

Please ping me if you have any thoughts on how to support that functionality on the command line.