tidymodels / recipes

Pipeable steps for feature engineering and data preprocessing to prepare for modeling
https://recipes.tidymodels.org

Feature hashing step #189

Closed EmilHvitfeldt closed 6 years ago

EmilHvitfeldt commented 6 years ago

Do you think recipes would benefit from having a feature hashing step in it?

topepo commented 6 years ago

Yes, it's been on the list for a while but nothing has happened yet. I started a project page for this.

Once the code to produce the hash is there, it would be pretty easy to do. You would like to give it a go?

EmilHvitfeldt commented 6 years ago

I don't mind giving it a go!

Determine a proper hashing method

Looking around, I see that both scikit-learn and the FeatureHashing package use the signed MurmurHash3 hashing function, so it seems like an appropriate method. It is, furthermore, already implemented in FeatureHashing, which is nice (I assume we can add a Suggests dependency, right?).

Decide whether to save as a new factor variable or to make indicators. The latter would probably be faster but inconsistent with step_dummy. If signed hashes are used, it would have to go straight to dummy variables.

I'm a little unsure of what you want with this sentence. What would be the ideal outcome? If we are using signed hashes, then how would we do dummy variables? Am I wrong in assuming that dummy variables need to have the values 0 or 1?

Lastly, there is a design question I would like to run past you; this is all assuming I'm wrong in assuming dummy variables are limited to 0 and 1. If we have a variable with a value like

"half a day. It was a melancholy change; and Emma could not but sigh over" (100th line of janeaustenr::emma)

would we want the hash to be (by hashing the full line together)

0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 (hashing size low for demonstration purposes)

or the hash to be (by tokenizing into words, hashing each word, and adding the hashes together)

0 -1 0 0 0 -1 1 0 0 -2 2 0 0 0 1 3

Personally I think that both kinds should be available. For longer items like texts, emails, and chapters/books, the second version would be preferable, as it captures the number of times the different words appear. However, in the case of the diet variable in the okc dataset, the first style might be preferable. What are your thoughts? If we are going to implement the second kind, maybe some work in Text Processing Recipes should be done first.
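To make the two options concrete, here is a minimal sketch of both styles. This is illustration only: it is in Python and uses MD5 in place of the signed MurmurHash3 discussed above, and the `signed_hash`, `hash_whole`, and `hash_tokens` helpers are hypothetical names, but the bucketing logic is the same idea.

```python
import hashlib

def signed_hash(token, n_buckets=16):
    """Map a token to a (bucket, sign) pair.

    MD5 stands in for the signed MurmurHash3 a real implementation
    would use; the digest picks both the bucket and the sign.
    """
    digest = int(hashlib.md5(token.encode()).hexdigest(), 16)
    bucket = digest % n_buckets
    sign = 1 if (digest >> 8) % 2 == 0 else -1
    return bucket, sign

def hash_whole(text, n_buckets=16):
    # Style 1: hash the full line as a single value -> one nonzero entry.
    vec = [0] * n_buckets
    bucket, sign = signed_hash(text, n_buckets)
    vec[bucket] = sign
    return vec

def hash_tokens(text, n_buckets=16):
    # Style 2: tokenize, hash each word, and sum the signed contributions,
    # so repeated words accumulate in their bucket.
    vec = [0] * n_buckets
    for word in text.split():
        bucket, sign = signed_hash(word, n_buckets)
        vec[bucket] += sign
    return vec

line = ("half a day. It was a melancholy change; and Emma "
        "could not but sigh over")
print(hash_whole(line))
print(hash_tokens(line))
```

The first style produces exactly one ±1 entry per observation; the second spreads counts across buckets, which is why it suits longer texts.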

topepo commented 6 years ago

I assume we can add a Suggests dependency, right?

We could but I'd like to avoid a dependency (at least one that does the same thing as what we are doing). Also, the function that does the actual hashing via Rcpp, FeatureHashing:::.hashed.model.matrix.dataframe, isn't exported so we couldn't get to that directly. Could any of the other packages related to hashing work? This isn't my forte.

Decide whether to save as a new factor variable or to make indicators. The latter would probably be faster but inconsistent with step_dummy. If signed hashes are used, it would have to go straight to dummy variables.

I'm a little unsure of what you want with this sentence. What would be the ideal outcome? If we are using signed hashes, then how would we do dummy variables? Am I wrong in assuming that dummy variables need to have the values 0 or 1?

Unsigned hashes just map the original values to a fixed number of new groups. It would be more efficient to store these as a factor with new levels (which would be consistent with existing operations like step_other). Then the user can use step_dummy to get the numeric encodings...

... unless they want signed hashes which, as you point out, couldn't be encoded as dummy variables.
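As a sketch of the unsigned case (a hypothetical Python illustration, using MD5 rather than the MurmurHash3 a real implementation would use): each original level maps deterministically to one of a fixed number of buckets, so the result can be stored as a factor with at most `n_buckets` levels, and step_dummy can turn those into 0/1 indicators afterwards.

```python
import hashlib

def hash_level(level, n_buckets=8):
    # Unsigned hashing trick: deterministically map an original factor
    # level to one of n_buckets new groups. (MD5 stands in for
    # MurmurHash3 for the sake of a self-contained example.)
    return int(hashlib.md5(level.encode()).hexdigest(), 16) % n_buckets

# A few levels like the okc "diet" variable mentioned earlier.
diets = ["vegan", "vegetarian", "anything", "kosher", "vegan"]
hashed = [f"hash_{hash_level(d)}" for d in diets]
# Identical inputs always land in the same bucket, so this behaves
# like re-leveling a factor (consistent with step_other); a later
# dummy step could then produce binary 0/1 indicators from the
# new levels.
```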

this is all assuming I'm wrong in assuming dummy variables are limited to 0 and 1.

My thinking is that they are binary. There are other encoding methods that produce "dense" representations (like those in embed and for ordinal factor variables) but I wouldn't call those dummy variables or indicator variables.

Personally I think that both kinds should be available.

I agree but I'd like to defer word operations to the text-recipe-package-to-be. However, the functions in recipes should have the groundwork to do the parts of word processing that they can handle (i.e. let's not implement the same thing twice).

Also, tensorflow has some tokenizers too. Would this overlap with those?

EmilHvitfeldt commented 6 years ago

Would you be opposed to me adding the hashing function via Rcpp?

topepo commented 6 years ago

Yes, that was my thought too.

EmilHvitfeldt commented 6 years ago

I'll see what I can do.

It would be more efficient to store these as a factor with new levels (which would be consistent with existing operations like step_other). Then the user can use step_dummy to get the numeric encodings...

good idea!

My thinking is that they are binary

I'm fine with that convention.

I agree but I'd like to defer word operations to the text-recipe-package-to-be. However, the functions in recipes should have the groundwork to do the parts of word processing that they can handle (i.e. let's not implement the same thing twice).

Is this saying that there are plans for a separate text-recipe package that handles more text related tasks? Or that recipes should leverage the power of text packages instead of reimplementing?

Also, tensorflow has some tokenizers too. Would this overlap with those?

I don't do too much Tensorflow, but I am aware that it includes tokenizers, one-hot encoding, and the hashing trick for text preprocessing.

topepo commented 6 years ago

Is this saying that there are plans for a separate text-recipe package that handles more text related tasks? Or that recipes should leverage the power of text packages instead of reimplementing?

I was planning on a package that wraps the best of the existing text tools by containing recipe step functions for those tasks. It would also isolate those package dependencies (similar to the embed package).

EmilHvitfeldt commented 6 years ago

I was planning on a package that wraps the best of the existing text tools by containing recipe step functions for those tasks. It would also isolate those package dependencies (similar to the embed package).

Sign me up! I have spent the last couple of months all but creating a package to make text classification easier.

I'll just let you know that I found the source code for MurmurHash and I'm working on getting it to play well with R (I have very limited experience with Rcpp).

topepo commented 6 years ago

very limited experience with Rcpp

Me too but let me know if you run into any issues.

Sign me up! I have spent the last couple of months all but creating a package to make text classification easier.

I'll start a new issue for that. I'll close this one but reopen if there are issues or questions. Thanks!

github-actions[bot] commented 3 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.