topepo / FES

Code and Resources for "Feature Engineering and Selection: A Practical Approach for Predictive Models" by Kuhn and Johnson
https://bookdown.org/max/FES
GNU General Public License v2.0

A bit of feedback #4

Open ifellows opened 6 years ago

ifellows commented 6 years ago

Great idea for a book. So much of an analyst's time is spent here, and no real resources are available treating it in a comprehensive manner.

I did a quick skim over your work so far. Great job! Here are a few thoughts:

  1. I think that some time should be spent on engineering features with a specific outcome in mind. For example, you cover PCA, but not LDA.

  2. There are a lot more 1:1 transformations that I imagine your readers would want to know about. See: http://www.deducer.org/pmwiki/pmwiki.php?n=Main.TransformVariables

  3. I'm not a big fan of hashing categorical variables, though it definitely should be covered. Another option when you have a specific outcome (and a prediction algorithm that needs dimension reduction) is to fit a simple hierarchical model e.g.:

$$ y_{ij} = \theta_j X_{ij} + \epsilon_{ij} $$ where $$ \theta_j \sim N(0, \sigma). $$ Here $X$ is the dummy-coded categorical variable. Then use $\theta_j$ as the encoded value for $X$. This shrinks rare categories toward 0 so they don't have a lot of impact, and puts the rest on the right scale for a regression. Often I find that interactions with this extracted feature are important. (A minimal sketch of this appears right after this list.)

On a related note, you should mention the option of including dummy codes for the c most common categories and an 'other' bucket for the rest.

  4. Feature engineering is perhaps most important in dealing with text, where the raw data is rarely if ever analyzed directly. This should probably be a whole chapter on its own, dealing with bag-of-words, co-occurrence, sentiment analysis, word vectors, stemming, tokenizing, etc.

  5. What about feature extraction from network data? e.g. number of friends, shared partners, etc.

  6. You sort of assume that you've already got your data in a flat file. How about dealing with engineering out of a structured DB (e.g., SQL)? If you have an outcome in one table, how do you extract relevant features from other tables when there is a one-to-one, one-to-many, many-to-one, or many-to-many relationship?

  7. What about image data? e.g. using the (intermediate) output of a general-purpose NN model (ImageNet), Gabor filters, etc.

  8. Shouldn't there be a discussion of what transformations are needed for what types of models? e.g. Don't code interactions for an RF. Do scale your predictors if your fitting algorithm is gradient descent.

  9. How should outliers be handled, both univariate and multivariate? And what about the choice between row exclusion and column transformation?
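
To make point 3 concrete, here is a minimal sketch of that encoding using lme4; the data frame `dat` and columns `y` and `cat` are hypothetical stand-ins, and the per-category random intercepts play the role of $\theta_j$:

```r
# Minimal sketch of the shrinkage encoding in point 3; `dat`, `y`, and
# `cat` are hypothetical names, with toy data generated here.
library(lme4)

set.seed(1)
dat <- data.frame(cat = factor(sample(letters[1:15], 500, replace = TRUE)))
dat$y <- rnorm(500, mean = as.integer(dat$cat) / 5)

# Random intercept per category: theta_j ~ N(0, sigma)
fit <- lmer(y ~ 1 + (1 | cat), data = dat)

# Shrunken per-category effects (BLUPs); rare categories are pulled toward 0
re <- ranef(fit)$cat
encoding <- data.frame(cat = rownames(re), theta = re[, "(Intercept)"])

# Use the encoded value in place of the raw factor
dat$cat_encoded <- encoding$theta[match(dat$cat, encoding$cat)]
```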

topepo commented 6 years ago

Thanks for the feedback. This is very helpful. Correct me if I've misinterpreted your comments.

> I think that some time should be spent on engineering features with a specific outcome in mind. For example, you cover PCA, but not LDA.

We do talk about PLS but should mention that there are myriad other ways, including LDA, to do supervised dimension reduction.

> There are a lot more 1:1 transformations that I imagine your readers would want to know about. See: http://www.deducer.org/pmwiki/pmwiki.php?n=Main.TransformVariables

Thanks for the link. I should add something about using quantiles too.
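
For example, a percentile-rank version via the empirical CDF might look like this (the vector `x` here is purely illustrative, not from the book):

```r
# Percentile-rank (quantile) transformation of a skewed predictor.
# `x` is an illustrative vector, not from the book.
set.seed(1)
x <- rexp(100, rate = 0.5)

# Map each value to its empirical cumulative probability in (0, 1]
x_quantile <- ecdf(x)(x)

# For new samples, reuse the ECDF estimated on the training data
train_ecdf <- ecdf(x)
train_ecdf(c(0.1, 2, 10))
```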

> I'm not a big fan of hashing categorical variables, though it definitely should be covered.

Me neither. See section 5.2. This is dying for a good statistical solution.

> Another option when you have a specific outcome (and a prediction algorithm that needs dimension reduction) is to fit a simple hierarchical model e.g....

That's good. I'm on the fence about including it since it brings a lot of other machinery with it (related to the estimation procedure). Is there a reference for this?

> On a related note, you should mention the option of including dummy codes for the c most common categories and an 'other' bucket for the rest.

Agreed.
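
Something like this base R sketch (with a hypothetical factor `x` and cutoff `c_keep`; `forcats::fct_lump()` is a packaged alternative):

```r
# Keep the c most common categories and lump the rest into "other".
# `x` and `c_keep` are hypothetical; toy data generated for illustration.
set.seed(1)
x <- factor(sample(letters, 500, replace = TRUE, prob = 26:1))

c_keep <- 5
top_levels <- names(sort(table(x), decreasing = TRUE))[seq_len(c_keep)]
x_lumped <- factor(ifelse(x %in% top_levels, as.character(x), "other"))

# Dummy code as usual; "other" gets its own column
head(model.matrix(~ x_lumped))
```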

> This should probably be a whole chapter on its own, dealing with bag-of-words, co-occurrence, sentiment analysis, word vectors, stemming, tokenizing, etc.

> What about image data? e.g. using the (intermediate) output of a general-purpose NN model (ImageNet), Gabor filters, etc.

There is some text-related stuff in section 5.5, but that is a whole book unto itself.

We don't do a lot with image and text here because that is what 95% of the references with "feature engineering" in their titles talk about. I'd rather spend time on topics with less coverage.

> What about feature extraction from network data? e.g. number of friends, shared partners, etc.

Good idea but I've never done any of that so I wouldn't be qualified to talk about it.

> You sort of assume that you've already got your data in a flat file. How about dealing with engineering out of a structured DB (e.g., SQL)? If you have an outcome in one table, how do you extract relevant features from other tables when there is a one-to-one, one-to-many, many-to-one, or many-to-many relationship?

There will be some of that in chapter 8 (Flattening Profile Data) with some examples.

> Shouldn't there be a discussion of what transformations are needed for what types of models? e.g. Don't code interactions for an RF. Do scale your predictors if your fitting algorithm is gradient descent.

Yes. There will be something like Appendix A in Applied Predictive Modeling. I'm not sure where it goes just yet, though.

> How should outliers be handled, both univariate and multivariate? And what about the choice between row exclusion and column transformation?

There is some of that in section 6.3.3, but it could use some expansion.
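
Until then, here is a rough base R sketch of both flavors, assuming a hypothetical all-numeric data frame `dat`:

```r
# Univariate and multivariate outlier flags on a hypothetical numeric
# data frame `dat` (toy data generated here for illustration).
set.seed(1)
dat <- data.frame(x1 = rnorm(200), x2 = rnorm(200))

# Univariate: Tukey's boxplot rule, flag values beyond 1.5 * IQR
flag_univariate <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
}
uni_flags <- sapply(dat, flag_univariate)

# Multivariate: Mahalanobis distance against a chi-squared cutoff
d2 <- mahalanobis(dat, center = colMeans(dat), cov = cov(dat))
multi_flags <- d2 > qchisq(0.975, df = ncol(dat))
```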

ifellows commented 6 years ago

Great. Again, I think this is a really good idea for a book. I'm always telling people that they should focus less on what predictive algorithm they are using and more on getting the most informative features out of their raw systems. This is where the big gains are.

Regarding 3... you know, I don't have a reference, but it can't be novel. The machinery is pretty simple: just do a ridge regression (or lasso to put more sparsity in there) and save the linear predictors. This book could become the reference.
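
To spell that machinery out, a minimal glmnet sketch (hypothetical names again: data frame `dat`, outcome `y`, factor `cat`):

```r
# Ridge version of the encoding: dummy code the factor, fit a penalized
# regression, and save the shrunken per-level coefficients. `dat`, `y`,
# and `cat` are hypothetical; toy data generated for illustration.
library(glmnet)

set.seed(1)
dat <- data.frame(cat = factor(sample(letters[1:15], 500, replace = TRUE)))
dat$y <- rnorm(500, mean = as.integer(dat$cat) / 5)

X <- model.matrix(~ cat - 1, data = dat)   # one dummy column per level
fit <- cv.glmnet(X, dat$y, alpha = 0)      # alpha = 0 is ridge; 1 is lasso

# Per-level effects at the cross-validated penalty (drop the intercept)
theta <- as.numeric(coef(fit, s = "lambda.min"))[-1]

# The linear predictor over the dummy columns is the encoded feature
dat$cat_encoded <- as.numeric(X %*% theta)
```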

rwarnung commented 6 years ago

Hi, I am really looking forward to this book, as I have been working with feature engineering for a while now and would really appreciate having one place to look for references and learn about techniques.

I recently stumbled upon the LOL package (lolR: Linear Optimal Low-Rank Projection). It is a kind of meta-package for linear (supervised and unsupervised) dimensionality-reduction methods. It has PCA, PLS, LDA, and a few other, perhaps less well-known, methods.

Do you plan to include t-SNE and LargeVis too?

Thank you for your work on this and all the other resources I use (the predictive modeling book, caret, rsample, ...)!

ifellows commented 6 years ago

It looks like 3 is a case of likelihood encoding (see #8). Adding shrinkage is pretty important for rare categories, and putting it on a linear predictor scale is useful for non-tree algorithms.

topepo commented 6 years ago

@rwarnung I've seen that package and it looks pretty nice. I've been using a similar one in recipes called dimRed.

Also, I'm avoiding t-SNE because it cannot be used to project onto new samples (but tell me if I'm wrong about that).

alexpghayes commented 6 years ago

I believe there's a version of t-SNE that can handle new samples called parametric t-SNE (https://lvdmaaten.github.io/publications/papers/AISTATS_2009.pdf), but I'm not sure how strong the relationship is to normal t-SNE.

rodfloripa commented 5 years ago

I have already read chapters 1 and 2. Great book, you are doing an amazing job! I would like to know if you have a forecast for the project completion date. Thanks!!

topepo commented 5 years ago

We are a few days away from finishing all of the new writing. I hope to post the remaining chapters by Feb 1 and have it go into production in April (this is the same time that the final HTML version 1.0 will be released).

rkb965 commented 5 years ago

> Shouldn't there be a discussion of what transformations are needed for what types of models? e.g. Don't code interactions for an RF. Do scale your predictors if your fitting algorithm is gradient descent.
>
> Yes. There will be something like Appendix A in Applied Predictive Modeling. I'm not sure where it goes just yet, though.

Thank you so much for this incredible resource. At a first glance through the book, I didn't see the above-referenced discussion/table -- is it present somewhere?

topepo commented 5 years ago

Not yet. I'll put it in a markdown file.

rkb965 commented 5 years ago

Amazing, thank you. That would be helpful for my learning style and a good reference while reading through more systematically. As someone new to these techniques, I don't yet have the sophistication to understand which engineering techniques are applicable (very useful / potentially useful / totally unnecessary) to which predictive modeling algorithms.

Looking forward to working my way through this. Thanks again!

ifellows commented 5 years ago

@topepo, just saw some activity on this thread and looked over the book again. Truly fantastic job, congratulations! This should be an essential read for anyone looking to become a data scientist.