rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[FEA] Genetic Programming for Feature Engineering #2121

Open aerdem4 opened 4 years ago

aerdem4 commented 4 years ago

Is your feature request related to a problem? Please describe. Genetic Programming is very useful for feature engineering, but its main challenge is time complexity. Luckily, GP is easily parallelizable, so I believe it is a good fit for cuML.

Example: Let's assume you have two columns A and B, and a binary target. The target is 1 most of the time when A > B. This is very difficult to learn with a tree-based model, but GP can engineer the feature for you.
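To make this concrete, here is a hypothetical illustration (the data and helper are made up, not cuML code): a target that is 1 when A > B is perfectly separable by a single threshold on the engineered feature A - B, while the best single threshold on A alone tops out around 75% accuracy.

```python
# Illustrative only: why a tree split on raw columns struggles with A > B.
import random

random.seed(0)
rows = [(random.random(), random.random()) for _ in range(1000)]
y = [1 if a > b else 0 for a, b in rows]

def best_threshold_accuracy(values, labels):
    """Accuracy of the best single split `value > t`, trying every observed value as t."""
    n = len(labels)
    best = 0.0
    for t in values:
        acc = sum((v > t) == bool(lab) for v, lab in zip(values, labels)) / n
        best = max(best, acc, 1.0 - acc)  # also try the inverted rule
    return best

acc_a = best_threshold_accuracy([a for a, _ in rows], y)         # roughly 0.75
acc_diff = best_threshold_accuracy([a - b for a, b in rows], y)  # exactly 1.0
```

A single tree split is one such threshold rule, so it needs many splits to approximate the diagonal boundary that the feature A - B captures in one shot.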

Describe the solution you'd like I would like to have the functionality of gplearn accelerated on GPU. (https://gplearn.readthedocs.io/en/stable/)

teju85 commented 4 years ago

@aerdem4 so, are you only looking for a gpu-accelerated SymbolicTransformer?

aerdem4 commented 4 years ago

@teju85 I think they are all the same except for the metric. Multiple options for the metric would be nice, but Spearman is the most useful.

JohnZed commented 4 years ago

Alright, whose idea of a joke was it to tag this with Good First Issue? I'm looking at you @wxbn ! ;)

teju85 commented 3 years ago

@aerdem4 we are going to have an intern provide us with an initial implementation of this in cuML! For starters, can we assume max program AST depth of 10 or so? Or do you think that's too low to begin with? In practice, what's the deepest program you've come across?

aerdem4 commented 3 years ago

@teju85 thanks for the good news! I think 10 is enough for AST depth. Generated features don't need to be very complex, but they should capture the interactions the model can't. If the intern needs any help, I would be happy to be involved, btw.
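For concreteness, a depth cap like this can be sketched as a recursive check over a tuple-encoded program AST (the encoding and helper names here are hypothetical, not cuML's representation):

```python
# Hypothetical sketch: programs as nested tuples (op, child, ...), terminals as strings.
def depth(node):
    """Depth of a program AST; a lone terminal (feature or constant) counts as 1."""
    if not isinstance(node, tuple):
        return 1
    return 1 + max(depth(child) for child in node[1:])

program = ("mul", ("sub", "A", "B"), "C")  # (A - B) * C, depth 3

def within_limit(node, max_depth=10):
    """True if the program respects the proposed depth-10 cap."""
    return depth(node) <= max_depth
```

A depth-10 cap bounds binary-operator programs at roughly 2^10 nodes, which also bounds per-program evaluation cost on the GPU.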

teju85 commented 3 years ago

tagging @vimarsh6739 who'll be implementing this.

aerdem4 commented 3 years ago

A simple Kaggle test case: https://www.kaggle.com/c/loan-default-prediction. This dataset has 800 features. People claim that without extracting the feature f527 - f528, GBM performs poorly in this old competition. There may be more complex magic features too.

I can also create artificial datasets to test whether GP can reverse-engineer the features that contribute to the target.
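One hypothetical recipe for such a dataset (column layout and sizes are made up): hide a simple symbolic relationship, here a difference of two columns in the spirit of f527 - f528, among many noise columns, and check whether the evolved feature correlates with the hidden expression.

```python
# Illustrative test-data generator: target depends only on col0 - col1.
import random

random.seed(42)
n_rows, n_noise = 1000, 20
data = [[random.gauss(0, 1) for _ in range(n_noise + 2)] for _ in range(n_rows)]
# Hidden relationship: the target is 1 exactly when col0 - col1 > 0.
y = [1 if row[0] - row[1] > 0 else 0 for row in data]
# A GP run "passes" if its best program correlates (near) perfectly with
# row[0] - row[1]; no individual column should correlate that strongly.
```

The same template generalizes to deeper hidden expressions (ratios, three-way interactions) to probe how program depth affects recovery.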

github-actions[bot] commented 3 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.