lacava / few

a feature engineering wrapper for sklearn
https://lacava.github.io/few
GNU General Public License v3.0

Added roc_auc as a fit_choice #5

Closed erp12 closed 7 years ago

erp12 commented 7 years ago

Tested on a single sample dataset and it seems to work well.

Currently, it is not compatible with any of the lexicase selection variants, because there is no function that returns a vector of ROC AUC values. I am not sure what such a function would look like, since it is impossible to compute the ROC AUC of a single prediction.
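(A quick illustration of that last point, using only sklearn: roc_auc_score refuses a single observation, because only one class can be present in y_true.)

from sklearn.metrics import roc_auc_score

# ROC AUC is undefined when y_true contains only one class,
# which is always the case for a single prediction
try:
    roc_auc_score([1], [0.9])
except ValueError as err:
    print(err)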

coveralls commented 7 years ago

Coverage increased (+0.3%) to 77.498% when pulling 96f0ca56f504cc7c469de9db0c3573aa5ff30ef1 on massmutual:roc_auc into c4c50d0cc837f38f163a59618f450c3424156670 on lacava:master.

lacava commented 7 years ago

we should think more about how each sample contributes to roc_auc and see if we can write a "vectorized" function for lexicase.

erp12 commented 7 years ago

I don't have any ideas yet for the typical 1 error per sample, but one way to get a vector of 4 errors would be to report the entire confusion matrix.

Something like: [1 - true_pos_rate, false_pos_rate, 1 - true_neg_rate, false_neg_rate]

I am not sure if that is a good idea... I have only seen lexicase used where there is 1 error per sample.
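A minimal sketch of that four-element error vector, using sklearn's confusion_matrix (the name confusion_errors is hypothetical, not part of FEW):

import numpy as np
from sklearn.metrics import confusion_matrix

def confusion_errors(y_true, y_pred):
    # binary confusion counts: tn, fp, fn, tp
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    # the proposed errors: [1 - TPR, FPR, 1 - TNR, FNR]
    return np.array([1 - tp / (tp + fn),
                     fp / (fp + tn),
                     1 - tn / (tn + fp),
                     fn / (fn + tp)])

Note that 1 - TPR equals FNR and 1 - TNR equals FPR, so this vector really carries only two independent errors, which reinforces the concern below about having too few cases.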

lacava commented 7 years ago

i don't think 4 errors would work, since that is not enough cases for lexicase to perform well. however, with roc_auc you have an area calculation over a set of values from the ROC curve. so could you take the raw ROC values and interpret them as a set of fitness values to be maximized?

erp12 commented 7 years ago

I am not sure what you mean by raw ROC values. Would that just be the true positive rate at a particular false positive rate? Would that necessarily be 1 per sample?

lacava commented 7 years ago

yes, i guess it would be the true positive rate as the threshold increases.

there is no hard requirement in lexicase that there be 1 case per sample. That's just normally how it's mapped. the important thing is, roughly, that there are many cases (more than, say, 15). I'm thinking something like

from sklearn.metrics import roc_curve

def roc_fit(y_true, y_pred):
    # one (fpr, tpr) point per decision threshold in y_pred
    fpr, tpr, _ = roc_curve(y_true, y_pred)
    return 1 - tpr  # vector of per-threshold errors to minimize

could work for your purposes as a 'vectorized' fitness function.
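A quick usage sketch (data made up, assuming the roc_fit above), showing that the number of cases tracks the number of ROC thresholds rather than the number of samples:

import numpy as np

y_true = np.array([0, 0, 1, 1, 0, 1])
y_pred = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])  # raw program output
errors = roc_fit(y_true, y_pred)
print(len(errors))  # one error per threshold, not per sample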

erp12 commented 7 years ago

In your roc_fit function, are y_true and y_pred entire arrays of labels and predictions, or just a single label and a single sample?

lacava commented 7 years ago

entire arrays. y_pred is the feature output. i should clarify that i think this would make lexicase selection work but i'm not sure it's the best way to formulate the problem. i'm also unclear on how ROC works when you have an arbitrarily scaled floating point vector for y_pred, which could be the case with a program's output in FEW.
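One reassuring detail on the scaling question: sklearn's roc_curve and roc_auc_score use only the ordering of the scores, so an arbitrarily scaled y_pred gives the same curve as any order-preserving rescaling of it. A small check (data made up):

import numpy as np
from sklearn.metrics import roc_auc_score

y = np.array([0, 1, 1, 0, 1])
scores = np.array([-3.2, 0.7, 5.1, -0.4, 2.2])  # arbitrary scale
# an order-preserving rescaling yields an identical AUC
assert roc_auc_score(y, scores) == roc_auc_score(y, 100 * scores + 7)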

erp12 commented 7 years ago

When using a logistic regression classifier, y_pred should never be outside the 0 to 1 range, right? Perhaps I am not understanding how the programs in the population are evaluated as transformations.

I assumed that the model set as the ml param was trained, and its (cross-validated?) performance using the fit_choice metric was taken as the fitness of the transformations. Something like that?

lacava commented 7 years ago

oh! no. the feature transformations each get their own fitness to determine which survive. This is a separate step from evaluating the performance of the ML method with which FEW is paired. Currently, that scoring function is specific to the ML.

erp12 commented 7 years ago

So is the fit_choice basically used to get an error for each transformation by predicting the output based on only that single transformation?

lacava commented 7 years ago

yes. check out the gecco paper where we define & compare different test metrics.

i looked into the roc_curve metric in sklearn a bit more, and it seems like you need an estimator with a decision function to get a reasonable result. is that right?
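For reference, roc_curve and roc_auc_score accept any real-valued scores, not only the output of a fitted estimator's decision_function, so a transformation's raw feature values can be scored directly. A sketch under that assumption (names and data are illustrative, not FEW internals):

import numpy as np
from sklearn.metrics import roc_auc_score

y = np.array([0, 0, 1, 1, 0])
phi = np.array([0.3, -1.2, 2.4, 0.9, -0.5])  # one program's output per sample
# the feature's own values serve as decision scores, so no
# separate estimator with a decision function is required
fitness = roc_auc_score(y, phi)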