Closed: erp12 closed this issue 7 years ago
we should think more about how each sample contributes to roc_auc and see if we can write a "vectorized" function for lexicase.
I don't have any ideas yet for the typical 1 error per sample, but one way to get a vector of 4 errors would be to report the entire confusion matrix.
Something like:
[1 - true_pos_rate, false_pos_rate, 1 - true_neg_rate, false_neg_rate]
I am not sure if that is a good idea... I have only seen lexicase used where there is 1 error per sample.
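For reference, a minimal sketch of that 4-element error vector. The helper name `confusion_errors` is hypothetical (not part of FEW), and this assumes binary 0/1 labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def confusion_errors(y_true, y_pred):
    """Hypothetical sketch: 4-element error vector from the confusion matrix.

    Returns [1 - TPR, FPR, 1 - TNR, FNR], all to be minimized.
    Assumes binary labels in {0, 1}.
    """
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    tpr = tp / (tp + fn)  # true positive rate (sensitivity)
    tnr = tn / (tn + fp)  # true negative rate (specificity)
    fpr = fp / (fp + tn)  # false positive rate
    fnr = fn / (fn + tp)  # false negative rate
    return [1 - tpr, fpr, 1 - tnr, fnr]
```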
i don't think 4 errors would work, since that is not enough cases for lexicase to perform well. however, with roc_auc you have an area calculation over a set of values from the ROC curve. so could you take the raw ROC values and interpret them as a set of fitness values to be maximized?
I am not sure what you mean by raw roc values. Would that just be the true positive rate at a particular value for false positive rate? Would that necessarily be 1 per sample?
yes, i guess it would be the true positive rate as the threshold increases.
there is no hard requirement in lexicase that there be 1 case per sample. That's just how it's normally mapped. The important thing is, roughly, that there are many cases (more than, say, 15). I'm thinking something like
```python
from sklearn.metrics import roc_curve

def roc_fit(y_true, y_pred):
    fpr, tpr, _ = roc_curve(y_true, y_pred)
    return 1 - tpr
```
could work for your purposes as a 'vectorized' fitness function.
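To make that concrete, here is what `roc_fit` returns on the toy scores from the sklearn `roc_curve` docs; the data is just for illustration:

```python
import numpy as np
from sklearn.metrics import roc_curve


def roc_fit(y_true, y_pred):
    # one error value per ROC threshold, to be minimized
    fpr, tpr, _ = roc_curve(y_true, y_pred)
    return 1 - tpr


y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
errors = roc_fit(y_true, scores)
# errors is nonincreasing: it starts at 1 (tpr = 0 at the highest
# threshold) and ends at 0 (tpr = 1 at the lowest threshold)
```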
In your `roc_fit` function, are `y_true` and `y_pred` entire arrays of labels and predictions, or just a single label and a single sample?
entire arrays. y_pred is the feature output. i should clarify that i think this would make lexicase selection work but i'm not sure it's the best way to formulate the problem. i'm also unclear on how ROC works when you have an arbitrarily scaled floating point vector for y_pred, which could be the case with a program's output in FEW.
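On the scaling question: as far as I understand, `roc_curve` only uses the ranking of the scores, so an arbitrarily scaled floating point `y_pred` should give the same curve as any monotonic rescaling of it. A quick sanity check (toy data, not from FEW):

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 1, 0, 1, 1])
scores = np.array([-3.2, 10.5, 0.0, 7.1, 42.0])  # arbitrary scale

fpr, tpr, _ = roc_curve(y_true, scores)
# a monotonic rescaling preserves the ranking, so the ROC curve
# should be identical
fpr2, tpr2, _ = roc_curve(y_true, scores / 100 + 5)
```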
When using a logistic regression classifier, `y_pred` should never be outside the 0 to 1 range, right?
Perhaps I am not understanding how the programs in the population are evaluated as transformations.
I assumed that the model set as the `ml` param was trained and its (cross-validated?) performance using the `fit_choice` metric is considered the fitness of the transformations. Something like that?
oh! no. the feature transformations each get their own fitness to determine which survive. This is a separate step from evaluating the performance of the ML method with which FEW is paired. Currently, that scoring function is specific to the ML.
So is the `fit_choice` basically used to get an error for each transformation by predicting the output based only on that single transformation?
yes. check out the GECCO paper where we define & compare different test metrics.
i looked into the roc_curve metric in sklearn a bit more, and it seems like you need an estimator with a decision function to get a reasonable result. Is that right?
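For what it's worth, `roc_curve` just needs continuous scores rather than hard 0/1 labels, so you'd typically pass the output of `decision_function` or `predict_proba`. A sketch with a plain sklearn classifier; the dataset and estimator here are only for illustration, not FEW's setup:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

# toy data, not from FEW
X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression().fit(X, y)

# continuous scores: either of these works; hard predict() labels
# would collapse the curve to a single operating point
scores = clf.predict_proba(X)[:, 1]  # or clf.decision_function(X)
fpr, tpr, _ = roc_curve(y, scores)
auc = roc_auc_score(y, scores)
```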
Tested on a single sample dataset and it seems to work well.
Currently, it is not compatible with any lexicase selection variants because there is no function that returns a vector of roc_auc values. I am not sure what such a function would look like, because the roc_auc of a single prediction is undefined.