jlko / active-testing

Active and Sample-Efficient Model Evaluation

Can this method evaluate precision or recall? #1

Open boyuzz opened 3 years ago

boyuzz commented 3 years ago

This is great work! When I use it on my own dataset, it performs well when evaluating accuracy with accuracy_loss. However, I'm wondering whether it can also evaluate precision or recall without bias, since in class-imbalanced settings precision or recall is often the more meaningful metric.

What I've done to evaluate precision is to first update dataset.test_idx and dataset.test_remaining so that they only refer to the samples the model predicts as positive. I then use accuracy_loss to evaluate precision, since accuracy and precision are identical when all test samples are predicted as positive by the model.
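In simplified form, the restriction looks roughly like this (the function below is only illustrative, not the actual code I run):

```python
import numpy as np

def restrict_to_predicted_positives(test_idx, pred_labels, positive_class=1):
    """Keep only test-pool indices that the main model predicts as positive,
    so that accuracy on the restricted pool coincides with precision."""
    mask = pred_labels[test_idx] == positive_class
    return test_idx[mask]

# Toy example with dummy predictions over a 10-point test pool.
test_idx = np.arange(10)
pred_labels = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0])
print(restrict_to_predicted_positives(test_idx, pred_labels))  # [1 2 4 7 8]
```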

But no matter which dataset I use, the variance of active testing is always close to that of i.i.d. sampling, even though the bias is still zero. I hope you can give me some ideas about this. Thank you.

jlko commented 3 years ago

Dear Boyuzz,

I'm glad you're enjoying our work and that it works well in your setting.

Quick comment: We found that – even if accuracy is the desired target quantity – it might make sense to acquire for log likelihood (because it may be better behaved).

I have not thought about precision/recall previously, but it looks like an interesting question.

An easy thing you can do is construct a weighted loss. (If you apply larger weights to underrepresented classes, this could already help your problem.)
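As a rough sketch of what I mean (plain numpy, not code from this repo, and the interface is made up), a pointwise weighted 0-1 loss could look like this:

```python
import numpy as np

def weighted_accuracy_loss(pred_labels, true_labels, class_weights):
    """Pointwise weighted 0-1 loss: each error is scaled by the weight of its
    true class. Fixed (data-independent) weights keep the loss additive over
    the test points."""
    errors = (pred_labels != true_labels).astype(float)
    return class_weights[true_labels] * errors

# Example: up-weight the rare positive class (weights are illustrative).
losses = weighted_accuracy_loss(
    pred_labels=np.array([0, 1, 0, 1]),
    true_labels=np.array([0, 1, 1, 0]),
    class_weights=np.array([1.0, 5.0]),
)
print(losses)  # [0. 0. 5. 1.]
```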

More generally, active testing works for losses that are additive over the test points, where the quantity of interest is their expected value (the risk) over the test set; see https://arxiv.org/pdf/2103.05331.pdf. If you can write precision/recall as a loss applied individually to each datapoint, you can apply active testing, but I'm not sure this is possible. Even if it is not, you may be able to derive an importance sampling acquisition function by favouring points that contribute a lot to the overall precision/recall.
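To make the "additive over the test points" condition concrete (the precision expression is only written out here for comparison):

```latex
% The risk is a mean of per-point losses; this is what active testing estimates.
R = \frac{1}{N} \sum_{n=1}^{N} L\bigl(f(x_n), y_n\bigr)

% Precision, in contrast, is a ratio of two such sums, so it is not itself
% a mean of per-point loss terms.
\mathrm{precision}
  = \frac{\sum_{n} \mathbb{1}[\hat{y}_n = 1,\; y_n = 1]}
         {\sum_{n} \mathbb{1}[\hat{y}_n = 1]}
```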

You could also target the TP/FP/FN rates individually, as these are just mean values over all datapoints. (I.e. construct three separate estimators and combine them, although this is almost certainly suboptimal; see the sketch below.)
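A rough numpy sketch of that idea (nothing here is from this repo):

```python
import numpy as np

def tp_fp_fn_indicators(pred_labels, true_labels, positive_class=1):
    """Three pointwise quantities whose test-set means are the TP/FP/FN rates.
    Each is a per-datapoint value, so each mean fits the active-testing setup."""
    pred_pos = pred_labels == positive_class
    true_pos = true_labels == positive_class
    tp = (pred_pos & true_pos).astype(float)
    fp = (pred_pos & ~true_pos).astype(float)
    fn = (~pred_pos & true_pos).astype(float)
    return tp, fp, fn

# Combine the three (estimated) rates into precision and recall.
pred = np.array([1, 1, 0, 0, 1])
true = np.array([1, 0, 1, 0, 1])
tp, fp, fn = tp_fp_fn_indicators(pred, true)
tp_rate, fp_rate, fn_rate = tp.mean(), fp.mean(), fn.mean()
precision = tp_rate / (tp_rate + fp_rate)  # 2/3 here
recall = tp_rate / (tp_rate + fn_rate)     # 2/3 here
print(precision, recall)
```

Note that a ratio of unbiased estimates is itself no longer exactly unbiased, which is part of why I would expect this to be suboptimal.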

In short: I don't know and this will probably require some legwork.

Hope this helps! Jannik

boyuzz commented 3 years ago

Hi @jlko , thanks for your quick reply.

Over the last few weeks I have been trying your suggestions and found the following:

Looking forward to your feedback. Thanks!

jlko commented 3 years ago

Hi boyuzz,

Out of interest, may I ask what project you are using active testing for? You can respond to me in private if you don't feel like sharing that information publicly.

> Estimating the cross-entropy loss is quite sensitive to the case where both the model and the surrogate model predict incorrectly.

Try using an ensemble for the surrogate. If the data is very noisy, you cannot avoid performance dropping to about the level of random acquisition (there is simply not much to be gained by active acquisition). However, overconfident wrong predictions in the surrogate (such as you are describing) may be fixed by choosing the surrogate from a model class with better predictive uncertainty, such as ensembles. (Please also see the paper for further guidance; e.g. retraining the surrogate between acquisitions, if feasible, could also help here.)
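As a minimal sketch of what I mean by an ensembled surrogate (scikit-learn stand-ins here, not this repo's models): train a few surrogates that differ only in their random seed and average their predictive distributions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Toy data standing in for the surrogate's training set.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Ensemble: identical models, different random seeds, averaged predictions.
members = [
    MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=s).fit(X, y)
    for s in range(5)
]
ensemble_probs = np.mean([m.predict_proba(X) for m in members], axis=0)

# ensemble_probs would then play the role of the surrogate's predictive
# distribution when computing acquisition scores.
print(ensemble_probs.shape)  # (500, 2)
```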

> I guess that's the reason you developed "FancyUnbiasedRiskEstimatorCut"?

I can't quite remember what I tried with FancyUnbiasedRiskEstimatorCut, but it's an idea that is not fully developed. If you do not trust your surrogate model, try setting

```yaml
uniform_clip: True
uniform_clip_val: 0.2  # (or higher?)
```

which ensures that no single test-pool sample is assigned too small a probability, as detailed in the appendix of the paper. This can, to some extent, help with problems caused by missing high-loss samples during acquisition.
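Roughly, the intended effect is something like the following floor-and-renormalise rule (a simplified illustration of the "no sample gets too small a probability" behaviour, not the exact implementation):

```python
import numpy as np

def clip_to_uniform(q, clip_val=0.2):
    """Floor each acquisition probability at clip_val * (1 / pool_size) and
    renormalise, so no remaining test-pool point is assigned a vanishingly
    small probability. (Simplified; the actual code may differ.)"""
    floor = clip_val / len(q)
    q_clipped = np.maximum(q, floor)
    return q_clipped / q_clipped.sum()

q = np.array([0.90, 0.09, 0.009, 0.001])
print(clip_to_uniform(q))  # the two tiny probabilities get lifted to 0.05 before renormalising
```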

> For the importance sampling acquisition, I've not tried it, but I think importance sampling can only guarantee zero bias and cannot guarantee that the variance is lower than for i.i.d. sampling?

Active testing is a pool-based importance sampling (IS) approach. There exists an optimal (i.e. zero-variance) estimator for IS/active testing, but it is usually unknown.
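For reference, the standard importance-sampling result behind this: for a non-negative loss, the zero-variance proposal samples each remaining test point with probability proportional to its true (unknown) loss,

```latex
q^{*}(i_m) \;\propto\; L\bigl(f(x_{i_m}), y_{i_m}\bigr)
```

In practice the acquisition has to use the surrogate's expected loss as a stand-in, which is why the gains over i.i.d. sampling depend on how well the surrogate approximates the true losses.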

> this results in a biased estimate on my dataset

I can't quite follow how you have used WeightedAccuracyLoss in your code. Active testing will always give you unbiased estimates, for any valid loss function, as long as the acquisition function assigns non-zero probability to every point left in the test pool. How are you using WeightedAccuracyLoss with the R_SURE estimator? The WeightedAccuracyLoss as you have defined it cannot be evaluated pointwise, i.e. the per-point values are tied together by the sum(weights) normalisation. How are you plugging this into R_SURE (Eq. 3 of the paper)? Maybe this is the problem?
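To illustrate the distinction I mean (I'm only guessing at what your class computes, so treat this as an example, not a reconstruction of your code):

```python
import numpy as np

true = np.array([0, 1, 1, 0])
pred = np.array([0, 1, 0, 0])
class_weights = np.array([1.0, 5.0])
w = class_weights[true]
errors = (pred != true).astype(float)

# NOT pointwise: the value attached to each datapoint depends on which other
# points were acquired, via the sum(weights) normalisation.
weighted_mean_loss = np.sum(w * errors) / np.sum(w)

# Pointwise: each datapoint contributes w_n * error_n on its own, and the
# target risk is simply the mean of these per-point terms over the test set.
pointwise_losses = w * errors
risk = pointwise_losses.mean()

print(weighted_mean_loss, risk)
```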

I've just re-read my message to you, and there is one thing I'd like to clarify: if you care about a weighted loss, you should also weight the acquisition function (and not just the loss) accordingly, i.e. re-derive the acquisition for cross-entropy/accuracy but with the weights included (similar to what we did in the paper). The acquisition function should cater to the loss for sample-efficient active testing (please see the paper).
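Concretely, something along these lines could serve as the weighted acquisition (a sketch only, not this repo's acquisition code): score each remaining pool point by the surrogate's expected weighted 0-1 loss for the main model's prediction, and acquire with probability proportional to that score.

```python
import numpy as np

def weighted_accuracy_acquisition(main_preds, surrogate_probs, class_weights):
    """Expected weighted 0-1 loss under the surrogate, per remaining pool point:
    sum over classes y != main prediction of class_weights[y] * surrogate_probs[:, y].
    Returned scores are normalised into acquisition probabilities."""
    n = len(main_preds)
    weighted = surrogate_probs * class_weights[None, :]          # w(y) * pi(y | x)
    scores = weighted.sum(axis=1) - weighted[np.arange(n), main_preds]
    return scores / scores.sum()

# Toy example: 3 pool points, 2 classes, rare class 1 up-weighted.
surrogate_probs = np.array([[0.9, 0.1],
                            [0.4, 0.6],
                            [0.8, 0.2]])
main_preds = np.array([0, 1, 0])
q = weighted_accuracy_acquisition(main_preds, surrogate_probs, np.array([1.0, 5.0]))
print(q)  # higher probability where a costly (class-1) error is more likely
```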