hendrycks / outlier-exposure

Deep Anomaly Detection with Outlier Exposure (ICLR 2019)

Questions about the test phase and the training detail #18

Closed xfffrank closed 3 years ago

xfffrank commented 3 years ago

Hi, I'm new to out-of-distribution detection. After reading the paper and the code, I still cannot figure out how the out-of-distribution data is detected. I see two related lines of code, shown below.

out_score = get_ood_scores(ood_loader)
measures = get_measures(out_score, in_score)

It seems that the detection process depends on in_score. But how is the prediction made in a real application scenario? I'm confused. You could point me to some references if that's easier to explain.

In the training script, the cross entropy between the softmax distribution and the uniform distribution is implemented with this line.

loss += 0.5 * -(x[len(in_set[0]):].mean(1) - torch.logsumexp(x[len(in_set[0]):], dim=1)).mean()

How does torch.logsumexp(x[len(in_set[0]):], dim=1) represent the uniform distribution?

Thanks.

hendrycks commented 3 years ago

in_score has the "anomaly scores" for in-distribution examples, and out_score has the scores for out-of-distribution points. We'd like for in_score values to be quite distinct from out_score values to successfully perform OOD detection.
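
To make that concrete, here is a rough sketch (not the repo's exact code; the model and loaders below are dummies just so it runs end to end): score every example with the negative maximum softmax probability, so higher means more anomalous, then check how well the two score sets separate, e.g. with AUROC. In a real application you would instead threshold a new example's score.

import numpy as np
import torch
import torch.nn.functional as F
from sklearn.metrics import roc_auc_score  # stand-in for get_measures

def anomaly_scores(model, loader, device="cpu"):
    # Higher score = more anomalous; here the score is the negative max softmax probability.
    model.eval()
    scores = []
    with torch.no_grad():
        for x, _ in loader:
            probs = F.softmax(model(x.to(device)), dim=1)
            scores.append(-probs.max(dim=1).values.cpu())
    return torch.cat(scores).numpy()

# Dummy model and loaders, only so the sketch runs; use your trained network and real data.
model = torch.nn.Linear(32, 10)
in_loader = [(torch.randn(64, 32), torch.zeros(64, dtype=torch.long))]
ood_loader = [(torch.randn(64, 32) + 2.0, torch.zeros(64, dtype=torch.long))]

in_score = anomaly_scores(model, in_loader)
out_score = anomaly_scores(model, ood_loader)

# Label OOD as 1 and in-distribution as 0, then ask how well the scores rank them.
labels = np.concatenate([np.zeros_like(in_score), np.ones_like(out_score)])
print(roc_auc_score(labels, np.concatenate([in_score, out_score])))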

For your second question, these might be useful (the maths is complicated): https://github.com/hendrycks/outlier-exposure/issues/12 https://github.com/hendrycks/outlier-exposure/issues/14
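
In short (a sketch of the algebra, not a substitute for those threads): with K classes and logits x, the cross-entropy between softmax(x) and the uniform distribution is -(1/K) * sum_k log softmax(x)_k = logsumexp(x) - mean(x), which is exactly the per-example term in that line (the 0.5 is just the loss weight). A quick numerical check:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(8, 10)  # fake logits: 8 outlier examples, 10 classes

# Per-example term from the training script:
oe_term = -(x.mean(1) - torch.logsumexp(x, dim=1))

# Per-example cross-entropy between softmax(x) and the uniform distribution:
uniform = torch.full_like(x, 1.0 / x.size(1))
ce_to_uniform = -(uniform * F.log_softmax(x, dim=1)).sum(1)

print(torch.allclose(oe_term, ce_to_uniform))  # True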

xfffrank commented 3 years ago

Thanks for your quick reply! I still have two more questions. @hendrycks

  1. Can I just apply the loss above directly for OOD detection with binary classes? What about one-class classification?

  2. I'm wondering why the AUROC score is used instead of accuracy in many OOD papers. Is it because the appropriate threshold depends on the particular outlier distribution?

hendrycks commented 3 years ago

  1. If you are doing a "hot dog" vs "not hotdog" type of task, then this one-class method will probably work better: https://github.com/hendrycks/ss-ood
  2. People use AUROC and AUPR because the OOD or anomalous class may appear at a frequency quite unlike "usual" examples. If 1% of the examples are OOD, then it's easy to get 99% accuracy by always predicting that examples are "usual." For tasks with imbalanced data, AUROC or AUPR is often preferable to accuracy; a tiny numerical illustration follows below.
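
Illustration (hypothetical numbers, 1% OOD):

import numpy as np
from sklearn.metrics import roc_auc_score

labels = np.zeros(10_000)  # 0 = "usual", 1 = OOD
labels[:100] = 1           # 1% of the examples are OOD

# A detector that always predicts "usual" gets 99% accuracy...
always_usual = np.zeros_like(labels)
print((always_usual == labels).mean())  # 0.99

# ...but as an anomaly score it is constant, so its AUROC is 0.5 (chance level).
print(roc_auc_score(labels, always_usual))  # 0.5
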
xfffrank commented 3 years ago

Thank you so much! I'm already reading that paper and the implementation. If you don't mind, I've got two follow-up questions.

  1. For an imbalanced dataset containing outliers, we often choose AUROC as the metric. And if I want to know whether a given image is an outlier or not, I need to set a threshold on the anomaly score (perhaps based on the outlier samples I can find). Am I right?

  2. Will the unsupervised method in https://github.com/hendrycks/ss-ood work if I need to know whether an image belongs to a specific kind of cat? I mean, if the unknown samples can come from a close but different distribution, what can we do to avoid false-positive predictions (assuming in-distribution data is the positive class)?

hendrycks commented 3 years ago

  1. In practice, yes. However, the AUROC summarizes performance across different thresholds, so it is good for capturing model performance for many different use cases (see the threshold sketch after this list). https://www.dataschool.io/roc-curves-and-auc-explained/
  2. If you have multiple labeled cat breeds, then the multiclass method of section 4.1 of this paper might be more helpful.
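
On the threshold point above, one common recipe (just a sketch, assuming higher scores mean more anomalous): pick the threshold from the in-distribution scores alone, e.g. their 95th percentile, so roughly 5% of in-distribution examples get flagged regardless of which outlier distribution shows up later.

import numpy as np

# Hypothetical scores just to make the snippet runnable (higher = more anomalous).
rng = np.random.default_rng(0)
in_score = rng.normal(0.0, 1.0, 1000)   # in-distribution anomaly scores
out_score = rng.normal(2.0, 1.0, 1000)  # out-of-distribution anomaly scores

threshold = np.percentile(in_score, 95)  # flag anything above the top 5% of in-dist scores

fpr = (in_score > threshold).mean()   # ~0.05 by construction
tpr = (out_score > threshold).mean()  # detection rate on these particular outliers
print(threshold, fpr, tpr)
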
xfffrank commented 3 years ago

Thanks a lot!