guyera / Generalized-ODIN-Implementation


Finding ε and threshold #7

Closed: ROBOTICSENGINEER closed this issue 3 years ago

ROBOTICSENGINEER commented 3 years ago

According to the G-ODIN paper, both ε and the threshold should be tuned on only known data. However, the current code uses the ODIN method, i.e. ROC and grid search.

guyera commented 3 years ago

According to the G-ODIN paper, both ε and the threshold should be tuned on only known data

  1. It is true that the perturbation magnitude should be tuned only on known data. This, along with the decomposed confidence head (logits f(x) = h(x) / g(x)), is the main difference between G-ODIN and ODIN; a rough sketch of such a head appears at the end of this comment. However, this implementation does provide metrics for tuning only on the known data.

    When run, a "true best auc" and a "supposedly best auc" is reported. The "supposedly best auc" is the AUC associated with the perturbation magnitude which yielded the largest value of validation_results. validation_results is simply the average maximum softmax score over the validation set, which consists only of known data (see get_datasets). The maximum softmax score of a data point is also its nominality score S, and so this lines up with how the perturbation magnitude was tuned in the G-ODIN paper (See equation 10 in the paper; the paper maximizes the sum of in-distribution validation scores whereas we maximize the mean, but these two objectives are equivalent).

    The "true best auc" is the actual largest AUC reported across all perturbation magnitudes (with no validation at all; each AUC is computed using the full test set, including both nominals and anomalies). The only reason the "true best auc" is reported at all is to get a feel of how "useful" the validation method is to begin with. If the supposedly best AUC is far lower than the true best AUC, then that means that the validation scores are not perfectly representative of the quality of the perturbation magnitude. In other words, its a metric for measuring how well G-ODIN's known-only hyperparameter tuning actually works.

    So if you're looking to reproduce the paper's results, you should look at the "supposedly best auc" reported.

  2. What do you mean by the "threshold"? If you mean the threshold over the nominality scores used to make a binary (nominal vs anomaly) decision, then this is not tuned at all in the G-ODIN paper (nor ODIN, nor most other OOD papers). Rather, the choice of threshold is entirely dependent on the nature of the application (See section 2, "Background", of the G-ODIN paper). For this reason, methods like G-ODIN often use AUROC for evaluation and comparison; AUROC is inherently independent of any choice of threshold.