google-research / selfstudy-adversarial-robustness

Apache License 2.0

Detection false positives? #4

Open dxoigmn opened 3 years ago

dxoigmn commented 3 years ago

When evaluating whether inputs are adversarial, the framework first checks whether the classification of the input matches the ground-truth label. Only if the classification does not match does it then use the detection mechanism to reject/ignore the input. The framework considers the input adversarial only when it is both misclassified and undetected.

https://github.com/google-research/selfstudy-adversarial-robustness/blob/15d1c0126e3dbaa205862c39e31d4e69afc08167/common/framework.py#L217-L234

My expectation was that correctly classified inputs also ought to be rejected if they trip the detector, but because L223 returns early this can never happen. This is particularly pronounced in the transform defense, where a non-trivial majority of the benign inputs would be rejected by the "stable prediction" detector. Is this intentional? It's a little weird to force the attacker to defeat an objective that the defender itself can almost never achieve.
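For concreteness, here is a minimal sketch of the control flow I'm describing; the names (`evaluate_attack`, `classify`, `detect`) are placeholders, not the framework's actual API:

```python
# Hypothetical sketch of the evaluation logic described above; the defense
# object and its methods are illustrative placeholders.
def evaluate_attack(defense, x, true_label):
    prediction = defense.classify(x)   # model's predicted label for the input
    if prediction == true_label:
        # Early return: the attack failed, and the detector is never consulted,
        # even if it would have flagged this input.
        return False
    if defense.detect(x):
        # Misclassified but flagged by the detector: the attack still failed.
        return False
    # Misclassified and undetected: the attack succeeded.
    return True
```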

carlini commented 3 years ago

Yeah, this was intentional. The idea behind the evaluate function is that it should return true only when the attacker has succeeded, and in both of these cases (the input is classified correctly, or it is detected as adversarial) the attacker has failed.

Note that even if we flipped the order of the if statements, the result would be the same: for a correctly classified input that trips the detector, the function would still return false.
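To illustrate (with hypothetical helpers, not the actual framework code): both orderings return true exactly when the input is misclassified and undetected, so swapping the two checks changes nothing for any input, including correctly classified inputs that trip the detector.

```python
# Both orderings of the two checks compute the same conjunction:
# "misclassified AND not detected". Names here are illustrative only.
def attack_succeeded_v1(prediction, true_label, detected):
    if prediction == true_label:
        return False            # correctly classified: attack failed
    if detected:
        return False            # flagged by the detector: attack failed
    return True

def attack_succeeded_v2(prediction, true_label, detected):
    if detected:
        return False
    if prediction == true_label:
        return False
    return True

# Exhaustive check over all four cases: the two orderings always agree.
for prediction, detected in [(0, False), (0, True), (1, False), (1, True)]:
    assert attack_succeeded_v1(prediction, 0, detected) == \
           attack_succeeded_v2(prediction, 0, detected)
```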

That said, your point might still be valid: maybe for this particular defense we made the detector too strict, and too many benign inputs would be rejected.

If you do want to compute the clean accuracy, you can run with --test; that will run the method linked below, which reports, for the clean examples, how often they are classified correctly and not detected as adversarial.

https://github.com/google-research/selfstudy-adversarial-robustness/blob/15d1c0126e3dbaa205862c39e31d4e69afc08167/evaluate.py#L142-L148
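In case it helps, here is a rough sketch of what that measurement looks like (again with placeholder names, not the repository's actual API): a clean example counts as correct only if it is classified correctly and the detector does not flag it.

```python
# Hypothetical sketch of the clean-accuracy measurement described above.
def clean_accuracy(defense, examples, labels):
    correct = 0
    for x, y in zip(examples, labels):
        prediction = defense.classify(x)   # placeholder classifier call
        flagged = defense.detect(x)        # placeholder detector call
        if prediction == y and not flagged:
            correct += 1
    return correct / len(examples)
```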