ryderling / DEEPSEC

DEEPSEC: A Uniform Platform for Security Analysis of Deep Learning Model

Computing the average over different threat models is meaningless #5

Open carlini opened 5 years ago

carlini commented 5 years ago

Security is all about worst-case guarantees. Despite this fact, the paper draws many of its inferences from average-case robustness.

This is fundamentally flawed.

If a defense gives 0% robustness against one attack and 100% robustness against another attack, the defense is not "50% robust". It is 0% robust. Completely broken and ineffective.

Now this doesn't preclude it from being possibly useful or informative in some settings. But it cannot in good faith be called partially secure. If a defense claims l_2 robustness and an l_2 attack can generate adversarial examples against it with distortion similar to that of an undefended model, then it is broken. The fact that some other l_2 attack fails to generate adversarial examples is irrelevant.

Averaging across many different attacks, several of which are weak single-step attacks, artificially inflates the apparent robustness. Imagine a further row that measured robustness to uniform random noise within the distortion bound: adding this "attack" would make every defense suddenly appear more robust, which is clearly not the case.
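
Concretely, here is a minimal sketch in plain Python (the attack names and accuracy numbers are hypothetical, not results from the paper) contrasting mean-based with worst-case aggregation:

```python
# Minimal sketch (hypothetical accuracy numbers, not results from the
# paper) contrasting mean-based with worst-case aggregation.
per_attack_accuracy = {
    "strong_iterative_attack": 0.00,  # the defense is fully broken here
    "weak_single_step_attack": 1.00,  # the defense resists this one
}

mean_acc = sum(per_attack_accuracy.values()) / len(per_attack_accuracy)
worst_case = min(per_attack_accuracy.values())
print(f"mean: {mean_acc:.2f}")          # 0.50, i.e. "50% robust"
print(f"worst case: {worst_case:.2f}")  # 0.00, i.e. broken

# Appending a trivially weak "attack" (uniform random noise within the
# distortion bound) raises the mean but can never raise the minimum.
per_attack_accuracy["uniform_random_noise"] = 0.98
mean_acc = sum(per_attack_accuracy.values()) / len(per_attack_accuracy)
print(f"mean with weak row: {mean_acc:.2f}")                   # 0.66, inflated
print(f"worst case: {min(per_attack_accuracy.values()):.2f}")  # still 0.00
```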

ryderling commented 5 years ago

> Security is all about worst-case guarantees. Despite this fact, the paper draws many of its inferences from average-case robustness.
>
> This is fundamentally flawed.
>
> If a defense gives 0% robustness against one attack and 100% robustness against another attack, the defense is not "50% robust". It is 0% robust. Completely broken and ineffective.
>
> Now this doesn't preclude it from being possibly useful or informative in some settings. But it cannot in good faith be called partially secure. If a defense claims l_2 robustness and an l_2 attack can generate adversarial examples against it with distortion similar to that of an undefended model, then it is broken. The fact that some other l_2 attack fails to generate adversarial examples is irrelevant.
>
> Averaging across many different attacks, several of which are weak single-step attacks, artificially inflates the apparent robustness. Imagine a further row that measured robustness to uniform random noise within the distortion bound: adding this "attack" would make every defense suddenly appear more robust, which is clearly not the case.

Again, we do agree that security is about worst-case guarantees, and that no system or model is absolutely secure from this perspective. However, from a practical point of view, security is often treated as relative or statistical.

Instead of declaring all defenses absolutely insecure (0% robustness), we think it is also important to experimentally evaluate the effectiveness of defenses against existing attacks and to capture the overall security differences across types of defenses, rather than focusing on one particular defense. For instance, in the security community, people test malware against all vendors on VirusTotal, and the malware that bypasses the most vendors is considered stronger. Such statistical knowledge is important for the community alongside worst-case analysis.

As you suggested, evaluating defenses against all kinds of attacks might need more fine-grained experiments (e.g., distinguishing single-step from iterative attacks, or gradient-based from optimization-based attacks) as well as more detailed discussion.
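
As a rough sketch, one shape such a fine-grained report could take (the attack names, categories, and accuracy values below are illustrative only, not DEEPSEC results):

```python
# Sketch of a fine-grained report: group attacks by category and report
# the per-category worst case instead of one grand average.
# All names and values are illustrative.
attack_results = [
    ("FGSM",   "single-step",  0.72),
    ("R+FGSM", "single-step",  0.68),
    ("BIM",    "iterative",    0.06),
    ("PGD",    "iterative",    0.04),
    ("CW2",    "optimization", 0.02),
]

per_category = {}
for name, category, acc in attack_results:
    per_category.setdefault(category, []).append(acc)

for category, accs in sorted(per_category.items()):
    print(f"{category:>12}: worst case {min(accs):.2f} "
          f"over {len(accs)} attack(s)")
```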

carlini commented 5 years ago

Reading my message above, I realize I essentially repeated the comment from #2 when I meant to say something different. Sorry about that; the argument I meant to make for this issue is the following:

You are computing the mean over different threat models, which gives a number that is completely uninterpretable. When you say that some defense is 60% robust on average (over the different threat models), how should we interpret this number? The only true interpretation is the following: if an adversary randomly chooses the l_infty threat model with probability 50%, the l_0 threat model with probability 5%, and the l_2 threat model with probability 45%, and then randomly chooses an attack from the available attacks, the defense will have 60% expected accuracy against that attack.
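
Spelled out, that interpretation is just a mixture expectation. A minimal sketch with hypothetical per-attack accuracies (only the 50%/5%/45% weights come from the example above):

```python
# Sketch of the only probabilistic reading of a "60% average robustness"
# number. All accuracies below are hypothetical, chosen for illustration.
accuracies = {  # per-attack robust accuracy, grouped by threat model
    "l_inf": [0.80, 0.50],
    "l_0":   [1.00],
    "l_2":   [0.60, 0.40],
}
weights = {"l_inf": 0.50, "l_0": 0.05, "l_2": 0.45}  # threat-model choice

# The adversary picks a threat model by weight, then an attack uniformly
# at random within it; the reported average is exactly this expectation.
expected = sum(w * sum(accuracies[tm]) / len(accuracies[tm])
               for tm, w in weights.items())
print(f"expected accuracy vs. a random attack: {expected:.2f}")  # 0.60 here

# The security-relevant number is instead the per-threat-model minimum:
for tm, accs in accuracies.items():
    print(f"{tm}: worst case {min(accs):.2f}")
```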

Is this at all the number you want to measure? Almost certainly not. Even if there were a way to justify using averages instead of minimums, computing the average over multiple different threat models is even less meaningful.

(This gets even more complicated when you realize that you're not only averaging over the threat model, but also over the attacker's objective, targeted vs. untargeted. But let's just leave that aside.)