softsys4ai / athena

Athena: A Framework for Defending Machine Learning Systems Against Adversarial Attacks
https://softsys4ai.github.io/athena/
MIT License

Normalized l2-dissimilarity #35

Closed ScottLiao920 closed 4 years ago

ScottLiao920 commented 4 years ago

Hi, I'm trying to reproduce your experiment. While evaluating FGSM attacks under the white-box setting, I found that the normalized l2-dissimilarity for FGSM at eps=0.1 is only about 0.007. In your code, the upper bound for a white-box attack is determined by some pre-generated adversarial examples. I'm wondering how you produced the results shown in Fig. 7 of your paper?

MENG2010 commented 4 years ago

Hi Scott,

Thanks for having an interest in our work.

Our approach is an ensemble model built upon many weak defenses (WDs): it first collects predictions from all WDs, then determines the final label for a given input using some strategy. In the white-box threat model, the attacker knows everything about the model (Section IV.C) and wants to generate adversarial examples (AEs) that fool as many WDs as possible. One possible approach is to introduce adversarial perturbations based on various WDs into a single AE. To do so, we use a greedy approach that adds perturbations based on multiple WDs into one AE (Algorithm 1). This algorithm has two constraints: $N$ (the number of WDs based on which we generate the adversarial perturbations; for a white-box threat model of a Majority-Voting ensemble, $N = 0.9 \times \text{number of WDs}$) and max_dissimilarity (the maximum l2-dissimilarity of each generated AE; in our study, we generated 10 sets of AEs with max_dissimilarity equal to 0.1 through 1.0).

Back to your question: yes, the dissimilarity of an AE generated by FGSM (eps=0.1) is very small, but by adding many tiny perturbations to the AE (we did this with a while-loop: in each iteration, we added the perturbation generated by FGSM (eps=0.1) against the selected target WD into the AE), we can generate an AE that is strong enough to fool N WDs. You can check the sample AEs in Figure 16.
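Conceptually, the greedy loop looks roughly like the sketch below. This is only an illustration of the idea, not the code in the repo: the `predict_label` interface, the caller-supplied `fgsm_perturb` helper, and the exact form of `normalized_l2` are assumptions here.

```python
import numpy as np

def normalized_l2(x, x_adv):
    """Normalized l2-dissimilarity (assumed form: ||x_adv - x||_2 / ||x||_2)."""
    return np.linalg.norm(x_adv - x) / np.linalg.norm(x)

def greedy_whitebox_ae(x, y, weak_defenses, fgsm_perturb, N, max_dissimilarity):
    """Greedily stack small targeted FGSM perturbations until the AE fools
    at least N weak defenses or the dissimilarity budget is spent.

    weak_defenses: list of models, each assumed to expose predict_label(x) -> int
    fgsm_perturb:  caller-supplied callable (model, x, y) -> small perturbation
                   (e.g., FGSM at eps=0.1)
    """
    x_adv = x.copy()
    while True:
        not_fooled = [wd for wd in weak_defenses if wd.predict_label(x_adv) == y]
        if len(weak_defenses) - len(not_fooled) >= N:
            break  # fools at least N WDs: strong enough
        if not not_fooled or normalized_l2(x, x_adv) >= max_dissimilarity:
            break  # nothing left to target, or dissimilarity budget exhausted
        # add one more tiny perturbation aimed at a WD that is still correct
        x_adv = np.clip(x_adv + fgsm_perturb(not_fooled[0], x_adv, y), 0.0, 1.0)
    return x_adv
```

The key point is that each individual FGSM step keeps its small eps; the overall dissimilarity only grows because many such steps, each targeting a different WD, are accumulated into the same AE up to the max_dissimilarity budget.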

In Figure 7, we also evaluated the AEs with a simple detector (Reuben Feinman, Ryan Curtin, Saurabh Shintre, and Andrew Gardner. Detecting adversarial samples from artifacts. arXiv:1703.00410, 2017) to show that such AEs, although they fool the ensemble model, are easily detectable by a simple detector. We have a detailed discussion in Section IV.F.

ScottLiao920 commented 4 years ago

Hi Meng,

Thanks for the reply. As indicated in your code (/src/evaluation/eval_whitebox.py, line 291), it seems that max_dissimilarity is set to the normalized l2 norm between benign samples and adversarial examples generated by the FGSM attack at eps=0.1. According to my experiment, their normalized Frobenius norm should be as small as 0.01. Do I need to adjust the max_perturb argument in line 291 to 0.1~1.0 during the experiments?

Thanks

MENG2010 commented 4 years ago

For each set of AEs, it is a value passed via the max_dissimilarity parameter.
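For reference, the dissimilarity being bounded can be computed roughly as below. This is a minimal sketch assuming a batch-level definition (Frobenius norm of the perturbation over Frobenius norm of the benign batch); check the paper and the repo for the exact normalization used.

```python
import numpy as np

def batch_normalized_l2(X_benign, X_adv):
    """Normalized l2-dissimilarity of a batch of AEs relative to the benign batch.
    Assumed definition: ||X_adv - X_benign||_F / ||X_benign||_F."""
    return np.linalg.norm(X_adv - X_benign) / np.linalg.norm(X_benign)

# A single FGSM AE at eps=0.1 typically gives a small value (~0.01), which is
# why many small perturbations are stacked until the chosen max_dissimilarity
# (0.1 through 1.0 in the paper's experiments) is reached.
```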

ScottLiao920 commented 4 years ago

Oh understood. Thanks!