thomas-tanay opened 6 years ago
Quick update:
I've made a few modifications to the “SVM on MNIST” section, based on my previous response.
I had already tried to improve our discussion of the non-linear case in previous revisions.
Lastly, I'd like to quickly mention the recent work of Anish Athalye, Nicholas Carlini and David Wagner showing that designing good defenses against adversarial examples is difficult, and little progress has been made so far. I agree in particular with Angus Galloway's comment:
Typically, people start with something state-of-the-art, and then claim some small drop in accuracy such that they see a boost against adversarial examples. Clearly this approach of attempting to defend so called state-of-the-art models isn't working. Heavily regularising a vanilla CNN (e.g with weight decay, inspired by Tanay & Griffin) is competitive with data augmentation / adversarial training techniques, but you see significant degradation of clean test accuracy as well. Unlike with adversarial training however, there is reason to believe that a model regularised in this way will be robust to a greater variety of attacks, and the fall-off less severe immediately beyond the perturbation magnitude used in training, as shown in Figure 6 in Madry et al.
I believe that one way forward, without relying on heavy data augmentation, is to start with small models that learn something useful, test that whatever little accuracy they obtain does not degrade with adversarial and non-examples, then progressively add more capacity in an iterative loop until satisfactory performance is reached.
This is the approach that we tried to follow in the present submission.
Thanks to Reviewers B and C for their comments and the time spent reviewing our submission.
Our response below is organized in two parts. We start by clarifying our goal and the way we approached it, before addressing the specific points raised by each reviewer.
Approach
Problem
The adversarial example phenomenon has attracted considerable attention and many elaborate attempts have been made at solving it – most of them leading to disappointing results. We believe that it is useful in this context to step back, focus on a simpler problem, and then progressively build up from there. Linear classification in particular appears as a sensible first step.
The existence of adversarial examples in linear classification has been known for several years, and the current dominant explanation is that they are a property of the dot product in high dimension: “adversarial examples can be explained as a property of high-dimensional dot products” [1]. This explanation has had a significant influence on the field and is still often mentioned when introducing the phenomenon (e.g. [2,3,4,5]). Yet we believe that it presents a number of limitations.
First, the formal argument is not entirely convincing: small perturbations do not provoke changes in activation that grow linearly with the dimensionality of the problem when they are considered relative to the activations themselves. Second, a number of results are not predicted by the linear explanation.
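To make this first point concrete, here is a small numpy sketch (our own illustration, not an experiment from the article): the input-space distance moved by a sign perturbation does grow with dimensionality, but so does the norm of a typical input, so the perturbation measured relative to the input stays roughly constant.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.1  # per-pixel perturbation size

for n in [100, 10_000, 1_000_000]:
    w = rng.normal(size=n)              # linear classifier weights
    x = rng.normal(size=n)              # a typical input at this dimensionality
    eta = eps * np.sign(w)              # fast-gradient-sign-style perturbation
    # distance moved toward the boundary, in input-space units ...
    shift = (w @ eta) / np.linalg.norm(w)
    # ... relative to the norm of the input itself: roughly constant in n
    print(n, shift / np.linalg.norm(x))
```

The absolute change in activation grows linearly with n, but once normalized by the scale of the inputs themselves the effect is dimension-independent.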
A 2-dimensional problem can suffer from adversarial examples, as shown by our toy problem:
Some high-dimensional problems do not suffer from adversarial examples. Again, our toy problem can illustrate this, if we consider that the images are 100 pixels wide and 100 pixels high (for instance) instead of being 2-dimensional:
More generally, varying the dimensionality of the problem does not actually influence the phenomenon.
Consider for instance the classification of 3 vs 7 MNIST digits with a linear SVM (from our arxiv paper). We do this on the standard version of the dataset and on a version where each image has been linearly interpolated to a size of 200×200 pixels (for both datasets, we also perturbed each image with some noise to add some variability).
Increasing the image resolution has no influence on the perceptual magnitude of the adversarial perturbations, even if the dimensionality of the problem has been multiplied by more than 50.
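This resolution invariance can be checked analytically for a linear classifier. The sketch below is our own toy construction (it uses nearest-neighbour replication rather than the linear interpolation used in the experiment, and random weights rather than a trained SVM): upsampling an image and the equivalent classifier leaves the per-pixel magnitude of the minimal adversarial perturbation unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.random((28, 28))                 # stand-in "image"
w = rng.normal(size=(28, 28))            # linear classifier weights (bias omitted)

def min_perturbation(w, x):
    # smallest L2 perturbation moving x onto the boundary w . (x + delta) = 0
    return -(np.sum(w * x) / np.sum(w * w)) * w

k = 8                                      # 28x28 -> 224x224 (dimensionality x64)
x_up = np.kron(x, np.ones((k, k)))         # nearest-neighbour upsampling
w_up = np.kron(w, np.ones((k, k))) / k**2  # equivalent classifier at the new resolution

d, d_up = min_perturbation(w, x), min_perturbation(w_up, x_up)
print(np.abs(d).max(), np.abs(d_up).max())  # identical per-pixel magnitude
```

The upsampled classifier computes exactly the same activations as the original one, and the minimal perturbation it admits has exactly the same per-pixel magnitude, despite the 64-fold increase in dimensionality.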
However, varying the level of regularization does influence the phenomenon. This observation was for instance made by Andrej Karpathy in this blog post:
a “linear classifier with lower regularization (which leads to more noisy class weights) is easier to fool [left]. Higher regularization produces more diffuse filters and is harder to fool [right]”
This result is not readily explicable by the linear explanation of [1].
Results
To resolve the previous misconceptions and explain the phenomenon of adversarial examples in linear classification, we introduce a number of ideas – some of which we thought were novel and worth sharing.
For instance, we show that L2 regularization controls the angle between the learned classifier and the nearest centroid classifier, resulting in a simple picture of the phenomenon of adversarial examples in linear classification.
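This claim can be checked numerically. The sketch below is our own toy construction, not the article's MNIST experiment: it trains a linear SVM by subgradient descent at two regularization levels on 2-D data containing a small off-axis subcluster, and measures the angle between the learned weight vector and the centroid difference.

```python
import numpy as np

rng = np.random.default_rng(3)

# two classes along the first axis, plus a small off-axis subcluster in class +1
X_pos = np.vstack([rng.normal([2.0, 0.0], 0.5, size=(180, 2)),
                   rng.normal([-2.0, 4.0], 0.5, size=(20, 2))])
X_neg = rng.normal([-2.0, 0.0], 0.5, size=(200, 2))
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(200), -np.ones(200)])

def train_svm(lam, lr=0.05, steps=5000):
    # full-batch subgradient descent on hinge loss + lam * ||w||^2 (no bias)
    w = np.zeros(2)
    for _ in range(steps):
        margins = y * (X @ w)
        viol = margins < 1
        grad = -((y[viol])[:, None] * X[viol]).sum(0) / len(X) + 2 * lam * w
        w -= lr * grad
    return w

z = X_pos.mean(0) - X_neg.mean(0)        # nearest centroid direction

def angle(u, v):
    c = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

print(angle(train_svm(lam=10.0), z))     # strong regularization: small tilt
print(angle(train_svm(lam=1e-3), z))     # weak regularization: large tilt
```

With strong regularization the learned direction collapses onto the centroid difference; with weak regularization the boundary tilts to accommodate the subcluster, exactly the behaviour the summary describes.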
Limits
In the second part of the article, we apply our new insights to non-linear classification. We observe that weight decay still acts on the scaling of the loss function and can therefore be interpreted as a form of adversarial training. We test this hypothesis on a very simple problem (LeNet on MNIST) and show that weight decay has indeed a significant influence on the robustness of our model.
Admittedly, the phenomenon is likely to be more complicated with deeper networks on more sophisticated datasets (more non-linearities and other forms of regularization at play). But we still think that our discussion of LeNet on MNIST constitutes a small step towards a better understanding of the phenomenon: it shows that our analysis does not completely break down as soon as we introduce some non-linearities, and it shows that L2 weight decay plays a more significant role than previously suspected (at least in this simple setup). Our hope is that this result will encourage further investigations of the relation between regularization and adversarial examples in deep networks.
Specific criticisms
Reviewer B
Maybe at a certain point, the problem really becomes a semantic one: what do we choose to call an adversarial example? In their seminal paper, Szegedy et al. [6] defined adversarial examples as the result of applying “an imperceptible non-random perturbation to a test image”. Adversarial perturbations are also typically difficult to interpret (as mentioned briefly in Goodfellow et al. [1]: “this perturbation is not readily recognizable to a human observer as having anything to do with the relationship between 3s and 7s.”).
These two conditions are met in linear classification when the boundary is strongly tilted:
But not when the boundary is not tilted (i.e. for the nearest centroid classifier). In that case, the perturbations become highly visible, and easy to interpret (as a difference of centroids):
The first case is counter-intuitive and necessitates an explanation. The second case is hardly surprising. In my opinion, the images in the second case should not be called “adversarial examples” but should instead be considered as “fooling images”: non-digit images which are recognized as digits with high confidence (a phenomenon more akin to the one discussed by Nguyen et al. [7]). If we make this distinction, then we can reasonably claim that in linear classification, “adversarial examples are primarily due to the tilting of the decision boundary”.
We agree that weight decay is a relatively crude instrument, and we have tried to be transparent about the fact that, although we do believe weight decay constitutes an effective regularizer against adversarial examples for LeNet on MNIST, this result is unlikely to generalize fully to state-of-the-art networks on more sophisticated datasets.
The text may still give the impression that we make unreasonable claims and we will try to improve this aspect further in our revisions.
This is an interesting remark. This observation does indeed suggest that neural networks present some symptoms of underfitting. Yet, they also clearly show some symptoms of overfitting, as emphasized for instance by the result of Zhang et al. [8]: neural networks often converge to zero training error, even on a random labelling of the data. Perhaps these two views are compatible: neural networks may need additional capacity to successfully fit adversarial perturbations, but they may also need additional regularization to help use the additional capacity in a meaningful way.
Our limited coverage of related work was mainly due to space considerations but I would be happy to expand further. I spent the month of November writing a literature review for my MPhil to PhD transfer report, and I've tried to keep the same writing style as for the Distill post. Some parts of it could potentially be polished and turned into a section or added as an appendix.
I understand your concern and I do agree that over-claiming is generally harmful and should be avoided. However, I thought that some of our ideas were indeed novel. For instance, I don't think it has been observed before that in linear classification, L2 regularization controls the angle between the learned classifier and the nearest centroid classifier (hence the phrase: “a new angle”).
Reviewer C
The overall goal of the piece is to provide an explanation of the adversarial example phenomenon in linear classification (summarized in conclusion: “our main goal here was to provide a clear and intuitive picture of the phenomenon in the linear case, hopefully constituting a solid base from which to move forward.”)
As emphasized before, we do not consider this piece to be purely pedagogical: clarity is important to us, but we also introduce a number of new ideas. In particular, we show that in linear classification, L2 regularization controls the angle between the learned classifier and the nearest centroid classifier.
Thank you for the references. I will try to add a comparison between these works and ours.
It is true that weight decay and adversarial training are not the same thing, but they share some similarities. In particular, both of them can be seen as a way of attributing penalties to correctly classified images during training (by moving them across the boundary with adversarial training, and by rescaling the loss function with weight decay). This is why we call weight decay “a form of adversarial training” or that we use phrases such as “the type of first-order adversarial training that L2 regularization implements”.
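For a linear model, the similarity can be made exact. The sketch below is our own illustration of this first-order connection: the hinge loss on a worst-case L2 perturbation of size eps equals the clean hinge loss with the margin penalized by eps·||w||, a penalty whose strength weight decay directly controls through ||w||.

```python
import numpy as np

rng = np.random.default_rng(2)
w = rng.normal(size=50)
x = rng.normal(size=50)
y = 1.0
eps = 0.3

hinge = lambda m: max(0.0, 1.0 - m)

# worst-case L2 perturbation of size eps for a linear model: step against the margin
x_adv = x - eps * y * w / np.linalg.norm(w)

loss_adv = hinge(y * (w @ x_adv))                        # adversarial training loss
loss_pen = hinge(y * (w @ x) - eps * np.linalg.norm(w))  # margin-penalized clean loss
print(loss_adv, loss_pen)                                # identical
```

Both formulations charge a penalty to correctly classified points that sit within eps of the boundary, which is the sense in which we relate weight decay to first-order adversarial training.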
Thank you for your question. This first plot is very important in my view and I realize now that I may have failed to explain it clearly. I am planning to do a number of modifications to improve this.
Let me try to explain it again here.
Consider the problem of classifying 2s versus 3s MNIST digits.
Let z be the unit vector pointing from one class centroid to the other (the normalized weight vector of the nearest centroid classifier) and w the weight vector of the trained SVM. There exists a plane containing z and w: we call it the tilting plane of w. We can find a vector n such that (z,n) is an orthonormal basis of the tilting plane of w by using the Gram-Schmidt process: n = normalize(w – (w.z) z).
We can then project the training data in (z,n) and we obtain something that looks like this:
The horizontal direction passes through the two centroids and the vertical direction is chosen such that w belongs to the plane (the hyperplane boundary simply appears as a line). Remark also that since (z,n) is an orthonormal basis, the distances in this plane are actual pixel distances.
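In code, the projection just described might look like this (a sketch; z and w are assumed to be given as flat arrays, with z the centroid direction and w the trained weight vector):

```python
import numpy as np

def tilting_plane_coords(X, z, w):
    # orthonormal basis (z, n) of the plane containing z and w (Gram-Schmidt)
    z = z / np.linalg.norm(z)
    n = w - (w @ z) * z
    n = n / np.linalg.norm(n)
    # coordinates of each row of X in the tilting plane; because (z, n) is
    # orthonormal, these coordinates are true pixel distances
    return X @ z, X @ n
```

Repeating this for each trained w (one per value of lambda) produces the frames of the animation.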
Now, we obtain the first animation (and the two related ones from the section “Example: SVM on MNIST”) by repeating this process 81 times with the regularization parameter lambda varying between 10^-1 and 10^7 (the exponent increasing by steps of 0.1). Remarkably, the tilting angle between z and w varies monotonically with lambda.
To understand why the data points appear to be moving around when lambda varies, one needs to imagine the tilting plane rotating around z in the high-dimensional input space (thus showing a different section of the high-dimensional training data for each value of lambda).
This idea can be illustrated with the following simplified scenario: z is the weight vector of the nearest centroid classifier. w1 is the weight vector of an SVM model trained with high regularization (lambda = 10^5). w2 is the weight vector of an SVM model trained with low regularization (lambda = 10^-1). w_theta rotates from w1 to w2.
Using the Gram-Schmidt process again, we find the vectors e1 and e2 such that (z,e1,e2) forms an orthonormal basis of the 3D subspace containing z, w1 and w2 (and by definition, w_theta):
e1 = normalize(w1 – (w1.z) z)
e2 = normalize(w2 – (w2.z) z – (w2.e1) e1)
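A sketch of this double Gram-Schmidt step and the resulting 3D projection (our own illustration, again assuming flat arrays):

```python
import numpy as np

def basis_3d(z, w1, w2):
    # orthonormal basis (z, e1, e2) of the subspace spanned by z, w1 and w2
    z = z / np.linalg.norm(z)
    e1 = w1 - (w1 @ z) * z
    e1 = e1 / np.linalg.norm(e1)
    e2 = w2 - (w2 @ z) * z - (w2 @ e1) * e1
    e2 = e2 / np.linalg.norm(e2)
    return z, e1, e2

def project_3d(X, z, w1, w2):
    # 3D coordinates of each row of X in the (z, e1, e2) basis
    B = np.stack(basis_3d(z, w1, w2), axis=1)  # columns are the basis vectors
    return X @ B
```

Any w_theta interpolated between w1 and w2 lies in this subspace by construction, so the rotating boundary can be drawn directly in these coordinates.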
We then project the training data in (z,e1,e2) and consider the boundaries defined by w1 and w2 (in light grey) and the boundary defined by w_theta (in orange). Below, we observe the space from a viewpoint that is orthogonal to z and w_theta for five different values of theta:
Although the 3D data is static, the points appear to be moving around because the tilting plane and the viewpoint are rotating around z (we see how the adversarial distance decreases as w_theta tilts from w1 to w2).
In the first animation, the situation is more complex because the 81 defined weight vectors span a subspace that is more than 3-dimensional. This subspace can no longer be visualized, but the projections of the training data into the tilting plane still can.
I am not sure which experiment you are referring to specifically.
For a linear classifier, Szegedy et al. actually observed a direct relation between the value of the regularization parameter lambda and the average minimum distortion:
FC(10^-4) → 0.062
FC(10^-2) → 0.1
FC(1) → 0.14
which seems to be consistent with our results. We expect lower regularization levels to lead to even smaller average minimum distortions (the values of lambda reported here are not directly comparable to ours).
There are two conceivable ways of evaluating the robustness of a model to adversarial perturbations. As suggested above, most authors fix the size of the perturbation (epsilon) and report an error rate. Here we choose to fix the confidence level (median value of 0.95) and report the size of the perturbation instead (we find it better suited to the visual evaluation task that we focus on). Arguably, both approaches have advantages and disadvantages.
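For a linear model with a sigmoid output, the size reported under the second protocol has a closed form. The sketch below is our own illustration (the function name and the assumption that x is currently classified positive are ours): move straight against w until the model assigns the target confidence to the wrong class.

```python
import numpy as np

def perturbation_size_at_confidence(w, b, x, target_conf=0.95):
    # smallest L2 perturbation driving a linear + sigmoid model to predict the
    # wrong class with the target confidence; assumes w @ x + b > 0 initially.
    # Move straight against w until w @ x' + b = -logit(target_conf).
    target_logit = np.log(target_conf / (1.0 - target_conf))
    return (w @ x + b + target_logit) / np.linalg.norm(w)
```

For deep networks no such closed form exists and the size must be found by iterative search, but the reported quantity is the same: a perturbation norm at fixed confidence rather than an error rate at fixed norm.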
We do have some results with a Network in Network architecture trained on SVHN. Overall, they suggest that weight decay does play a role and the minimum distortion tends to be higher and more meaningful for the network trained with higher weight decay.
weight decay = 0, test error = 8.1%
weight decay = 0.005, test error = 7.1%
However, it is difficult to know exactly what is going on there. For this reason, LeNet on MNIST appeared to be a simpler model to study as a first step.
In fact, what puzzles me most about the results with the NiN on SVHN is that even without weight decay, the adversarial perturbations tend to be much larger than those affecting models trained on ImageNet. In future work, I am planning to study in more detail under what conditions neural networks become more vulnerable to adversarial perturbations.
[1] Goodfellow, Ian J., Jonathon Shlens, and Christian Szegedy. "Explaining and harnessing adversarial examples." arXiv preprint arXiv:1412.6572 (2014).
[2] Kereliuk, Corey, Bob L. Sturm, and Jan Larsen. "Deep learning and music adversaries." IEEE Transactions on Multimedia 17.11 (2015): 2059-2071.
[3] Warde-Farley, David, and Ian Goodfellow. "Adversarial Perturbations of Deep Neural Networks." Perturbations, Optimization, and Statistics (2016): 311.
[4] Nayebi, Aran, and Surya Ganguli. "Biologically inspired protection of deep networks from adversarial attacks." arXiv preprint arXiv:1703.09202 (2017).
[5] Anonymous. "Thermometer Encoding: One Hot Way To Resist Adversarial Examples." International Conference on Learning Representations (2018). Under review.
[6] Szegedy, Christian, et al. "Intriguing properties of neural networks." arXiv preprint arXiv:1312.6199 (2013).
[7] Nguyen, Anh, Jason Yosinski, and Jeff Clune. "Deep neural networks are easily fooled: High confidence predictions for unrecognizable images." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
[8] Zhang, Chiyuan, et al. "Understanding deep learning requires rethinking generalization." arXiv preprint arXiv:1611.03530 (2016).