thomas-tanay opened 6 years ago
Quick update:
I've made a few modifications to the “SVM on MNIST” section, based on my previous response.
I had already tried to improve our discussion of the non-linear case in previous revisions.
Lastly, I'd like to quickly mention the recent work of Anish Athalye, Nicholas Carlini and David Wagner showing that designing good defenses against adversarial examples is difficult, and little progress has been made so far. I agree in particular with Angus Galloway's comment:
Typically, people start with something state-of-the-art, and then claim some small drop in accuracy such that they see a boost against adversarial examples. Clearly this approach of attempting to defend so called state-of-the-art models isn't working. Heavily regularising a vanilla CNN (e.g with weight decay, inspired by Tanay & Griffin) is competitive with data augmentation / adversarial training techniques, but you see significant degradation of clean test accuracy as well. Unlike with adversarial training however, there is reason to believe that a model regularised in this way will be robust to a greater variety of attacks, and the fall-off less severe immediately beyond the perturbation magnitude used in training, as shown in Figure 6 in Madry et al.
I believe that one way forward, without relying on heavy data augmentation, is to start with small models that learn something useful, test that whatever little accuracy they obtain does not degrade with adversarial and non-examples, then progressively add more capacity in an iterative loop until satisfactory performance is reached.
This is the approach that we tried to follow in the present submission.
Thanks to Reviewers B and C for their comments and the time spent reviewing our submission.
Our response below is organized in two parts. We start by clarifying our goal and the way we approached it, before addressing the specific points raised by each reviewer.
Approach
Problem
The adversarial example phenomenon has attracted considerable attention and many elaborate attempts have been made at solving it – most of them leading to disappointing results. We believe that it is useful in this context to step back, focus on a simpler problem, and then progressively build up from there. Linear classification in particular appears as a sensible first step.
The existence of adversarial examples in linear classification has been known for several years, and the current dominant explanation is that they are a property of the dot product in high dimension: “adversarial examples can be explained as a property of high-dimensional dot products” [1]. This explanation has had a significant influence on the field and is still often mentioned when introducing the phenomenon (e.g. [2,3,4,5]). Yet we believe that it presents a number of limitations.
First, the formal argument is not entirely convincing: small perturbations do not provoke changes in activation that grow linearly with the dimensionality of the problem when they are considered relative to the activations themselves. Second, a number of results are not predicted by the linear explanation.
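To make this first point concrete, here is a small numpy sketch (our own illustration, not an experiment from the article): the input-space distance moved by a sign perturbation does grow with dimensionality, but so does the norm of a typical input, so the perturbation measured relative to the input stays roughly constant.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.1  # per-pixel perturbation size

for n in [100, 10_000, 1_000_000]:
    w = rng.normal(size=n)              # linear classifier weights
    x = rng.normal(size=n)              # a typical input at this dimensionality
    eta = eps * np.sign(w)              # fast-gradient-sign-style perturbation
    # distance moved toward the boundary, in input-space units ...
    shift = (w @ eta) / np.linalg.norm(w)
    # ... relative to the norm of the input itself: roughly constant in n
    print(n, shift / np.linalg.norm(x))
```

The absolute change in activation grows linearly with n, but once normalized by the scale of the inputs themselves the effect is dimension-independent.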
A 2-dimensional problem can suffer from adversarial examples, as shown by our toy problem:
Some high-dimensional problems do not suffer from adversarial examples. Again, our toy problem can illustrate this, if we consider that the images are 100 pixels wide and 100 pixels high (for instance) instead of being 2-dimensional:
More generally, varying the dimensionality of the problem does not actually influence the phenomenon.
Consider for instance the classification of 3 vs 7 MNIST digits with a linear SVM (from our arxiv paper). We do this on the standard version of the dataset and on a version where each image has been linearly interpolated to a size of 200×200 pixels (for both datasets, we also perturbed each image with some noise to add some variability).
Increasing the image resolution has no influence on the perceptual magnitude of the adversarial perturbations, even if the dimensionality of the problem has been multiplied by more than 50.
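This resolution invariance can be checked analytically for a linear classifier. The sketch below is our own toy construction (it uses nearest-neighbour replication rather than the linear interpolation used in the experiment, and random weights rather than a trained SVM): upsampling an image and the equivalent classifier leaves the per-pixel magnitude of the minimal adversarial perturbation unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.random((28, 28))                 # stand-in "image"
w = rng.normal(size=(28, 28))            # linear classifier weights (bias omitted)

def min_perturbation(w, x):
    # smallest L2 perturbation moving x onto the boundary w . (x + delta) = 0
    return -(np.sum(w * x) / np.sum(w * w)) * w

k = 8                                      # 28x28 -> 224x224 (dimensionality x64)
x_up = np.kron(x, np.ones((k, k)))         # nearest-neighbour upsampling
w_up = np.kron(w, np.ones((k, k))) / k**2  # equivalent classifier at the new resolution

d, d_up = min_perturbation(w, x), min_perturbation(w_up, x_up)
print(np.abs(d).max(), np.abs(d_up).max())  # identical per-pixel magnitude
```

The upsampled classifier computes exactly the same activations as the original one, and the minimal perturbation it admits has exactly the same per-pixel magnitude, despite the 64-fold increase in dimensionality.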
However, varying the level of regularization does influence the phenomenon. This observation was for instance made by Andrej Karpathy in this blog post:
a “linear classifier with lower regularization (which leads to more noisy class weights) is easier to fool [left]. Higher regularization produces more diffuse filters and is harder to fool [right]”
This result is not readily explicable by the linear explanation of [1].
Results
To resolve the previous misconceptions and explain the phenomenon of adversarial examples in linear classification, we introduce a number of ideas – some of which we thought were novel and worth sharing.
For instance, we show that L2 regularization controls the angle between the learned classifier and the nearest centroid classifier, resulting in a simple picture of the phenomenon of adversarial examples in linear classification.
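This claim can be checked numerically. The sketch below is our own toy construction, not the article's MNIST experiment: it trains a linear SVM by subgradient descent at two regularization levels on 2-D data containing a small off-axis subcluster, and measures the angle between the learned weight vector and the centroid difference.

```python
import numpy as np

rng = np.random.default_rng(3)

# two classes along the first axis, plus a small off-axis subcluster in class +1
X_pos = np.vstack([rng.normal([2.0, 0.0], 0.5, size=(180, 2)),
                   rng.normal([-2.0, 4.0], 0.5, size=(20, 2))])
X_neg = rng.normal([-2.0, 0.0], 0.5, size=(200, 2))
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(200), -np.ones(200)])

def train_svm(lam, lr=0.05, steps=5000):
    # full-batch subgradient descent on hinge loss + lam * ||w||^2 (no bias)
    w = np.zeros(2)
    for _ in range(steps):
        margins = y * (X @ w)
        viol = margins < 1
        grad = -((y[viol])[:, None] * X[viol]).sum(0) / len(X) + 2 * lam * w
        w -= lr * grad
    return w

z = X_pos.mean(0) - X_neg.mean(0)        # nearest centroid direction

def angle(u, v):
    c = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

print(angle(train_svm(lam=10.0), z))     # strong regularization: small tilt
print(angle(train_svm(lam=1e-3), z))     # weak regularization: large tilt
```

With strong regularization the learned direction collapses onto the centroid difference; with weak regularization the boundary tilts to accommodate the subcluster, exactly the behaviour the summary describes.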
Limits
In the second part of the article, we apply our new insights to non-linear classification. We observe that weight decay still acts on the scaling of the loss function and can therefore be interpreted as a form of adversarial training. We test this hypothesis on a very simple problem (LeNet on MNIST) and show that weight decay has indeed a significant influence on the robustness of our model.
Admittedly, the phenomenon is likely to be more complicated with deeper networks on more sophisticated datasets (more non-linearities and other forms of regularization at play). But we still think that our discussion of LeNet on MNIST constitutes a small step towards a better understanding of the phenomenon: it shows that our analysis does not completely break down as soon as we introduce some non-linearities, and it shows that L2 weight decay plays a more significant role than previously suspected (at least in this simple setup). Our hope is that this result will encourage further investigations of the relation between regularization and adversarial examples in deep networks.
Specific criticisms
Reviewer B
Maybe at a certain point, the problem really becomes a semantic one: what do we choose to call an adversarial example? In their seminal paper, Szegedy et al. [6] defined adversarial examples as the result of applying “an imperceptible non-random perturbation to a test image”. Adversarial perturbations are also typically difficult to interpret (as mentioned briefly in Goodfellow et al. [1]: “this perturbation is not readily recognizable to a human observer as having anything to do with the relationship between 3s and 7s.”).
These two conditions are met in linear classification when the boundary is strongly tilted:
But not when the boundary is not tilted (i.e. for the nearest centroid classifier). In that case, the perturbations become highly visible, and easy to interpret (as a difference of centroids):
The first case is counter-intuitive and necessitates an explanation. The second case is hardly surprising. In my opinion, the images in the second case should not be called “adversarial examples” but should instead be considered as “fooling images”: non-digit images which are recognized as digits with high confidence (a phenomenon more akin to the one discussed by Nguyen et al. [7]). If we make this distinction, then we can reasonably claim that in linear classification, “adversarial examples are primarily due to the tilting of the decision boundary”.
We agree that weight decay is a relatively crude instrument, and we have tried to be transparent about the fact that, although we do believe weight decay constitutes an effective regularizer against adversarial examples for LeNet on MNIST, this result is unlikely to generalize fully to state-of-the-art networks on more sophisticated datasets.
The text may still give the impression that we make unreasonable claims and we will try to improve this aspect further in our revisions.
This is an interesting remark. This observation does indeed suggest that neural networks present some symptoms of underfitting. Yet, they also clearly show some symptoms of overfitting, as emphasized for instance by the result of Zhang et al. [8]: neural networks often converge to zero training error, even on a random labelling of the data. Perhaps these two views are compatible: neural networks may need additional capacity to successfully fit adversarial perturbations, but they may also need additional regularization to help use the additional capacity in a meaningful way.
Our limited coverage of related work was mainly due to space considerations but I would be happy to expand further. I spent the month of November writing a literature review for my MPhil to PhD transfer report, and I've tried to keep the same writing style as for the Distill post. Some parts of it could potentially be polished and turned into a section or added as an appendix.
I understand your concern and I do agree that over-claiming is generally harmful and should be avoided. However, I thought that some of our ideas were indeed novel. For instance, I don't think it has been observed before that in linear classification, L2 regularization controls the angle between the learned classifier and the nearest centroid classifier (hence the phrase: “a new angle”).
Reviewer C
The overall goal of the piece is to provide an explanation of the adversarial example phenomenon in linear classification (summarized in conclusion: “our main goal here was to provide a clear and intuitive picture of the phenomenon in the linear case, hopefully constituting a solid base from which to move forward.”)
As emphasized before, we do not consider this piece to be purely pedagogical: clarity is important to us, but we also introduce a number of new ideas. In particular, we show that in linear classification, L2 regularization controls the angle between the learned classifier and the nearest centroid classifier.
Thank you for the references. I will try to add a comparison between these works and ours.
It is true that weight decay and adversarial training are not the same thing, but they share some similarities. In particular, both of them can be seen as a way of attributing penalties to correctly classified images during training (by moving them across the boundary with adversarial training, and by rescaling the loss function with weight decay). This is why we call weight decay “a form of adversarial training” or that we use phrases such as “the type of first-order adversarial training that L2 regularization implements”.
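For a linear model, the similarity can be made exact. The sketch below is our own illustration of this first-order connection: the hinge loss on a worst-case L2 perturbation of size eps equals the clean hinge loss with the margin penalized by eps·||w||, a penalty whose strength weight decay directly controls through ||w||.

```python
import numpy as np

rng = np.random.default_rng(2)
w = rng.normal(size=50)
x = rng.normal(size=50)
y = 1.0
eps = 0.3

hinge = lambda m: max(0.0, 1.0 - m)

# worst-case L2 perturbation of size eps for a linear model: step against the margin
x_adv = x - eps * y * w / np.linalg.norm(w)

loss_adv = hinge(y * (w @ x_adv))                        # adversarial training loss
loss_pen = hinge(y * (w @ x) - eps * np.linalg.norm(w))  # margin-penalized clean loss
print(loss_adv, loss_pen)                                # identical
```

Both formulations charge a penalty to correctly classified points that sit within eps of the boundary, which is the sense in which we relate weight decay to first-order adversarial training.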
Thank you for your question. This first plot is very important in my view and I realize now that I may have failed to explain it clearly. I am planning to do a number of modifications to improve this.
Let me try to explain it again here.
Consider the problem of classifying 2s versus 3s MNIST digits.
Let z be the unit vector pointing from one class centroid to the other (the normalized weight vector of the nearest centroid classifier) and w the weight vector of the trained SVM. There exists a plane containing z and w: we call it the tilting plane of w. We can find a vector n such that (z,n) is an orthonormal basis of the tilting plane of w by using the Gram-Schmidt process: n = normalize(w – (w.z) z).
We can then project the training data in (z,n) and we obtain something that looks like this:
The horizontal direction passes through the two centroids and the vertical direction is chosen such that w belongs to the plane (the hyperplane boundary simply appears as a line). Remark also that since (z,n) is an orthonormal basis, the distances in this plane are actual pixel distances.
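In code, the projection just described might look like this (a sketch; z and w are assumed to be given as flat arrays, with z the centroid direction and w the trained weight vector):

```python
import numpy as np

def tilting_plane_coords(X, z, w):
    # orthonormal basis (z, n) of the plane containing z and w (Gram-Schmidt)
    z = z / np.linalg.norm(z)
    n = w - (w @ z) * z
    n = n / np.linalg.norm(n)
    # coordinates of each row of X in the tilting plane; because (z, n) is
    # orthonormal, these coordinates are true pixel distances
    return X @ z, X @ n
```

Repeating this for each trained w (one per value of lambda) produces the frames of the animation.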
Now, we obtain the first animation (and the two related ones from the section “Example: SVM on MNIST”) by repeating this process 81 times with the regularization parameter lambda varying between 10^-1 and 10^7 (the exponent increasing by steps of 0.1). Remarkably, the tilting angle between z and w varies monotonically with lambda.
To understand why the data points appear to be moving around when lambda varies, one needs to imagine the tilting plane rotating around z in the high-dimensional input space (thus showing a different section of the high-dimensional training data for each value of lambda).
This idea can be illustrated with the following simplified scenario: z is the weight vector of the nearest centroid classifier. w1 is the weight vector of an SVM model trained with high regularization (lambda = 10^5). w2 is the weight vector of an SVM model trained with low regularization (lambda = 10^-1). w_theta rotates from w1 to w2.
Using the Gram-Schmidt process again, we find the vectors e1 and e2 such that (z,e1,e2) forms an orthonormal basis of the 3D subspace containing z, w1 and w2 (and by definition, w_theta):
e1 = normalize(w1 – (w1.z) z)
e2 = normalize(w2 – (w2.z) z – (w2.e1) e1)
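A sketch of this double Gram-Schmidt step and the resulting 3D projection (our own illustration, again assuming flat arrays):

```python
import numpy as np

def basis_3d(z, w1, w2):
    # orthonormal basis (z, e1, e2) of the subspace spanned by z, w1 and w2
    z = z / np.linalg.norm(z)
    e1 = w1 - (w1 @ z) * z
    e1 = e1 / np.linalg.norm(e1)
    e2 = w2 - (w2 @ z) * z - (w2 @ e1) * e1
    e2 = e2 / np.linalg.norm(e2)
    return z, e1, e2

def project_3d(X, z, w1, w2):
    # 3D coordinates of each row of X in the (z, e1, e2) basis
    B = np.stack(basis_3d(z, w1, w2), axis=1)  # columns are the basis vectors
    return X @ B
```

Any w_theta interpolated between w1 and w2 lies in this subspace by construction, so the rotating boundary can be drawn directly in these coordinates.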
We then project the training data in (z,e1,e2) and consider the boundaries defined by w1 and w2 (in light grey) and the boundary defined by w_theta (in orange). Below, we observe the space from a viewpoint that is orthogonal to z and w_theta for five different values of theta:
Although the 3D data is static, the points appear to be moving around because the tilting plane and the viewpoint are rotating around z (we see how the adversarial distance decreases as w_theta tilts from w1 to w2).
In the first animation, the situation is more complex because the 81 defined weight vectors span a subspace that is more than 3-dimensional. This subspace can no longer be visualized, but the projections of the training data into the tilting plane still can.
I am not sure which experiment you are referring to specifically.
For a linear classifier, Szegedy et al. actually observed a direct relation between the value of the regularization parameter lambda and the average minimum distortion:
FC(10^-4) → 0.062
FC(10^-2) → 0.1
FC(1) → 0.14
which seems to be consistent with our results. We expect lower regularization levels to lead to even smaller average minimum distortions (the values of lambda reported here are not directly comparable to ours).
There are two conceivable ways of evaluating the robustness of a model to adversarial perturbations. As suggested above, most authors fix the size of the perturbation (epsilon) and report an error rate. Here we choose to fix the confidence level (median value of 0.95) and report the size of the perturbation instead (we find it better suited to the visual evaluation task that we focus on). Arguably, both approaches have advantages and disadvantages.
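For a linear model with a sigmoid output, the size reported under the second protocol has a closed form. The sketch below is our own illustration (the function name and the assumption that x is currently classified positive are ours): move straight against w until the model assigns the target confidence to the wrong class.

```python
import numpy as np

def perturbation_size_at_confidence(w, b, x, target_conf=0.95):
    # smallest L2 perturbation driving a linear + sigmoid model to predict the
    # wrong class with the target confidence; assumes w @ x + b > 0 initially.
    # Move straight against w until w @ x' + b = -logit(target_conf).
    target_logit = np.log(target_conf / (1.0 - target_conf))
    return (w @ x + b + target_logit) / np.linalg.norm(w)
```

For deep networks no such closed form exists and the size must be found by iterative search, but the reported quantity is the same: a perturbation norm at fixed confidence rather than an error rate at fixed norm.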
We do have some results with a Network in Network architecture trained on SVHN. Overall, they suggest that weight decay does play a role and the minimum distortion tends to be higher and more meaningful for the network trained with higher weight decay.
weight decay = 0, test error = 8.1%
weight decay = 0.005, test error = 7.1%
However, it is difficult to know exactly what is going on there. For this reason, LeNet on MNIST appeared to be a simpler model to study as a first step.
In fact, what puzzles me most about the results with the NiN on SVHN is that even without weight decay, the adversarial perturbations tend to be much larger than those affecting models trained on ImageNet. In future work, I am planning to study in more detail under what conditions neural networks become more vulnerable to adversarial perturbations.
[1] Goodfellow, Ian J., Jonathon Shlens, and Christian Szegedy. "Explaining and harnessing adversarial examples." arXiv preprint arXiv:1412.6572 (2014).
[2] Kereliuk, Corey, Bob L. Sturm, and Jan Larsen. "Deep learning and music adversaries." IEEE Transactions on Multimedia 17.11 (2015): 2059-2071.
[3] Warde-Farley, David, and Ian Goodfellow. "Adversarial Perturbations of Deep Neural Networks." Perturbations, Optimization, and Statistics (2016): 311.
[4] Nayebi, Aran, and Surya Ganguli. "Biologically inspired protection of deep networks from adversarial attacks." arXiv preprint arXiv:1703.09202 (2017).
[5] Anonymous. "Thermometer Encoding: One Hot Way To Resist Adversarial Examples." International Conference on Learning Representations (2018). Under review.
[6] Szegedy, Christian, et al. "Intriguing properties of neural networks." arXiv preprint arXiv:1312.6199 (2013).
[7] Nguyen, Anh, Jason Yosinski, and Jeff Clune. "Deep neural networks are easily fooled: High confidence predictions for unrecognizable images." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
[8] Zhang, Chiyuan, et al. "Understanding deep learning requires rethinking generalization." arXiv preprint arXiv:1611.03530 (2016).