by Nicholas Frosst, Sara Sabour, Geoffrey Hinton
paper link
In addition to being trained to classify images, the capsule model is trained to reconstruct the input image from the pose parameters and identity of the correct top-level capsule.
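As a rough illustration of this training setup, here is a minimal PyTorch sketch that masks out every top-level capsule except the one for the true class and feeds its pose vector to a small decoder. The decoder architecture, the use of cross-entropy in place of the capsule margin loss, and the weight `alpha` are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch: train-time reconstruction from the CORRECT capsule's pose.
# The decoder shape, loss choice, and `alpha` are assumptions for illustration.
import torch
import torch.nn as nn

class ReconstructionDecoder(nn.Module):
    """Decodes the (masked) top-level capsule poses back into an image."""
    def __init__(self, pose_dim=16, num_classes=10, img_size=28 * 28):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_dim * num_classes, 512), nn.ReLU(),
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, img_size), nn.Sigmoid(),
        )

    def forward(self, poses, class_idx):
        # poses: (batch, num_classes, pose_dim); zero out all but one capsule.
        mask = torch.zeros_like(poses)
        mask[torch.arange(poses.size(0)), class_idx] = 1.0
        return self.net((poses * mask).flatten(1))

def training_loss(logits, poses, images, labels, decoder, alpha=0.0005):
    # Classification loss plus a down-weighted reconstruction loss computed
    # from the pose of the *correct* top-level capsule.
    recon = decoder(poses, labels)
    cls_loss = nn.functional.cross_entropy(logits, labels)  # stand-in for the margin loss
    recon_loss = nn.functional.mse_loss(recon, images.flatten(1))
    return cls_loss + alpha * recon_loss
```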
We show that setting a threshold on the L2 distance between the input image and its reconstruction from the winning capsule is very effective at detecting adversarial images on three different datasets. The same technique works quite well for CNNs that have been trained to reconstruct the image from all or part of the last hidden layer before the softmax.
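A minimal sketch of the detection rule, assuming a `model` that returns class logits along with per-class pose vectors and a `decoder` like the one above (both names are placeholders): reconstruct from the winning capsule, measure the L2 distance, and flag the input when the distance exceeds a threshold.

```python
# Hedged sketch of DARCCC-style detection; `model` and `decoder` are placeholders.
import torch

def is_adversarial(images, model, decoder, threshold):
    logits, poses = model(images)      # poses: (batch, num_classes, pose_dim)
    winner = logits.argmax(dim=1)      # reconstruct from the PREDICTED class
    recon = decoder(poses, winner)
    dist = torch.norm(images.flatten(1) - recon, dim=1)  # L2 reconstruction distance
    return dist > threshold            # flag inputs that reconstruct poorly
```

The threshold would presumably be chosen from the distribution of reconstruction distances on clean validation images, e.g. so that only a small fraction of clean inputs is falsely flagged.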
We then explore a stronger, white-box attack that takes the reconstruction error into account. This attack is able to fool our detection technique, but in order to make the model change its prediction to another class, the attack must typically make the “adversarial” image resemble images of that other class.
Our detection method can be defeated by a stronger white-box attack, R-BIM, which takes the reconstruction error into account and iteratively perturbs the image so as to keep the reconstruction distance small. However, this stronger attack does not produce typical adversarial images that look like the original image with a small amount of added noise.
Since the reconstruction distance is also differentiable, we modify BIM into R-BIM, which additionally minimizes the reconstruction distance. R-BIM is designed specifically to break DARCCC. Fig. 5 visualizes the initial input and the result of 100 steps of R-BIM with a target class of ‘0’ for 10 random SVHN images. We see that several of the crafted examples do indeed look like ‘0’s. Effectively, they are not adversarial images at all, since they resemble their predicted class to the human eye.
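The description above implies an update rule like targeted BIM with an extra reconstruction term; here is a hedged sketch under that reading. The step size, epsilon ball, and weight `beta` are illustrative assumptions, not values from the paper (apart from the 100 steps).

```python
# Hedged sketch of R-BIM: targeted BIM whose descent direction also reduces
# the reconstruction distance, keeping the DARCCC detection score low.
# `model`/`decoder` as in the sketches above; eps, step_size, beta are assumptions.
import torch
import torch.nn.functional as F

def r_bim(images, target, model, decoder, steps=100, eps=0.1, step_size=0.01, beta=1.0):
    x = images.clone().detach()
    for _ in range(steps):
        x.requires_grad_(True)
        logits, poses = model(x)
        winner = logits.argmax(dim=1)
        recon = decoder(poses, winner)
        # Targeted classification loss plus the differentiable reconstruction distance.
        loss = F.cross_entropy(logits, target) \
             + beta * torch.norm(x.flatten(1) - recon, dim=1).mean()
        grad = torch.autograd.grad(loss, x)[0]
        with torch.no_grad():
            x = x - step_size * grad.sign()                # descend both terms
            x = images + (x - images).clamp(-eps, eps)     # stay inside the eps-ball
            x = x.clamp(0.0, 1.0).detach()                 # keep a valid image
    return x

# Example: push a batch toward class '0' as in Fig. 5.
# adv = r_bim(images, torch.zeros(images.size(0), dtype=torch.long), model, decoder)
```

Because the gradient now trades off fooling the classifier against keeping the reconstruction close, the perturbation is pushed toward something the winning capsule can genuinely reconstruct, which is consistent with the crafted examples ending up looking like the target class.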