openphilanthropy / unrestricted-adversarial-examples

Contest Proposal and infrastructure for the Unrestricted Adversarial Examples Challenge
Apache License 2.0

J. Gilmer's paper, "Motivating the Rules of the Game..." #64

Closed chrisbobbe closed 5 years ago

chrisbobbe commented 5 years ago

I'm so excited for this! Hopefully something in this post will be helpful.

I was glad to see an acknowledgment of Justin Gilmer and "Motivating the Rules of the Game for Adversarial Example Research" (https://arxiv.org/abs/1807.06732), but I wonder if even more can be drawn from this excellent paper, starting from its framework of five action spaces: indistinguishable perturbation, content-preserving perturbation, non-suspicious input, content-constrained input, and unconstrained input. Indicating which of these is under study would be a great way to eliminate confusion about the constraints on the attacker (some confusion, I believe, remains: https://github.com/google/unrestricted-adversarial-examples/issues/63).

From the abstract onwards, "Unrestricted Adversarial Examples" seems to say the attacks are not constrained. But in fact they are: attacks are limited to recognizable images of either birds or bicycles, as filtered by a panel. I think "unrestricted" was meant to signal the improvement over earlier research, where attacks were constrained to "small changes to pre-labeled data points" (which is great progress), but it caused me some slight confusion after reading Gilmer's paper, which assumes even more freedom for the attacker in "unconstrained" scenarios (e.g., unrecognizable inputs are allowed).

But aside from clarifying the methodological point of which attacks are allowed, the larger idea is that you can then tie the action space (which in this case would be "non-suspicious inputs," I think?) more closely to a real-world scenario where motivated adversaries may actually exist, and explore the implications. So it's probably worth stating that this study will not address the completely-random-noise case (as in a hack of a smartphone's facial recognition done in private), which it would if the attacks were entirely unconstrained. But the study will be particularly relevant to situations where human surveillance is involved, or, say, to fooling a bystander into ignoring an innocent-sounding but adversarial input to a smart speaker like the Amazon Echo.

Another good point from the Motivating paper is the danger of losing accuracy on ordinary examples while trying to defend against experts' adversarial examples. I know that private tests with ordinary examples will be done to guard against constant abstaining, but I want to make sure these tests aren't designed for that purpose alone: they carry the full responsibility of confirming (ideally) a 100% success rate* over a very large number of examples (with some abstentions allowed), since that is the standard applied to the attacks. This is especially important given the selection bias in the attacks: most of them will either be perturbed or be chosen to be particularly hard to classify. Optimizing to defend against either strategy has the potential side effect, if not handled well, of reducing accuracy on non-adversarial inputs, and I'd count that as a net loss of security. The appearance of "hardness inversion" (symptoms described in the Motivating paper) is a red flag that this might be happening in a given model.

Anyway, I'm so excited to see how this goes, and maybe one day I could participate!

*There seems to be a typo in section 4.2. It says, "To prevent models from abstaining on all inputs, models must reach 80% accuracy on a private eligibility dataset of clean bird-or-bicycle images," which seems to imply that 20% can be misclassified. Elsewhere I've seen it specified that 100% accuracy is required on all non-abstained examples (a much higher bar), with a maximum abstention rate (not error rate) of 20%. Aside from this (I believe it's a typo), my question about how many examples must be classified correctly (and other factors affecting these tests' thoroughness) still stands, I think.

carlini commented 5 years ago

That's definitely a good point about us not allowing unrecognizable images. Our motivation here was mainly one of helping the defender: this task is going to be exceptionally difficult for defenders to begin with, and by at least constraining the adversary to produce a valid image, we might hope it becomes somewhat easier.

As someone who mostly does attack work, I completely agree that the threat model of random-noise attacks is realistic (I've even done work in this space), but given the difficulty of even the valid-image attack, I don't yet want to allow completely arbitrary images.

The purpose of the test set (and withheld test set) will be to make sure the defender doesn't abstain on too many examples. We're going to try our hardest to make sure it covers most reasonable domains we can imagine. (In particular, it will include plenty of fresh images that are not present online at all, to ensure that someone who just scrapes the internet and builds a k-NN won't succeed.) Some of our images will be somewhat noisy, so classifiers won't be able to abstain on everything noisy. Some of them will be compressed, so classifiers won't get away with abstaining on compressed inputs.
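To give a flavor of what I mean by "noisy" and "compressed" eligibility images, here's a rough sketch; the noise level and JPEG quality are purely illustrative, not our actual pipeline:

```python
import io

import numpy as np
from PIL import Image


def noisy_and_compressed_variants(image, noise_std=0.05, jpeg_quality=40, seed=0):
    """Given a clean bird-or-bicycle image as an HxWx3 float array in [0, 1],
    return a mildly noisy copy and a JPEG-compressed copy. A defense that
    abstains on every such image would fail the coverage requirement.
    (Parameters here are illustrative guesses, not the contest's pipeline.)"""
    rng = np.random.default_rng(seed)
    noisy = np.clip(image + rng.normal(0.0, noise_std, size=image.shape), 0.0, 1.0)

    buf = io.BytesIO()
    Image.fromarray((image * 255).astype(np.uint8)).save(buf, format="JPEG", quality=jpeg_quality)
    compressed = np.asarray(Image.open(buf), dtype=np.float32) / 255.0
    return noisy, compressed
```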

I completely agree about the importance of doing this well, and we're going to do our best here.

(We'll make sure to clarify that we mean 80% coverage at 100% accuracy. Technically, though, 80% accuracy is still okay: if the classifier happened to make a confident error on one of the clean test examples, it would likely be trivial to attack it and trigger that error anyway.)
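For concreteness, here's a minimal sketch of how I think about that check; the `ABSTAIN` sentinel, the function, and the exact thresholds are illustrative, not our actual evaluation code:

```python
import numpy as np

ABSTAIN = -1  # illustrative sentinel for "the model declines to classify"


def passes_eligibility(predictions, labels, min_coverage=0.8):
    """Coverage is the fraction of clean eligibility images the model answers;
    accuracy is computed only over the answered ones. The bar we intend is
    at least 80% coverage with 100% accuracy on everything answered."""
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    answered = predictions != ABSTAIN
    if not answered.any():
        return False
    coverage = float(answered.mean())
    accuracy = float((predictions[answered] == labels[answered]).mean())
    return coverage >= min_coverage and accuracy == 1.0
```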

chrisbobbe commented 5 years ago

As long as defense remains very difficult, it makes sense to help the defender this way. Then more people are motivated to defend, and the more people who work on defense, the better the defenses that get developed. I'm learning fast, but I'm new to adversarial examples, and pretty new to machine learning in general. So at this stage it's possible for me to be a bit too influenced by strong claims about the best possible defenses, like those in the Motivating paper, since I'm not yet up to speed on where defenses currently stand.

I think the same explanation can be given for my comment on hardness inversion; I had this idea that it was very common to overfit on adversarial examples (whether this overfitting is done on particular adversarial examples, or just their general characteristics, I'm not sure), at the expense of accuracy on non-adversarial examples. In practice, I have no idea how common this is. And, as you said in your comment about 80% coverage vs. 80% accuracy, it's probably much more common to see a failure on a non-adversarial example and presume correctly that it will fail on a lot of adversarial examples. I would just point out that in this case, that kind of overfitting would have to be ruled out before making that presumption; does this sound right? And is it relevant (or even true) that we would expect attackers to submit almost zero ordinary examples (easy to classify, no distortions, etc.), to try to exploit such overfitting? Focusing more on the withheld tests only really makes sense if that's the case, and if this overfitting seems substantially likely, and I could be wrong on both counts.

Where would it be most useful for me to try to contribute? I can keep combing through the proposal to find potential areas for improvement (at the risk of raising more false alarms due to my lack of experience), or I can just dive in to building an attack or a defense (it would be my first time doing either). Do you need more people on attack or defense?

hendrycks commented 5 years ago

> Our motivation here was mainly one of helping the defender: this task is going to be exceptionally difficult for defenders to begin with, and by at least constraining the adversary to produce a valid image, we might hope it becomes somewhat easier.

If this assumption were not made, then this challenge would also have out-of-distribution/anomaly detection rolled into it.

carlini commented 5 years ago

Yeah that's another excellent way of looking at it.

Maybe someone should run an adversarial "hotdog"/"not hotdog" contest.

chrisbobbe commented 5 years ago

Haha yeah! Looking at the presence/absence of a single thing (as opposed to classifying between two things) seems like a good direction to explore. I can think about this more this evening, but I'm at work now (just let me know if that would be helpful). On a separate note: if it's ever useful to have a source of photos that don't exist on the internet, where you need to know they haven't been perturbed or anything, I could set up a quick React Native app that has camera permissions but not photo-library permissions and feeds photos directly to your team. Those photos could then be used as-is, or you could let attackers manipulate them, which would give the interesting constraint of not letting the attacker choose the starting point while still letting them manipulate the image; I'm sure the Motivating paper could guide choosing restrictions like this.

hendrycks commented 5 years ago

> Looking at the presence/absence of a single thing (as opposed to classifying between two things) seems like a good direction to explore.

Having done some exploratory work on this, I can say part of the difficulty is getting the system not to falsely detect random (non-adversarial) noise as the object of interest. Of course, one can train against such noise specifically, but getting the detector to generalize for free is surprisingly hard (even with density-estimation detectors).
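As a rough illustration of the failure mode I mean, something like the following sanity check, where `detector_score` and `threshold` are hypothetical stand-ins for a trained presence/absence detector and its decision threshold:

```python
import numpy as np


def noise_false_positive_rate(detector_score, threshold, n=1000,
                              shape=(224, 224, 3), seed=0):
    """Estimate how often a presence/absence detector fires on uniform random
    noise it was never trained on. `detector_score` is assumed to map a batch
    of images in [0, 1] to per-image confidence scores."""
    rng = np.random.default_rng(seed)
    noise = rng.uniform(0.0, 1.0, size=(n,) + shape).astype(np.float32)
    scores = np.asarray(detector_score(noise))
    return float(np.mean(scores >= threshold))
```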

chrisbobbe commented 5 years ago

Makes sense. I think this issue has been resolved; feel free to close it, if you think so too, and don't hesitate to contact me (I think my email is publicly viewable) if I can help in any way. This is exciting!