openphilanthropy / unrestricted-adversarial-examples

Contest Proposal and infrastructure for the Unrestricted Adversarial Examples Challenge

Contest proposal and release of training code and paper #55

Closed: AngusG closed this issue 5 years ago

AngusG commented 5 years ago

The proposal mentions that defense training code must be made public, which I agree will help make attackers as efficient as possible. However, I expect that some defenses will involve a careful combination of architecture and hyper-parameter selection, so a potential unintended side effect is that it could discourage defenders: someone else could show up at the last minute with more resources and run more or less the same code with additional hyper-parameter tuning.

Given that the weights and test-time code will already be public, would a reasonable compromise be to submit the training code to the review board only, for release at the conclusion of each 90-day period? This might also limit attempts to obfuscate the code.

I'm assuming the process of staking a defense with an accompanying paper is meant to prevent the main scenario outlined above, but can you provide a bit more information about how this works in the case of seemingly similar defenses? And if a defense is broken (the most probable outcome), how are re-submissions treated in terms of what counts as the same defense? I.e., would it be acceptable to tune the hyper-parameters of the previously broken defense, update the arXiv paper, and re-submit if it is believed that the new model has a good chance of withstanding the latest attacks?

carlini commented 5 years ago

You raise a good point. I hadn't carefully considered that before.

Let me offer a few potential counter-arguments. I haven't thought them through carefully yet, so I don't know if I agree with them, but here are the responses I could imagine.

  1. There are no fixed 90-day periods. Rather, each defense's 90-day clock runs from the moment it becomes staked. So for someone else to succeed by taking a previous person's code, they would have to break the previous defense first; otherwise, the original defense would win first.

  2. If a defense with one setting of hyper-parameters is broken, defenses with other settings are probably also broken. It seems unlikely that additional hyper-parameter tuning alone would take a defense from a non-zero error rate to a zero error rate.

  3. Stealing someone else's idea, even without code, is still possible. It would just take more work to re-implement it. We'll have to consider how to handle cases of "idea stealing" regardless, and I don't know what the right answer is here.

Probably the worst-case scenario in my mind is a broken defense winning the contest. This has been the driving force behind most of our design decisions. We make it easy for the defenders where it helps this end goal: the task is "easy" (trivial for humans) and defenses can abstain on everything that is adversarial. But we also make it hard: a single error is enough to lose and we require they release all details about their design, so that it's not some security-through-obscurity defense that wins.

However, you're definitely right we'll have to come up with some process for near-duplicate submissions. I don't know what that is yet.

nottombrown commented 5 years ago

Thanks for raising this issue @AngusG.

I agree with @carlini that the worst-case scenario for this whole project would be if a broken defense wins the contest, and that most of our design should be structured around avoiding that scenario.

For the issue of near-duplicate submissions, one solution could be to give the review board the discretion to select the original submission and the near-duplicate as "joint winners". My intuition is that this outcome is quite unlikely, because of reason #2 that @carlini described.

AngusG commented 5 years ago

I definitely agree that security-through-obscurity defenses should not be able to win, and reasonable measures should be taken to prevent this.

The combination of 1) no fixed time period per defense and the need to write a paper seems like a fairly strong deterrent to near-duplicate submissions. At the same time, I wouldn't underestimate 2) the impact of deliberate hyper-parameter selection, though I agree that a random/grid search based on a previously broken defense is unlikely to improve things much.

It does seem like some kind of credit-assignment system should be in place, though, given that the first round of defenses will likely all fail as people feel things out, with the most promising ones becoming the new baselines for the next round. I'm not sure how much money is on the line, but one option would be to conduct a brief interview in the event of "joint winners" and have each team explain how the defense works (from this it should be apparent which team did the bulk of the work).

Maybe there could be mini-rewards for defense/attack papers cited by the winning submission (counting as related work only those papers previously submitted to the competition), with steep penalties (potential loss of the reward) for failing to cite work that was built upon.

Anyway, kudos for the organization so far; setting up a competition like this is definitely not straightforward, but it is very valuable to the community.

catherio commented 5 years ago

One thing I'm hearing you express is a desire for a way for defenders to "feel things out" in the open arena, without necessarily making all of their early prototypes available for others to easily build on.

I'm curious if something like the following would resolve your concern. Allow a "closed" defense submission mode, in which the pre-trained model and weights are released, but not the code, hyperparameters, nor any data required to train it. If staked (by the review board, or by the submitter), successful attacks against a closed defense can still be granted the attacker prize. However, closed defense submissions cannot win the competition. A closed defense can be re-submitted as an open defense, in which case the 90-day timer starts and it becomes eligible to win.

AngusG commented 5 years ago

@catherio Yep, something like this would be really useful and resolves most of my concerns. I understand if the other suggestions are too much of a distraction from the main goal of discovering new attacks/defenses that work well in a practical scenario.

Something else that came to mind is that the rules mention "no confident mistakes", but I didn't see a confidence threshold for mistakes.

E.g., from Section 1.1

Models are allowed to abstain (by returning low-confidence predictions) on any adversarial input

Do the organizers have an equivalent to the abstain mechanism in mind if a defense does not implement one? Or may defenders pick an arbitrary threshold below which the model is said to abstain? (I sketch what I mean by the latter at the end of this comment.)

The second statement is based on my interpretation of:

we allow defenses to choose when to abstain using any mechanism they desire.

Are confident mistakes on the clean eligibility dataset acceptable provided the model achieves at least 80% accuracy?
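
To make the threshold question concrete, here is a minimal sketch of the kind of mechanism I have in mind. The function name, class order, and threshold value are assumptions for illustration only, not anything from the official contest API:

```python
# Illustration only: one possible way a defense could abstain by thresholding
# its own confidence. Nothing here is prescribed by the contest rules.
import numpy as np

ABSTAIN_THRESHOLD = 0.9  # hypothetical value; each defender would choose their own


def decide(logits):
    """Map two-class logits (assumed order: bird, bicycle) to 'bird', 'bicycle', or 'abstain'."""
    logits = np.asarray(logits, dtype=np.float64)
    probs = np.exp(logits - logits.max())  # softmax over the two classes
    probs /= probs.sum()
    if probs.max() < ABSTAIN_THRESHOLD:
        return "abstain"
    return ("bird", "bicycle")[int(np.argmax(probs))]


print(decide([2.0, -1.0]))  # high confidence -> 'bird'
print(decide([0.1, 0.0]))   # low confidence  -> 'abstain'
```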

carlini commented 5 years ago

Models will return one of {bird, bicycle, abstain} for each input. The requirements are, roughly: the model must reach at least 80% accuracy on the clean eligibility dataset, and it must never make a confident mistake (i.e., a non-abstaining wrong answer) on any adversarial input; abstaining on adversarial inputs is always allowed.

nottombrown commented 5 years ago

Addressed in #58