locuslab / convex_adversarial

A method for training neural networks that are provably robust to adversarial attacks.
MIT License

Computation time #30

Open sungyoon-lee opened 4 years ago

sungyoon-lee commented 4 years ago

Hi, I've run mnist.py on a single Titan X (Pascal) with the default settings. However, the speed is about 3x slower than that reported in Table 1 of "Scaling provable adversarial defenses": my run takes roughly 0.19 s/batch × 1200 batches ≈ 230 s/epoch, versus the reported 74 s/epoch.

I think the only differences are that I'm using PyTorch 1.4.0 and that I've changed dual_layers.py to use 'reshape' instead of 'view'.

riceric22 commented 4 years ago

Hi Sungyoon,

I don't currently have access to a Titan X to verify this exactly, but are you running the script with exact bound computation? The numbers in the paper reflect the use of random Cauchy projections described in section 3.2 (I believe with 50 random projections). Running the exact bound computation will of course be slower.
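For intuition, here is a minimal sketch of that estimator (simplified, not the repo's implementation): projecting onto i.i.d. standard Cauchy directions yields samples whose absolute median estimates the l1 norm, which is what makes the approximate bound so much cheaper than the exact one.

```python
# Illustrative sketch only (not the repo's code): estimating an l1 norm with
# random Cauchy projections, the idea behind the --proj option.
import torch

def l1_norm_median_estimate(x, k=50):
    # x: (batch, d). If c has i.i.d. standard Cauchy entries, c . x is Cauchy
    # with scale ||x||_1, so the median of |projections| estimates ||x||_1.
    d = x.shape[1]
    cauchy = torch.distributions.Cauchy(0.0, 1.0).sample((k, d))  # (k, d)
    proj = x @ cauchy.t()                                         # (batch, k)
    return proj.abs().median(dim=1).values                        # (batch,)

x = torch.randn(4, 784)
print(l1_norm_median_estimate(x, k=50))  # roughly matches the exact norms below
print(x.abs().sum(dim=1))
```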

~Eric

sungyoon-lee commented 4 years ago

@riceric22 Thank you for the quick response. I ran the script as follows:

server:~/convex_adversarial$ python examples/mnist.py

I also tried it with the projection argument:

server:~/convex_adversarial$ python examples/mnist.py --proj 50

but this runs at a similar speed (0.18 s/batch × 1200 batches ≈ 216 s/epoch). I thought it was slow because I was using a single GPU instead of 4, and indeed, when I tried

server:~/convex_adversarial$ python examples/mnist.py --proj 50 --cuda_ids 0,1,2,3

the speed is similar to that reported in the paper (0.08 s/batch × 1200 batches ≈ 96 s/epoch). Moreover, I can't run cifar.py with the default settings because of an out-of-memory error, so I have to pass --cuda_ids 0,1,2,3. Even so, I couldn't run cifar.py with the 'large' network on 4 GPUs, or even on 8.

riceric22 commented 4 years ago

Hi Sungyoon,

In addition to adding --proj 50 you also need to specify --norm_train l1_median and --norm_test l1_median to use the median estimator for random projections during training and testing, otherwise it will still compute the exact bound (this is why you see the same speed). I realize this wasn't well documented in the code, thanks for bringing this up. MNIST definitely doesn't need more than one GPU, and also note that for MNIST it's possible to use even fewer random projections (e.g. 10) and still get comparable results.

Computing exact bounds on CIFAR10 does, however, run out of memory due to the increased input size. In my experience it is not possible to run the exact bound on more than one example at a time; so during training, make sure you use the random projections to get the speeds reported in the paper, which are meant for scaling these approaches.
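As a rough illustration of how these pieces fit together, here is a sketch of a training step (not verbatim from examples/mnist.py; the proj and norm_type keyword names are inferred from the command-line flags and should be treated as an assumption):

```python
# Sketch only: assumes robust_loss(model, epsilon, X, y) accepts proj and
# norm_type keyword arguments matching the --proj / --norm_train flags above.
import torch
from convex_adversarial import robust_loss

def robust_step(model, opt, X, y, epsilon=0.1):
    # 50 random Cauchy projections with the l1_median estimator give the fast
    # approximate bound; leaving norm_type at its default computes the exact
    # bound, which is why --proj alone did not change the speed.
    loss, err = robust_loss(model, epsilon, X, y, proj=50, norm_type='l1_median')
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item(), err

# For exact-bound evaluation on CIFAR10, memory may only allow one example at a
# time, e.g. looping over X[i:i+1], y[i:i+1] instead of a full batch.
```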

~Eric

sungyoon-lee commented 4 years ago

@riceric22 Thank you! The code now runs fast, even faster than reported in the paper (0.03 s/batch × 1200 batches ≈ 36 s/epoch):

server:~/convex_adversarial$ python examples/mnist.py --proj 50 --norm_train l1_median --norm_test l1_median

However, it produces NaN losses (in 3 trials), and I suspect it is only faster because of this error. The same NaN loss error also occurs on CIFAR-10.

riceric22 commented 4 years ago

It seems that somewhere after PyTorch 1.0 an underlying change introduced NaNs into the projection code: I can run training normally without NaNs in my PyTorch 1.0 environment, but I can reproduce the NaNs in my PyTorch 1.2 environment.

I'll take a look and try to narrow down what happened here, but you should be able to run this normally with PyTorch 1.0.
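In the meantime, a generic way to localize where the NaNs first appear is PyTorch's built-in anomaly detection (a debug-only sketch, not specific to this repo, and it slows training considerably):

```python
# Debug-only sketch: autograd anomaly detection reports the operation that first
# produces NaN gradients; loss_fn stands in for whatever computes the robust loss.
import torch

def debug_step(model, loss_fn, opt, X, y):
    with torch.autograd.detect_anomaly():
        loss = loss_fn(model, X, y)
        if torch.isnan(loss):
            raise RuntimeError("NaN appeared in the forward pass")
        opt.zero_grad()
        loss.backward()  # anomaly mode flags the backward op producing NaN grads
    opt.step()
    return loss.item()
```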

sungyoon-lee commented 4 years ago

Thank you very much, Eric. I tried it in a PyTorch 1.0.0 environment, and it works with no errors!

pdebartol commented 1 year ago

Did anyone manage to reproduce the CIFAR experiments in a more recent PyTorch environment (>=1.4.0) without getting NaNs with projections?