nmndeep / revisiting-at

[NeurIPS 2023] Code for the paper "Revisiting Adversarial Training for ImageNet: Architectures, Training and Generalization across Threat Models"

imagenet1k and imagenet21k pre-train #8

Open xiaoyunxxy opened 5 months ago

xiaoyunxxy commented 5 months ago

Hi, thank you for your nice work, it's really enlightening, especially the use of clean pre-trained checkpoints on ImageNet-1K and ImageNet-21K.

I found that you use `elif modelname == 'vit_s': model = create_model('vit_small_patch16_224', pretrained=pretrained)` and `elif modelname == 'vit_s_21k': model = create_model('deit3_small_patch16_224_in21ft1k', pretrained=pretrained)` to load pre-trained checkpoints.

Do you mean `vit_small_patch16_224` for ImageNet-1K pre-training? But when I print `model.default_cfg`, it shows the model is `timm/vit_small_patch16_224.augreg_in21k_ft_in1k`. Does it actually load a checkpoint pre-trained on ImageNet-21K?
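
(For reference, a minimal sketch of how to check which pretrained weights a timm model name resolves to; the exact cfg contents and weight tag depend on the installed timm version, so treat the printed output as an assumption:)

```python
# Minimal sketch: inspect which pretrained weights a timm model name resolves to.
# Note: the exact cfg contents depend on the installed timm version (assumption).
from timm import create_model

model = create_model('vit_small_patch16_224', pretrained=False)
print(model.default_cfg)  # shows the default weight tag, e.g. an augreg_in21k_ft_in1k variant
```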

nmndeep commented 5 months ago

Hi, for training with ViT-S, we used deit_s from here, and the specific key was `deit_small_patch16_224` - this should also be listed in the appendix. I think the confusion stems from the fact that we also have `vit_s` listed in the same file - I will clean this up soon. Hope this helps.
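
(A minimal sketch of loading that key via timm, mirroring the snippets quoted above; which weights it actually resolves to depends on the timm version, so verify the tag before fine-tuning:)

```python
# Sketch: load the DeiT-S key mentioned above through timm.
# The weights it resolves to depend on the installed timm version (assumption).
from timm import create_model

model = create_model('deit_small_patch16_224', pretrained=True)
print(model.default_cfg)  # check the pretrain tag before fine-tuning
```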

xiaoyunxxy commented 5 months ago

Hi, thank you for your reply. But it's still unclear to me why the 21k-pre-trained models don't provide better results. I tried both 1k-pre-trained and 21k-pre-trained checkpoints as starting points for PGD fine-tuning (50 epochs), with the same hyper-parameters as in your paper. I got clean / PGD-20 accuracy of 69.03 / 39.70 (1K) and 71.48 / 42.18 (21K). The difference is quite obvious. Is there any possible reason for that? Thank you for your time.

nmndeep commented 5 months ago

Hi, I cannot say for sure what the reason for this behaviour is, and I do not know the full setting you worked with. Note that we used 2-step APGD, not PGD (although in the 2-step setting the difference between the two might be marginal), and full AA for evaluation. For short training (like 50 epochs) we saw some improvement (Table 15 in the appendix, for ConvNeXt) - again tested with full AA - although we did not test it for ViT-S. There could also be some additional influence of model size/initialisation interacting with the training regime on robustness that is not captured by our results. For the longer training regime, perhaps the model moves far enough from its initialisation that small improvements in clean accuracy (at initialisation) no longer translate into gains after adversarial training.
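
(A minimal sketch of the two evaluations mentioned above, assuming the standard autoattack package; the `apgd` attribute name and defaults may differ across versions, so treat those details as assumptions:)

```python
# Sketch: full AutoAttack vs. a cheap few-step APGD-CE evaluation.
from autoattack import AutoAttack

def evaluate_robustness(model, x_test, y_test, eps=4/255):
    # Full AutoAttack, as used for the final robust-accuracy numbers.
    aa = AutoAttack(model, norm='Linf', eps=eps, version='standard')
    aa.run_standard_evaluation(x_test, y_test, bs=128)

    # Cheap proxy: APGD-CE only, with very few iterations (e.g. 2-step).
    apgd = AutoAttack(model, norm='Linf', eps=eps, version='custom',
                      attacks_to_run=['apgd-ce'])
    apgd.apgd.n_iter = 2  # assumption: attribute name as in current autoattack releases
    apgd.run_standard_evaluation(x_test, y_test, bs=128)
```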

xiaoyunxxy commented 5 months ago

Hi, thank you for your explanation. I tried training ViT-B (50 epochs) with your code, using an ImageNet-1K pre-trained model, to reproduce the results for ViT-B at 50 epochs (73.3, 50.0). But I got much lower results: clean acc 67.06, AA acc 39.52. I think maybe I used the wrong hyperparameters. Could you please help me check the params.json?

Below are the hyperparameters:

```json
{
  "model.arch": "vit_b_1n1k", "model.pretrained": 1, "model.ckpt_path": "", "model.add_normalization": 0,
  "model.not_original": 0, "model.updated": 0, "model.model_ema": 0.0, "model.freeze_some": 0, "model.early": 1,
  "resolution.min_res": 224, "resolution.max_res": 224, "resolution.end_ramp": 0, "resolution.start_ramp": 0,
  "data.train_dataset": "../../data/imagenet/train/", "data.val_dataset": "../../data/imagenet/val/",
  "data.num_workers": 12, "data.in_memory": 1, "data.seed": 0, "data.augmentations": 0,
  "lr.step_ratio": 0.1, "lr.step_length": 30, "lr.lr_schedule_type": "cosine", "lr.lr": 0.001, "lr.lr_peak_epoch": 10,
  "logging.folder": "./imagenet/", "logging.log_level": 2, "logging.save_freq": 2, "logging.addendum": "re_vit_b_50epoch",
  "validation.batch_size": 64, "validation.resolution": 224, "validation.lr_tta": 0, "validation.precision": "fp16",
  "training.eval_only": 0, "training.batch_size": 64, "training.optimizer": "adamw", "training.momentum": 0.9,
  "training.weight_decay": 0.05, "training.epochs": 50, "training.label_smoothing": 0.1, "training.distributed": 1,
  "training.use_blurpool": 0, "training.precision": "fp16",
  "dist.world_size": 8, "dist.address": "localhost", "dist.port": "12355",
  "adv.attack": "apgd", "adv.norm": "Linf", "adv.eps": 0.01568627450980392, "adv.n_iter": 2, "adv.verbose": 0,
  "adv.noise_level": 1.0, "adv.skip_projection": 0, "adv.alpha": 1.0,
  "misc.notes": "", "misc.use_channel_last": 1
}
```
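
(A quick sanity check of the perturbation budget in the config above; this is just arithmetic, the reading of adv.eps as 4/255 is mine:)

```python
# The adv.eps value in the config equals 4/255, the usual ImageNet Linf budget.
print(4 / 255)  # 0.01568627450980392
```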

I use this line of code to load the in1k pre-trained checkpoint: `model = create_model('vit_base_patch16_224.augreg_in1k', pretrained=pretrained)`

nmndeep commented 5 months ago

Hi, we do post-hoc best-checkpoint selection with a few-step APGD (try 1 or 5 steps) for the Linf norm on a separate validation set, as opposed to the RobustBench test set. I do not know if you are doing this. Also, I am not sure if you are using the same data augmentations as we list in the paper. Some things in your params that look different: model_ema is set to 0, it should be 1; the effective training batch size for this setup was 864 (96x9) in the distributed setting (this should ideally have a minimal impact if your effective BS is more or less the same as ours). I will try to recover the params.json for this particular setting - but I am not sure this is available anymore. Hope this helps.
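
(A minimal sketch of the post-hoc checkpoint selection described above; the helper names and checkpoint layout are hypothetical, and the few-step APGD evaluator is assumed to be supplied by the caller:)

```python
# Hypothetical sketch: pick the checkpoint with the best few-step APGD robust
# accuracy on a held-out validation split (NOT the RobustBench test set).
import glob
import torch

def select_best_checkpoint(model, ckpt_dir, val_loader, robust_acc_fn):
    # robust_acc_fn(model, val_loader) is assumed to run a 1- or 5-step APGD
    # evaluation and return robust accuracy on the validation split.
    best_acc, best_path = -1.0, None
    for path in sorted(glob.glob(f"{ckpt_dir}/*.pt")):
        model.load_state_dict(torch.load(path, map_location='cpu'))
        acc = robust_acc_fn(model, val_loader)
        if acc > best_acc:
            best_acc, best_path = acc, path
    return best_path, best_acc
```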