In the paper, we report the throughput in the inference stage, which allows a direct comparison with the numbers reported in other papers. Comparing throughput in the training stage is hard for large models trained on different systems.
In your experiment setup, the main difference comes from the batch size (200 vs. 48), which explains the difference in running time. When the batch size is fixed to the same value, the running times should be comparable. One thing to note: a larger batch size does not necessarily improve the results (it sometimes yields worse results than a smaller batch size in my experiments); instead, the number of training iterations matters more. See the sketch below for a rough sense of this trade-off.
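To make the batch-size/iteration trade-off concrete, here is a minimal back-of-the-envelope sketch (my own arithmetic, not code from the repo), assuming ImageNet-1K with roughly 1.28M training images: with the epoch count held fixed, a larger batch size simply means fewer optimizer steps.

```python
# Hypothetical illustration (not from the EsViT repo): iterations per epoch for
# different batch sizes, assuming ImageNet-1K (~1.28M training images).
NUM_TRAIN_IMAGES = 1_281_167
EPOCHS = 300

for batch_size in (48, 200, 512):
    iters_per_epoch = NUM_TRAIN_IMAGES // batch_size
    total_iters = iters_per_epoch * EPOCHS
    print(f"batch={batch_size:4d}  iters/epoch={iters_per_epoch:6d}  total={total_iters:,}")
```

At a fixed epoch budget, the batch-48 run takes roughly 10x more optimizer steps than the batch-512 run, which is the sense in which iteration count, rather than batch size, is the quantity to watch.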
To reproduce the results in Table 1, I used batch size = 512 and 300 epochs. It seems that you are using 100 epochs.
More details for this experiment: I used 16 GPUs, and the mixcut augmentation is always off:
2021-05-08 06:27:14 arch: swin_tiny
2021-05-08 06:27:14 batch_size_per_gpu: 32
2021-05-08 06:27:14 cfg: experiments/imagenet/swin/swin_tiny_patch4_window7_224.yaml
2021-05-08 06:27:14 clip_grad: 3.0
2021-05-08 06:27:14 dist_url: env://
2021-05-08 06:27:14 epochs: 300
2021-05-08 06:27:14 freeze_last_layer: 1
2021-05-08 06:27:14 global_crops_scale: (0.4, 1.0)
2021-05-08 06:27:14 gpu: 0
2021-05-08 06:27:14 local_crops_number: 8
2021-05-08 06:27:14 local_crops_scale: (0.05, 0.4)
2021-05-08 06:27:14 local_rank: 0
2021-05-08 06:27:14 lr: 0.0005
2021-05-08 06:27:14 min_lr: 1e-06
2021-05-08 06:27:14 momentum_teacher: 0.996
2021-05-08 06:27:14 norm_last_layer: False
2021-05-08 06:27:14 num_workers: 10
2021-05-08 06:27:14 optimizer: adamw
2021-05-08 06:27:14 opts: []
2021-05-08 06:27:14 out_dim: 65536
2021-05-08 06:27:14 patch_size: 16
2021-05-08 06:27:14 rank: 0
2021-05-08 06:27:14 saveckp_freq: 20
2021-05-08 06:27:14 seed: 0
2021-05-08 06:27:14 teacher_temp: 0.07
2021-05-08 06:27:14 use_bn_in_head: False
2021-05-08 06:27:14 use_dense_prediction: True
2021-05-08 06:27:14 use_fp16: True
2021-05-08 06:27:14 warmup_epochs: 10
2021-05-08 06:27:14 warmup_teacher_temp: 0.04
2021-05-08 06:27:14 warmup_teacher_temp_epochs: 30
2021-05-08 06:27:14 weight_decay: 0.04
2021-05-08 06:27:14 weight_decay_end: 0.4
2021-05-08 06:27:14 world_size: 16
2021-05-08 06:27:14 zip_mode: True
2021-05-08 06:56:25 Epoch: [0/300] Total time: 0:28:45 (0.689572 s / it)
2021-05-08 06:56:25 Averaged stats: loss: 5.194830 (7.439698) lr: 0.000100 (0.000050) wd: 0.040010 (0.040003)
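For reference, here is a small sketch (my own arithmetic, not a script from the repo) of how the logged "s / it" translates into an effective training throughput across all 16 GPUs. Note that this is training throughput, which is a different quantity from the inference throughput reported in Table 1.

```python
# Hypothetical calculation (not from the repo): effective training throughput
# implied by the log above (batch_size_per_gpu=32, world_size=16, 0.689572 s/it).
batch_size_per_gpu = 32
world_size = 16
sec_per_iter = 0.689572  # "(0.689572 s / it)" from the epoch-0 log line

global_batch = batch_size_per_gpu * world_size   # 512 images per optimizer step
images_per_sec = global_batch / sec_per_iter     # roughly 742 images/s across 16 GPUs
print(f"global batch = {global_batch}, training throughput ~ {images_per_sec:.0f} img/s")
```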
Ok I see, thank you @ChunyuanLI !
Hello, I have read your paper and found it very interesting. I was particularly intrigued by Table 1, where you compare throughput against other methods, including DINO with deit_tiny and a patch size of 16. From the table, EsViT with Swin-T (W=7) has a throughput of 808 and DINO with DeiT-T/16 has 1007, so I expected EsViT to be roughly 20% slower. Yet when I run both, I do not see this. I have attached both logs below.
DINO
EsViT
So EsViT (with swin_tiny, W=7) is about 3 times slower than DINO (with deit_tiny and P=16). This was run on a machine with 4x V100 GPUs. In both cases, I set the batch size to roughly the highest value I could without hitting out-of-memory errors.
Is it the case that my run of EsViT should correspond to this row in Table 1?
If so, do you know why I am getting such contradictory results?
Thank you!