In the paper, we report the throughput in the inference stage, which allows a direct comparison with the numbers reported in other papers. Comparing throughput in the training stage is hard for large models trained on different systems.
In your experiment setup, the main difference comes from the batch size (200 vs. 48), which explains the difference in running time. When the batch size is fixed to the same value, the running times should be comparable. One thing to note: a larger batch size does not necessarily improve the results (it sometimes yields worse results than a smaller batch size in my experiments); instead, the number of training iterations matters more. See the sketch below for a rough sense of this trade-off.
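To make the batch-size/iteration trade-off concrete, here is a minimal back-of-the-envelope sketch (my own arithmetic, not code from the repo), assuming ImageNet-1K with roughly 1.28M training images: with the epoch count held fixed, a larger batch size simply means fewer optimizer steps.

```python
# Hypothetical illustration (not from the EsViT repo): iterations per epoch for
# different batch sizes, assuming ImageNet-1K (~1.28M training images).
NUM_TRAIN_IMAGES = 1_281_167
EPOCHS = 300

for batch_size in (48, 200, 512):
    iters_per_epoch = NUM_TRAIN_IMAGES // batch_size
    total_iters = iters_per_epoch * EPOCHS
    print(f"batch={batch_size:4d}  iters/epoch={iters_per_epoch:6d}  total={total_iters:,}")
```

At a fixed epoch budget, the batch-48 run takes roughly 10x more optimizer steps than the batch-512 run, which is the sense in which iteration count, rather than batch size, is the quantity to watch.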
To reproduce the results in Table 1, I used batch size = 512 and 300 epochs. It seems that you are using 100 epochs.
More details for this experiment: I used 16 GPUs, and the mixcut augmentation is always off:
2021-05-08 06:27:14 arch: swin_tiny
2021-05-08 06:27:14 batch_size_per_gpu: 32
2021-05-08 06:27:14 cfg: experiments/imagenet/swin/swin_tiny_patch4_window7_224.yaml
2021-05-08 06:27:14 clip_grad: 3.0
2021-05-08 06:27:14 dist_url: env://
2021-05-08 06:27:14 epochs: 300
2021-05-08 06:27:14 freeze_last_layer: 1
2021-05-08 06:27:14 global_crops_scale: (0.4, 1.0)
2021-05-08 06:27:14 gpu: 0
2021-05-08 06:27:14 local_crops_number: 8
2021-05-08 06:27:14 local_crops_scale: (0.05, 0.4)
2021-05-08 06:27:14 local_rank: 0
2021-05-08 06:27:14 lr: 0.0005
2021-05-08 06:27:14 min_lr: 1e-06
2021-05-08 06:27:14 momentum_teacher: 0.996
2021-05-08 06:27:14 norm_last_layer: False
2021-05-08 06:27:14 num_workers: 10
2021-05-08 06:27:14 optimizer: adamw
2021-05-08 06:27:14 opts: []
2021-05-08 06:27:14 out_dim: 65536
2021-05-08 06:27:14 patch_size: 16
2021-05-08 06:27:14 rank: 0
2021-05-08 06:27:14 saveckp_freq: 20
2021-05-08 06:27:14 seed: 0
2021-05-08 06:27:14 teacher_temp: 0.07
2021-05-08 06:27:14 use_bn_in_head: False
2021-05-08 06:27:14 use_dense_prediction: True
2021-05-08 06:27:14 use_fp16: True
2021-05-08 06:27:14 warmup_epochs: 10
2021-05-08 06:27:14 warmup_teacher_temp: 0.04
2021-05-08 06:27:14 warmup_teacher_temp_epochs: 30
2021-05-08 06:27:14 weight_decay: 0.04
2021-05-08 06:27:14 weight_decay_end: 0.4
2021-05-08 06:27:14 world_size: 16
2021-05-08 06:27:14 zip_mode: True
2021-05-08 06:56:25 Epoch: [0/300] Total time: 0:28:45 (0.689572 s / it)
2021-05-08 06:56:25 Averaged stats: loss: 5.194830 (7.439698) lr: 0.000100 (0.000050) wd: 0.040010 (0.040003)
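For reference, here is a small sketch (my own arithmetic, not a script from the repo) of how the logged "s / it" translates into an effective training throughput across all 16 GPUs. Note that this is training throughput, which is a different quantity from the inference throughput reported in Table 1.

```python
# Hypothetical calculation (not from the repo): effective training throughput
# implied by the log above (batch_size_per_gpu=32, world_size=16, 0.689572 s/it).
batch_size_per_gpu = 32
world_size = 16
sec_per_iter = 0.689572  # "(0.689572 s / it)" from the epoch-0 log line

global_batch = batch_size_per_gpu * world_size   # 512 images per optimizer step
images_per_sec = global_batch / sec_per_iter     # roughly 742 images/s across 16 GPUs
print(f"global batch = {global_batch}, training throughput ~ {images_per_sec:.0f} img/s")
```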
Ok I see, thank you @ChunyuanLI !
Hello, I have read your paper and found it very interesting. I was particularly intrigued by Table 1, where you compare throughput against other methods, including DINO with deit_tiny and a patch size of 16. From the table, EsViT with Swin-T (W=7) has a throughput of 808 and DINO with DeiT-T/16 has 1007, so I expected EsViT to be roughly 20% slower. Yet when I run both, I do not see this. I have attached both logs below.
DINO
EsViT
So EsViT (with swin_tiny, W=7) is about 3 times slower than DINO (with deit_tiny and P=16). This was run on a machine with 4x V100 GPUs. In both cases, I set the batch size to roughly the highest value I could without hitting out-of-memory errors.
Is it the case that my run of EsViT should correspond to this row in Table 1?
If so, do you know why I am getting such contradictory results?
Thank you!