taoyang1122 / adapt-image-models

[ICLR'23] AIM: Adapting Image Models for Efficient Video Action Recognition
Apache License 2.0

Training log available? #7

Closed simonJJJ closed 1 year ago

simonJJJ commented 1 year ago

Hi, thanks for the great work!

I wonder if there is a training log available for the CLIP-pretrained models?

taoyang1122 commented 1 year ago

Hi @simonJJJ, thanks for your interest in our work. I may not be able to provide the training logs, as I don't have access to them now.

simonJJJ commented 1 year ago

The vitclip_large_k400 config is not consistent with the paper, e.g., training num_frames, training frame_interval, ColorJitter, backbone lr_mult, warmup epochs, etc.

I simply ran the vitclip_large_k400 config from your repo but got top-1 acc = 85.69. So I would like to know the exact correct config.

Thanks.

taoyang1122 commented 1 year ago

Hi, sorry, we missed some implementation details in the paper. For ViT-L on K400, we use ColorJitter and a 0.1x backbone learning rate to alleviate overfitting. I have updated the config; you may try it again. The configs are for 8 GPUs with a total batch size of 64. Another possible reason for the performance gap is that the K400 videos may be different.
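For reference, the settings described above could look roughly like the following mmaction-style config fragment. This is only an illustrative sketch; the field names (`paramwise_cfg`, `custom_keys`, `videos_per_gpu`) and the base lr value are assumptions, so check the repo's updated vitclip_large_k400 config for the exact keys.

```python
# Hypothetical mmaction-style config fragment (field names and lr value
# are assumptions, not copied from the repo):
optimizer = dict(
    type='AdamW',
    lr=3e-4,  # base lr; the actual value in the repo may differ
    paramwise_cfg=dict(
        custom_keys={
            # 0.1x learning rate on the backbone to alleviate overfitting
            'backbone': dict(lr_mult=0.1),
        }),
)

train_pipeline_extra = [
    dict(type='ColorJitter'),  # added for ViT-L on K400
]

data = dict(videos_per_gpu=8)  # 8 GPUs -> total batch size 64
```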

simonJJJ commented 1 year ago

Hi,

I directly evaluated your pretrained model ViT-L/14 32x3x1 on K400 using the updated config that you fixed.

However, I get top-1 acc = 86.23; adding ThreeCrop at inference gives top-1 acc = 86.69. The result is still far from the paper's reported top-1 acc = 87.5. My validation set has 19,877 valid videos.
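For anyone unfamiliar with ThreeCrop inference: it simply averages the per-crop class scores for each video before taking the argmax. A minimal numpy sketch (the function name and array shapes are illustrative, not the repo's API):

```python
import numpy as np

def three_crop_top1(scores, labels):
    """Top-1 accuracy with ThreeCrop test-time augmentation.

    scores: (num_videos, num_views, num_classes) per-view class scores,
            where num_views = clips x crops (e.g. 1 clip x 3 crops).
    labels: (num_videos,) ground-truth class ids.
    """
    avg = scores.mean(axis=1)   # fuse the three spatial crops per video
    pred = avg.argmax(axis=1)   # top-1 prediction per video
    return float((pred == labels).mean())

# toy example: 2 videos, 3 crops, 4 classes
scores = np.array([
    [[0.1, 0.7, 0.1, 0.1],
     [0.2, 0.6, 0.1, 0.1],
     [0.1, 0.5, 0.2, 0.2]],   # video 0: class 1 wins after averaging
    [[0.6, 0.1, 0.2, 0.1],
     [0.1, 0.2, 0.6, 0.1],
     [0.1, 0.1, 0.7, 0.1]],   # video 1: class 2 wins after averaging
])
labels = np.array([1, 2])
print(three_crop_top1(scores, labels))  # -> 1.0
```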

simonJJJ commented 1 year ago

After email discussions with the co-authors, it turns out the cause is a different validation set. With the K400 val set from link, I can reproduce the ViTClip-L result with top-1 acc = 87.3 and 87.61 (w/ ThreeCrop).

Hope it's helpful to others.