mx-mark / VideoTransformer-pytorch

PyTorch implementation of a collection of scalable Video Transformer benchmarks.

Example training command/performance #10

Closed Enclavet closed 2 years ago

Enclavet commented 2 years ago

Trying to get top1_acc of >78 as shown in the example log.

Do we know the settings and dataset used for training?

I am training on K400 and using the command in the example:

```shell
python model_pretrain.py \
    -lr 0.005 \
    -pretrain 'vit' \
    -objective 'supervised' \
    -epoch 30 \
    -batch_size 8 \
    -num_workers 4 \
    -arch 'timesformer' \
    -attention_type 'divided_space_time' \
    -num_frames 8 \
    -frame_interval 32 \
    -num_class 400 \
    -optim_type 'sgd' \
    -lr_schedule 'cosine' \
    -root_dir ROOT_DIR \
    -train_data_path TRAIN_DATA_PATH \
    -val_data_path VAL_DATA_PATH
```

I am unable to get above 73. Increasing frame_interval does not help.

Curious what I can do to get similar performance.

mx-mark commented 2 years ago

@Enclavet First, the TimeSformer results shown in this repo are pretrained on K600; on K400 it can achieve around 77%. Getting similar performance largely depends on your hparams. Could you please show me the hparams logged before training starts?

Enclavet commented 2 years ago

Attaching hparams:

```
Namespace(lr=0.005, epoch=15, gpus=-1, nccl_ifname='lan2', batch_size=8, num_workers=4, log_interval=30, save_ckpt_freq=20, num_class=400, num_samples_per_cls=10000, arch='timesformer', attention_type='divided_space_time', pretrain='vit', optim_type='sgd', lr_schedule='cosine', objective='supervised', resume=False, resume_from_checkpoint=None, num_frames=8, frame_interval=40, seed=0, train_data_path='/home/ec2-user/train_list.txt', val_data_path='/home/ec2-user/val_list.txt', test_data_path=None, root_dir='/home/ec2-user/workdir')
```

mx-mark commented 2 years ago

@Enclavet The hparams are almost the same as my experiment settings, except that I set the epoch to 30 for the cosine lr schedule and 32 for the frame interval. You can try the default settings and see the final result. By the way, why did you choose 40 for the frame interval? In my opinion, 32 is enough to cover the entire video at 25fps. So what I am wondering is how you perform the data preparation for K400, and whether you have aligned the fps of each video sample.
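For reference, the coverage claim above can be checked with a small calculation (a sketch, not code from this repo): a clip of `num_frames` frames sampled `frame_interval` apart spans `num_frames * frame_interval` raw frames of the source video.

```python
def clip_span_seconds(num_frames: int, frame_interval: int,
                      fps: float = 25.0) -> float:
    # A clip of num_frames frames sampled frame_interval apart spans
    # num_frames * frame_interval raw frames of the source video.
    return num_frames * frame_interval / fps

# 8 frames at interval 32 span 256 frames, i.e. ~10.24 s at 25 fps,
# slightly more than a full 10 s Kinetics clip.
print(clip_span_seconds(8, 32))
```

With interval 40, the sampler would need 8 × 40 = 320 frames (12.8 s at 25fps), more than a 10 s clip contains, which forces the loader to clamp or repeat frames.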

Enclavet commented 2 years ago

@mx-mark My data comes from this repo: https://github.com/cvdfoundation/kinetics-dataset.

Videos appear to be 10 seconds in length at around 25-30fps (not all the same). Are you doing any more data preparation beyond downloading the video + cutting the relevant section?

As mentioned, a 32 frame interval with 8 frames should cover most videos, and I was using 40 as a test. I have also trained with 32 and got similar performance. Actually the best val acc_top1 was ~75 after 15 epochs, not ~73 as mentioned earlier.

Do you think more epochs will help? I notice that at some point accuracy stops improving with more epochs and can actually decrease.
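One thing worth noting about the epochs question: with `-lr_schedule 'cosine'`, the learning rate is annealed over the whole run, so changing `-epoch` reshapes the LR at every step rather than just adding training time. A minimal sketch of the usual cosine decay (the repo's exact implementation may differ, e.g. it may add warmup):

```python
import math

def cosine_lr(step: int, total_steps: int, base_lr: float = 0.005) -> float:
    # Standard cosine annealing from base_lr down to 0. total_steps
    # appears in the formula, so a 15-epoch and a 30-epoch run follow
    # different LR curves from the very first step.
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))

print(cosine_lr(0, 30))    # start of a 30-epoch run: base_lr
print(cosine_lr(30, 30))   # end of the run: 0.0
```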

mx-mark commented 2 years ago

@Enclavet Normally, we resample all videos to the same fps.
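One way to do this (an illustrative helper, not part of this repo) is to build an ffmpeg invocation per clip: `-r` sets the output frame rate, and a `scale` filter expression resizes the short side while keeping the aspect ratio.

```python
def ffmpeg_resample_cmd(src: str, dst: str, fps: int = 25,
                        short_side: int = 225) -> list:
    # Build (but do not run) an ffmpeg command that re-encodes a clip
    # to a fixed frame rate and a fixed short side. The
    # if(gt(iw,ih),...) expressions pick which edge gets short_side;
    # -2 lets ffmpeg choose an even value for the other edge, as most
    # codecs require.
    scale = (f"scale='if(gt(iw,ih),-2,{short_side})'"
             f":'if(gt(iw,ih),{short_side},-2)'")
    return ["ffmpeg", "-y", "-i", src, "-vf", scale, "-r", str(fps), dst]

# e.g.: subprocess.run(ffmpeg_resample_cmd("clip.mp4", "clip_25fps.mp4"),
#                      check=True)
```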

Enclavet commented 2 years ago

I aligned my dataset to 225 dimensions and 25fps and ran training on K400. I was able to achieve >76 top1 acc.

Running it with K600 now.

Enclavet commented 2 years ago

Was never able to achieve >78 top1 acc without modifying num_frames and frame_interval.

Was able to achieve >78 top1 acc on K600 with num_frames = 12 and frame_interval set to 20.

This is with a dataset from https://github.com/cvdfoundation/kinetics-dataset resampled to 25fps and aligned to 225 dimensions.

Closing this as I am happy with the performance.