showlab / EgoVLP

[NeurIPS2022] Egocentric Video-Language Pretraining
https://arxiv.org/pdf/2206.01670.pdf

Commands to MQ Training with VSGN #1

Closed JunweiLiang closed 2 years ago

JunweiLiang commented 2 years ago

Hi, thanks for releasing the code!

Could you provide some instructions on how to run VSGN training with EgoVLP features (hyper-parameters, learning rate, etc.)? Thanks!

Junwei

QinghongLin commented 2 years ago

Hello Junwei,

Thanks for your interest in our work! I will update the instructions and related details for MQ soon.

Thank you for your patience!

QinghongLin commented 2 years ago

Hi Junwei,

I have uploaded the video features for the MQ task to Google Drive: train&val / test, so you can download them directly. All you need to do is replace the input features with ours. I have also attached the config of our best VSGN model here: config.txt.

Please try it out and let us know if you have new results.

JunweiLiang commented 2 years ago

I have downloaded the features, but they seem to be a single file. Are they a single pickle binary with dictionary keys? How do I read them and map them to the videos (for comparison, slowfast8x8_r101_k400/ has 9645 *.pt files, each corresponding to a video)?

Thanks.

QinghongLin commented 2 years ago

It is a gz file; after extracting it (I extracted it on my Mac), you will see a directory containing multiple *.pt files, e.g., 0a8f6747-7f79-4176-85ca-f5ec01a15435.pt. Each .pt file contains the video features of one clip, here the clip 0a8f6747-7f79-4176-85ca-f5ec01a15435.

The clip information is provided by the MQ metadata, i.e., clip xxx comes from video yyy with start time t1 and end time t2.
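
For illustration, a minimal sketch of loading one clip-level feature file and looking up its parent video in the MQ metadata. The feature directory name and the annotation field names (video_uid, clip_start_sec, clip_end_sec) are assumptions about the layout, not confirmed by the thread:

    import json
    import torch

    clip_uid = "0a8f6747-7f79-4176-85ca-f5ec01a15435"

    # per-clip feature tensor saved by the extractor (directory name is hypothetical)
    feats = torch.load(f"egovlp_feats/{clip_uid}.pt", map_location="cpu")
    print(clip_uid, feats.shape)

    # map the clip back to its source video via the MQ clip annotations
    # (assumed here to be keyed by clip UID; the real JSON may nest this differently)
    with open("clip_annotations.json") as f:
        clip_anno = json.load(f)
    info = clip_anno[clip_uid]
    print("video:", info["video_uid"],
          "start:", info["clip_start_sec"], "end:", info["clip_end_sec"])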

JunweiLiang commented 2 years ago

I see. The file you provided on Google Drive is a .tar.gz file; I extracted it with tar -zxf and got 2034 *.pt files for the train/val part. Will try them.

JunweiLiang commented 2 years ago

So 0a8f6747-7f79-4176-85ca-f5ec01a15435 is the clip ID rather than the video ID? Could you provide whole-video feature files, as used by the VSGN baseline? It reads the features of the whole video and then cuts out the corresponding clip (see here). To follow your instructions, I would need these video-level features.

Thanks.

QinghongLin commented 2 years ago

Yes, it is the clip ID. Sorry, I am currently unable to provide video-level features; a solution is to rewrite the data loader so that it supports clip features as input.
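
As a rough sketch, a clip-level loader could index the per-clip *.pt files directly instead of cutting windows out of video-level features. The class and argument names below are illustrative only, not the actual VSGN data loader:

    import os
    import torch
    from torch.utils.data import Dataset

    class ClipFeatureDataset(Dataset):
        def __init__(self, feature_dir, clip_uids):
            self.feature_dir = feature_dir
            self.clip_uids = clip_uids  # e.g. the clip UIDs listed in the MQ annotations

        def __len__(self):
            return len(self.clip_uids)

        def __getitem__(self, idx):
            clip_uid = self.clip_uids[idx]
            # [T, feat_dim] tensor saved per clip; no video-level cutting is needed
            # because the released features are already clip-level.
            feats = torch.load(os.path.join(self.feature_dir, clip_uid + ".pt"),
                               map_location="cpu")
            return clip_uid, feats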

srama2512 commented 2 years ago

@QinghongLin - Thanks for providing the clip features. I tried training the VSGN model using the Ego4D episodic-memory codebase instructions. But I'm not able to reproduce the val results from the paper. The numbers are quite a bit lower than the paper results (2nd row vs. 3rd row in the figure below).

[image: table comparing the reproduced val results with the paper's val results]

Here is the training command I used. Note: I modified the data loader to use clip features instead of video features.

 python Train.py \
     --use_xGPN \
     --is_train true \
     --dataset ego4d \
     --feature_path data/egovlp_feats_official \
     --checkpoint_path checkpoints/ \
     --tb_dir tb/ \
     --batch_size 24 \
     --train_lr 0.00005 \
     --use_clip_features true \
     --input_feat_dim 256 \
     --num_epoch 100
QinghongLin commented 2 years ago

Hi @srama2512, I released the codebase here: MQ.zip; you can check the data loader details regarding clip-level feature loading there. I also checked the config parameters; can you try the following?

    {'dataset': 'ego4d', 'is_train': 'true', 'out_prop_map': 'true',
     'feature_path': '/mnt/sdb1/Datasets/Ego4d/action_feature_canonical',
     'clip_anno': 'Evaluation/ego4d/annot/clip_annotations.json',
     'moment_classes': 'Evaluation/ego4d/annot/moment_classes_idx.json',
     'checkpoint_path': 'checkpoint',
     'output_path': './outputs/hps_search_egovlp_egonce_features/23/',
     'prop_path': 'proposals', 'prop_result_file': 'proposals_postNMS.json',
     'detect_result_file': 'detections_postNMS.json',
     'retrieval_result_file': 'retreival_postNMS.json',
     'detad_sensitivity_file': 'detad_sensitivity',
     'batch_size': 32, 'train_lr': 5e-05, 'weight_decay': 0.0001,
     'num_epoch': 50, 'step_size': 15, 'step_gamma': 0.1,
     'focal_alpha': 0.01, 'nms_alpha_detect': 0.46, 'nms_alpha_prop': 0.75,
     'nms_thr': 0.4, 'temporal_scale': 928, 'input_feat_dim': 2304,
     'bb_hidden_dim': 256, 'decoder_num_classes': 111,
     'num_levels': 5, 'num_head_layers': 4, 'nfeat_mode': 'feat_ctr',
     'num_neigh': 12, 'edge_weight': 'false', 'agg_type': 'max',
     'gcn_insert': 'par', 'iou_thr': [0.5, 0.5, 0.7], 'anchor_scale': [1, 10],
     'base_stride': 1, 'stitch_gap': 30, 'short_ratio': 0.4,
     'clip_win_size': 0.38, 'use_xGPN': False, 'use_VSS': False,
     'num_props': 200, 'tIoU_thr': [0.1, 0.2, 0.3, 0.4, 0.5],
     'eval_stage': 'all', 'infer_datasplit': 'val'}

srama2512 commented 2 years ago

@QinghongLin - Thanks for sharing your code and the hyperparameters. I was able to obtain similar performance. It turns out there was a bug in the test_mq.py feature-extraction code that I used: I had modified test_mq.py to increase the batch size here to 128. https://github.com/showlab/EgoVLP/blob/dc4a60f2dd7fcdc3206ac05d4f452b1c85361ab2/run/test_mq.py#L77-L87

The calculation times = data['video'].shape[0] // batch does not work when the video length is not a multiple of the batch size, and it gets much worse as the batch size increases, leaving a residual block of all-zero features at the end. After changing that part of the code to the snippet below, it works as expected.

# round up so the final partial batch is still processed
if data['video'].shape[0] % batch == 0:
    times = data['video'].shape[0] // batch
else:
    times = data['video'].shape[0] // batch + 1

Happy to send a PR if you'd like this bug-fix to be part of the EgoVLP repo. The same pattern appears in most of the test_*.py scripts and causes a significant issue if anyone increases the batch size.
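
For reference, a self-contained sketch of the same fix written as a single ceil division; the tensor and the doubling operation below are dummy placeholders for data['video'] and the model call, not the repo's actual code:

    import torch

    video = torch.randn(300, 768)   # dummy stand-in; 300 is not a multiple of 128
    batch = 128
    times = (video.shape[0] + batch - 1) // batch   # ceil division, same effect as the if/else above

    outputs = []
    for i in range(times):
        chunk = video[i * batch:(i + 1) * batch]    # the last chunk is smaller, not dropped
        outputs.append(chunk * 2)                   # placeholder for the model forward pass
    outputs = torch.cat(outputs, dim=0)
    assert outputs.shape[0] == video.shape[0]       # no all-zero residual at the end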