Closed nqx12348 closed 1 year ago
Hi, thanks for your interest and question.
If I remember correctly, I do not apply any postprocessing; I directly use the features extracted by the HERO authors.
There are many possible reasons. Do you strictly follow the steps of the HERO video feature extractor (including feature normalization and how the two streams are concatenated)? My suggestion is to probe the extracted features of a random video and check whether your features match the released ones.
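A quick way to run that probe is to load your features and the released features for the same video and compare them numerically. This is only a sketch: the function names and the assumption that both feature sets can be loaded as `(num_clips, dim)` NumPy arrays are mine, not part of the HERO or CONQUER codebases.

```python
import numpy as np


def compare_features(mine: np.ndarray, released: np.ndarray) -> dict:
    """Compare two (num_clips, dim) feature matrices for the same video."""
    if mine.shape != released.shape:
        # A shape mismatch already points to a different clip_len or
        # a different concatenation order.
        return {"shape_match": False}
    # Per-clip cosine similarity plus the largest element-wise gap.
    dot = (mine * released).sum(axis=1)
    denom = np.linalg.norm(mine, axis=1) * np.linalg.norm(released, axis=1) + 1e-8
    cosine = dot / denom
    return {
        "shape_match": True,
        "mean_cosine": float(cosine.mean()),
        "max_abs_diff": float(np.abs(mine - released).max()),
    }
```

If the shapes match but the mean cosine similarity is noticeably below 1.0, the difference most likely comes from normalization or from the checkpoint/preprocessing used during extraction.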
Hi, thanks for your awesome work! I have recently been working on the VCMR task with this codebase. I downloaded tvr_feature_release.tar.gz following readme.md and it worked well: I got a VCMR result (R@1, IoU=0.7) of 7.6. However, I have trouble reproducing the metrics with features extracted by myself. I extracted SlowFast + ResNet features of the TVQA raw videos using the code in HERO_Video_Feature_Extractor, concatenated them to get a D=4352 visual feature, and trained CONQUER on it, but I can only reach a VCMR R@1 of 6.6. I used the SlowFast checkpoint downloaded Here and the ResNet-152 checkpoint from torchvision, with clip_len=3/2. I carefully examined the feature-extraction procedure and found no mistakes. So I wonder: do you apply any postprocessing to the features extracted by HERO? Or can you suggest any possible reasons for this gap? Thanks!
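For reference, the D=4352 figure is consistent with concatenating a 2304-d SlowFast feature with a 2048-d ResNet-152 feature per clip. Below is a minimal sketch of that fusion step, with per-stream L2 normalization before concatenation; the function name and the normalization choice are my assumptions for illustration, not necessarily what the HERO extractor does, so check its actual code for the exact order of operations.

```python
import numpy as np


def fuse_clip_features(slowfast: np.ndarray, resnet: np.ndarray) -> np.ndarray:
    """L2-normalize each stream per clip, then concatenate along the feature axis.

    slowfast: (num_clips, 2304), resnet: (num_clips, 2048) -> (num_clips, 4352)
    """
    sf = slowfast / (np.linalg.norm(slowfast, axis=1, keepdims=True) + 1e-8)
    rn = resnet / (np.linalg.norm(resnet, axis=1, keepdims=True) + 1e-8)
    return np.concatenate([sf, rn], axis=1)  # 2304 + 2048 = 4352
```

Whether each stream is normalized before or after concatenation (or at all) changes the relative scale of the two streams, which can easily account for a ~1-point metric gap.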