v-iashin / BMT

Source code for "Bi-modal Transformer for Dense Video Captioning" (BMVC 2020)
https://v-iashin.github.io/bmt
MIT License

Unexpected "UNK" captions with single video prediction #62

Open MarcosRodrigoT opened 10 months ago

MarcosRodrigoT commented 10 months ago

Hello, Vladimir.

First of all, congratulations on such a fantastic project. I was introduced to this work through the many other papers that cite it and build on it. I enjoyed your video presentation, and I think you are doing a very good job of keeping up with all the repo issues.

I ran the sample code single_video_prediction.py on the given example (women_long_jump.mp4) without major issues (I had to change the CUDA and PyTorch versions in the conda environment, as reported in https://github.com/v-iashin/BMT/issues/45).

However, when I tried the code on a custom video, let's call it my_video.mp4, I got some errors.

VGGish was unable to extract a .wav file from the video because the audio was not encoded with the aac codec (ffprobe my_video.mp4 showed that the audio uses the opus codec instead). I therefore replaced these two lines in BMT/submodules/video_features/models/vggish/utils/utils.py with the following, which resolved the issue:

mp4_to_acc = f'{which_ffmpeg()} -hide_banner -loglevel panic -y -i {video_path} {audio_aac_path}'
aac_to_wav = f'{which_ffmpeg()} -hide_banner -loglevel panic -y -i {video_path} {audio_wav_path}'
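
The manual codec check with ffprobe can also be scripted. The sketch below is only an illustration: `build_probe_cmd`, `parse_codec`, and `audio_codec` are hypothetical helper names that are not part of BMT or video_features, and it assumes `ffprobe` is available on the PATH.

```python
import subprocess
from typing import List, Optional


def build_probe_cmd(video_path: str) -> List[str]:
    """Build an ffprobe call that prints only the codec name of the first audio stream."""
    return [
        'ffprobe', '-v', 'error',
        '-select_streams', 'a:0',
        '-show_entries', 'stream=codec_name',
        '-of', 'default=noprint_wrappers=1:nokey=1',
        video_path,
    ]


def parse_codec(ffprobe_stdout: str) -> Optional[str]:
    """Return the codec name, or None when ffprobe printed nothing (no audio stream)."""
    lines = [line.strip() for line in ffprobe_stdout.splitlines() if line.strip()]
    return lines[0] if lines else None


def audio_codec(video_path: str) -> Optional[str]:
    """Run ffprobe (assumed to be installed) and return the audio codec name."""
    result = subprocess.run(build_probe_cmd(video_path), stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, check=True)
    return parse_codec(result.stdout.decode('utf-8'))
```

For the video in this issue, `audio_codec('my_video.mp4')` would be expected to report the opus codec, matching the manual check.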

After obtaining the i3d and vggish features I tried running BMT on the video using the following command:

python ./sample/single_video_prediction.py \
--prop_generator_model_path ./sample/best_prop_model.pt \
--pretrained_cap_model_path ./sample/best_cap_model.pt \
--vggish_features_path ./sample/my_video_vggish.npy \
--rgb_features_path ./sample/my_video_rgb.npy \
--flow_features_path ./sample/my_video_flow.npy \
--duration_in_secs 148.121 \
--device_id 0 \
--max_prop_per_vid 100 \
--nms_tiou_thresh 0.4

Obtaining:

Contructing caption_iterator for "train" phase
Using vanilla Generator
initialization: xavier
Glove emb of the same size as d_model_caps
Pretrained caption path:
 ./sample/best_cap_model.pt
Traceback (most recent call last):
  File "./sample/single_video_prediction.py", line 313, in <module>
    cap_model, feature_paths, train_dataset, cap_cfg, args.device_id, proposals, args.duration_in_secs
  File "./sample/single_video_prediction.py", line 219, in caption_proposals
    for start, end, conf in proposals.squeeze():
  File "/home/mrt/miniconda3/envs/bmt/lib/python3.7/site-packages/torch/tensor.py", line 456, in __iter__
    raise TypeError('iteration over a 0-d tensor')
TypeError: iteration over a 0-d tensor
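
The traceback is consistent with `squeeze()` removing more dimensions than intended when only a single proposal is left after NMS. This is an interpretation, not something confirmed in the thread: it assumes `proposals` has shape (1, N, 3). To keep the sketch runnable without PyTorch, nested Python lists with made-up values stand in for tensors; `squeeze_all` mimics `tensor.squeeze()` and `squeeze_first` mimics `tensor.squeeze(0)`.

```python
def squeeze_all(x):
    """Drop every length-1 dimension of a nested list (stand-in for tensor.squeeze())."""
    if not isinstance(x, list):
        return x
    items = [squeeze_all(item) for item in x]
    return items[0] if len(items) == 1 else items


def squeeze_first(x):
    """Drop only a leading length-1 batch dimension (stand-in for tensor.squeeze(0))."""
    return x[0] if len(x) == 1 else x


def unpack_rows(rows):
    """Unpack (start, end, conf) triples the way the captioning loop does."""
    return [(start, end, conf) for start, end, conf in rows]


# Hypothetical proposals; the numbers are illustrative, not taken from the model.
many = [[[0.0, 10.0, 0.9], [5.0, 20.0, 0.8]]]   # shape (1, 2, 3)
single = [[[0.0, 148.1, 0.9]]]                   # shape (1, 1, 3)

print(unpack_rows(squeeze_all(many)))     # works: two rows remain after squeezing

try:
    unpack_rows(squeeze_all(single))      # the row dimension was squeezed away too
except TypeError as err:
    print('fails with a single proposal:', err)

print(unpack_rows(squeeze_first(single)))  # works: only the batch dimension is dropped
```

If this assumption holds, squeezing only the batch dimension would keep the loop working regardless of how many proposals remain.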

Since it was iterating over a 0-d tensor, I removed the NMS argument and ran it again with:

python ./sample/single_video_prediction.py \
--prop_generator_model_path ./sample/best_prop_model.pt \
--pretrained_cap_model_path ./sample/best_cap_model.pt \
--vggish_features_path ./sample/my_video_vggish.npy \
--rgb_features_path ./sample/my_video_rgb.npy \
--flow_features_path ./sample/my_video_flow.npy \
--duration_in_secs 148.121 \
--device_id 0 \
--max_prop_per_vid 100

Obtaining a list of sentences with the token "UNK":

Contructing caption_iterator for "train" phase
Using vanilla Generator
initialization: xavier
Glove emb of the same size as d_model_caps
Pretrained caption path:
 ./sample/best_cap_model.pt
[{'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '},
 ... (the same entry is repeated for every remaining proposal) ...]

I am a bit at a loss here, as I do not have much experience working with text and audio (only with image and video). Could you point me in the right direction? I am unsure of what the root cause might be. I suspect it could be one of the following:

Desktop (please complete the following information):

Your conda environment

# packages in environment at /home/mrt/miniconda3/envs/bmt:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main    conda-forge
_pytorch_select           0.2                       gpu_0    anaconda
absl-py                   0.8.1                    py37_0    conda-forge
asn1crypto                1.3.0                    py37_0    conda-forge
blas                      1.0                         mkl    conda-forge
ca-certificates           2020.1.1                      0    anaconda
certifi                   2020.4.5.1               py37_0    anaconda
cffi                      1.14.0           py37h2e261b9_0    anaconda
chardet                   3.0.4                 py37_1003    conda-forge
cryptography              2.8              py37h1ba5d50_0    anaconda
cudatoolkit               10.1.243             h6bb024c_0    anaconda
cudnn                     7.6.5.32             hc0a50b0_1    conda-forge
cymem                     1.31.2           py37h6bb024c_0    anaconda
cytoolz                   0.9.0.1          py37h14c3975_1    anaconda
dill                      0.2.9                    py37_0    conda-forge
en-core-web-sm            2.0.0                    pypi_0    pypi
future                    0.17.1                   py37_0    anaconda
idna                      2.9                        py_1    conda-forge
intel-openmp              2020.0                      166    anaconda
joblib                    0.14.1                     py_0    conda-forge
ld_impl_linux-64          2.33.1               h53a641e_7    conda-forge
libedit                   3.1.20181209         hc058e9b_0    anaconda
libffi                    3.2.1                hd88cf55_4
libgcc-ng                 9.1.0                hdf63c60_0    anaconda
libgfortran-ng            7.3.0                hdf63c60_0    anaconda
libprotobuf               3.11.4               h8b12597_0    conda-forge
libstdcxx-ng              9.1.0                hdf63c60_0    anaconda
markdown                  3.2.1                      py_0    conda-forge
mkl                       2020.0                      166    anaconda
mkl-service               2.3.0            py37he904b0f_0
mkl_fft                   1.0.15           py37ha843d7b_0
mkl_random                1.1.0            py37hd6b4f25_0
msgpack-numpy             0.4.4.3                    py_0    conda-forge
msgpack-python            0.5.6            py37h6bb024c_1    anaconda
murmurhash                0.28.0           py37hf484d3e_0    anaconda
ncurses                   6.2                  he6710b0_1    anaconda
ninja                     1.9.0            py37hfd86e86_0    anaconda
numpy                     1.15.4           py37h7e9f1db_0
numpy-base                1.15.4           py37hde5b4d6_0
openjdk                   8.0.152              h7b6447c_3    anaconda
openssl                   1.1.1g               h7b6447c_0    anaconda
pandas                    0.24.2           py37he6710b0_0    anaconda
pip                       20.0.2                   py37_1    conda-forge
plac                      0.9.6                    py37_0    anaconda
preshed                   1.0.1            py37he6710b0_0    anaconda
protobuf                  3.11.4           py37h3340039_1    conda-forge
pycparser                 2.20                       py_0    conda-forge
pyopenssl                 19.1.0                   py37_0    conda-forge
pysocks                   1.7.1                    py37_0    conda-forge
python                    3.7.7           hcf32534_0_cpython    anaconda
python-dateutil           2.8.1                      py_0    conda-forge
python_abi                3.7                     1_cp37m    conda-forge
pytorch                   1.4.0           cuda101py37h02f0884_0
pytz                      2020.1                     py_0    anaconda
readline                  8.0                  h7b6447c_0    anaconda
regex                     2018.07.11       py37h14c3975_0    anaconda
requests                  2.23.0                   py37_0    conda-forge
scikit-learn              0.22.1           py37hd81dba3_0
scipy                     1.3.1            py37h7c811a0_0
setuptools                46.1.3                   py37_0    anaconda
six                       1.14.0                   py37_0    conda-forge
spacy                     2.0.12           py37h962f231_0    anaconda
sqlite                    3.31.1               h62c20be_1    anaconda
tensorboard               1.14.0                   py37_0    conda-forge
termcolor                 1.1.0                    py37_1    anaconda
thinc                     6.10.3           py37h962f231_0    anaconda
tk                        8.6.8                hbc83047_0    anaconda
toolz                     0.10.0                     py_0    conda-forge
torchtext                 0.3.1                    pypi_0    pypi
tqdm                      4.46.0                     py_0    anaconda
ujson                     2.0.3            py37he6710b0_0    anaconda
urllib3                   1.25.8                   py37_0    anaconda
werkzeug                  1.0.1              pyh9f0ad1d_0    conda-forge
wheel                     0.34.2                   py37_0    conda-forge
wrapt                     1.10.11          py37h14c3975_2    anaconda
xz                        5.2.5                h7b6447c_0    anaconda
zlib                      1.2.11               h7b6447c_3    anaconda

v-iashin commented 10 months ago

Hi. Many thanks for such an elaborate issue description.

I noticed that it fails to install the correct packages with anaconda. Would it be possible for you to try it with miniconda?

Could you try not to skip the aac transcoding?
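
One way to keep the two-stage pipeline for a video whose audio is not AAC is to re-encode the audio to AAC in the first stage rather than skipping it. This is a sketch under assumptions: the repository's original two commands are not quoted in this thread, so the flags below (`-vn`, `-c:a aac`) are standard ffmpeg options chosen for illustration, and `two_stage_commands` is a hypothetical helper, not a function from video_features.

```python
import os
from typing import List, Tuple


def two_stage_commands(video_path: str, tmp_path: str,
                       ffmpeg: str = 'ffmpeg') -> Tuple[List[str], List[str]]:
    """Build the .mp4 -> .aac and .aac -> .wav ffmpeg commands as argument lists."""
    name = os.path.splitext(os.path.basename(video_path))[0]
    aac_path = os.path.join(tmp_path, f'{name}.aac')
    wav_path = os.path.join(tmp_path, f'{name}.wav')
    common = [ffmpeg, '-hide_banner', '-loglevel', 'panic', '-y']
    # Stage 1: drop the video stream and re-encode the audio (e.g. opus) to AAC.
    mp4_to_aac = common + ['-i', video_path, '-vn', '-c:a', 'aac', aac_path]
    # Stage 2: decode the intermediate AAC file to WAV.
    aac_to_wav = common + ['-i', aac_path, wav_path]
    return mp4_to_aac, aac_to_wav
```

Each returned list can be passed to `subprocess.call` in order. Note that the second command reads the intermediate .aac file, not the original video.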

Most likely the issue is that the video is encoded with a different codec. I think you need to look into this, because it worked for the example video.

Start by checking if the audio you give to vggish is playable and you can hear the sound as you expect it.

MarcosRodrigoT commented 10 months ago

Hi, thank you very much for your prompt response.

I did create the environment using miniconda.

I created a minimal code snippet to extract the .wav files using my modified version and yours (both run in the vggish conda environment). My modified version:

import os
import subprocess

def which_ffmpeg() -> str:
    '''Determines the path to ffmpeg library

    Returns:
        str -- path to the library
    '''
    result = subprocess.run(['which', 'ffmpeg'], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    ffmpeg_path = result.stdout.decode('utf-8').replace('\n', '')
    return ffmpeg_path

def extract_wav_from_mp4(video_path: str, tmp_path: str) -> str:
    '''Extracts .wav file directly from .mp4
    Modified version: the intermediate .aac stage is skipped (.mp4 -> .wav)

    Args:
        video_path (str): Path to a video
        tmp_path (str): Directory where the .wav file will be saved

    Returns:
        str -- path to the .wav audio
    '''
    assert which_ffmpeg() != '', 'Is ffmpeg installed? Check if the conda environment is activated.'
    assert video_path.endswith('.mp4'), 'The file does not end with .mp4. Comment this if expected'

    # extract video filename from the video_path
    video_filename = os.path.split(video_path)[-1].replace('.mp4', '')

    # the temp files will be saved in `tmp_path` with the same name
    audio_wav_path = os.path.join(tmp_path, f'{video_filename}.wav')

    # constructing shell commands and calling them
    mp4_to_wav = f'{which_ffmpeg()} -hide_banner -loglevel panic -y -i {video_path} {audio_wav_path}'
    subprocess.call(mp4_to_wav.split())

    return audio_wav_path

# extract audio files from .mp4
extract_wav_from_mp4('/home/mrt/Projects/BMT/sample/women_long_jump.mp4', '/home/mrt/Projects/BMT/sample')
extract_wav_from_mp4('/home/mrt/Projects/BMT/sample/my_video.mp4', '/home/mrt/Projects/BMT/sample')

Running this code snippet does in fact create women_long_jump.wav and my_video.wav, and both audio files are playable and sound as expected. Your original version:

import os
import subprocess

def which_ffmpeg() -> str:
    '''Determines the path to ffmpeg library

    Returns:
        str -- path to the library
    '''
    result = subprocess.run(['which', 'ffmpeg'], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    ffmpeg_path = result.stdout.decode('utf-8').replace('\n', '')
    return ffmpeg_path

def extract_wav_from_mp4(video_path: str, tmp_path: str) -> str:
    '''Extracts .wav file from .aac which is extracted from .mp4
    We cannot convert .mp4 to .wav directly. For this we do it in two stages: .mp4 -> .aac -> .wav

    Args:
        video_path (str): Path to a video
        audio_path_wo_ext (str):

    Returns:
        [str, str] -- path to the .wav and .aac audio
    '''
    assert which_ffmpeg() != '', 'Is ffmpeg installed? Check if the conda environment is activated.'
    assert video_path.endswith('.mp4'), 'The file does not end with .mp4. Comment this if expected'

    # extract video filename from the video_path
    video_filename = os.path.split(video_path)[-1].replace('.mp4', '')

    # the temp files will be saved in `tmp_path` with the same name
    audio_aac_path = os.path.join(tmp_path, f'{video_filename}.aac')
    audio_wav_path = os.path.join(tmp_path, f'{video_filename}.wav')

    # constructing shell commands and calling them
    mp4_to_acc = f'{which_ffmpeg()} -hide_banner -loglevel panic -y -i {video_path} -acodec copy {audio_aac_path}'
    aac_to_wav = f'{which_ffmpeg()} -hide_banner -loglevel panic -y -i {audio_aac_path} {audio_wav_path}'
    subprocess.call(mp4_to_acc.split())
    subprocess.call(aac_to_wav.split())

    return audio_wav_path, audio_aac_path

# extract audio files from .mp4
audio_wav_path, audio_aac_path = extract_wav_from_mp4('/home/mrt/Projects/BMT/sample/women_long_jump.mp4', '/home/mrt/Projects/BMT/sample')
audio_wav_path, audio_aac_path = extract_wav_from_mp4('/home/mrt/Projects/BMT/sample/my_video.mp4', '/home/mrt/Projects/BMT/sample')

Running this code snippet produces women_long_jump.aac, women_long_jump.wav, and my_video.aac, but it does not create the expected my_video.wav.
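To check the extracted files programmatically rather than by ear, something like the following could be used. It is only a sketch: `describe_wav` is a hypothetical helper (not part of BMT or video_features), and it assumes the .wav files are plain PCM, which is what ffmpeg writes by default for a .wav output.

```python
import os
import wave
from typing import Dict, Optional


def describe_wav(path: str) -> Optional[Dict[str, float]]:
    '''Returns basic properties of a PCM .wav file, or None if it is missing or unreadable.'''
    # a missing or empty file means the extraction step silently failed
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return None
    try:
        with wave.open(path, 'rb') as wav:
            rate = wav.getframerate()
            frames = wav.getnframes()
            channels = wav.getnchannels()
    except wave.Error:
        # not a PCM .wav file that the standard library can parse
        return None
    if rate == 0:
        return None
    return {'channels': channels, 'sample_rate': rate, 'duration_s': frames / rate}
```

Calling `describe_wav` on each expected output (for example my_video.wav) right after extraction would flag the missing file before any features are computed from it.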

The content returned by running ffprobe on each file is the following:

women_long_jump.mp4:

Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'women_long_jump.mp4':
  Metadata:
    major_brand     : mp42
    minor_version   : 0
    compatible_brands: isommp42
    creation_time   : 2018-05-06T18:03:25.000000Z
  Duration: 00:00:35.16, start: 0.000000, bitrate: 535 kb/s
  Stream #0:0(und): Video: h264 (Constrained Baseline) (avc1 / 0x31637661), yuv420p, 480x360 [SAR 1:1 DAR 4:3], 437 kb/s, 24.83 fps, 24.83 tbr, 10900 tbn, 49.66 tbc (default)
    Metadata:
      creation_time   : 2018-05-06T18:03:25.000000Z
      handler_name    : ISO Media file produced by Google Inc. Created on: 05/06/2018.
      vendor_id       : [0][0][0][0]
  Stream #0:1(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 95 kb/s (default)
    Metadata:
      creation_time   : 2018-05-06T18:03:25.000000Z
      handler_name    : ISO Media file produced by Google Inc. Created on: 05/06/2018.
      vendor_id       : [0][0][0][0]

my_video.mp4:

Input #0, matroska,webm, from 'my_video.mp4':
  Metadata:
    ENCODER         : Lavf58.76.100
  Duration: 00:02:28.12, start: -0.007000, bitrate: 4237 kb/s
  Stream #0:0(eng): Video: vp9 (Profile 2), yuv420p10le(tv, bt2020nc/bt2020/arib-std-b67), 1920x1080, SAR 1:1 DAR 16:9, 29.97 fps, 29.97 tbr, 1k tbn, 1k tbc (default)
    Metadata:
      DURATION        : 00:02:28.081000000
    Side data:
      Mastering Display Metadata, has_primaries:1 has_luminance:1 r(0.6800,0.3200) g(0.2650,0.6900) b(0.1500 0.0600) wp(0.3127, 0.3290) min_luminance=0.005000, max_luminance=1000.000000
  Stream #0:1(eng): Audio: opus, 48000 Hz, stereo, fltp (default)
    Metadata:
      DURATION        : 00:02:28.121000000

women_long_jump.aac:

[aac @ 0x56196a07dec0] Estimating duration from bitrate, this may be inaccurate
Input #0, aac, from 'women_long_jump.aac':
  Duration: 00:00:34.78, bitrate: 99 kb/s
  Stream #0:0: Audio: aac (LC), 44100 Hz, stereo, fltp, 99 kb/s

my_video.aac:

[aac @ 0x55ca3de54ec0] Format aac detected only with low score of 1, misdetection possible!
my_video.aac: End of file

If I understand it correctly, your line of code expects a video containing an audio stream that uses the aac codec, and if it doesn't, it fails (or rather, it creates an .aac file with gibberish inside). However, I still believe you can extract a .wav file directly from the .mp4 file: as stated above, this worked for me and created .wav files that were playable and sounded as expected (I could create a PR if you consider it appropriate).
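A check like the one below could detect this situation before extraction. It is a sketch, not code from the repository: the function names are hypothetical, and it assumes ffprobe is on the PATH and is called with JSON output (`-show_streams -of json`).

```python
import json
import subprocess
from typing import Optional


def audio_codec_from_ffprobe_json(ffprobe_json: str) -> Optional[str]:
    '''Returns the codec name of the first audio stream, or None if there is no audio.'''
    info = json.loads(ffprobe_json)
    for stream in info.get('streams', []):
        if stream.get('codec_type') == 'audio':
            return stream.get('codec_name')
    return None


def probe_audio_codec(video_path: str) -> Optional[str]:
    '''Runs ffprobe on the file and returns its audio codec (e.g. "aac" or "opus").'''
    cmd = ['ffprobe', '-v', 'error', '-show_streams', '-of', 'json', video_path]
    result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=True)
    return audio_codec_from_ffprobe_json(result.stdout.decode('utf-8'))


def can_stream_copy_to_aac(codec: Optional[str]) -> bool:
    '''Copying the audio stream into an .aac file only makes sense if it is already aac.'''
    return codec == 'aac'
```

For the two files discussed here, `probe_audio_codec` should report aac for women_long_jump.mp4 and opus for my_video.mp4, matching the ffprobe output below.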

I will try to convert my_video.mp4 to exactly the same format as women_long_jump.mp4 and see if it works that way. But I do not see why your code would not work for the .wav files I extracted. Could it be something else? I see your point that it must be something with the raw data that is fed to the network, but with my limited knowledge I can't see a reason why it fails with my .wav files.

v-iashin commented 10 months ago

i think, the problem is with the video you are trying to use and yes it should work for any wav file. maybe your video is out of the domain of training videos.

If I understand it correctly, your line of code expects a

this line of code expects the video to be mp4, then it extracts whatever the audio is encoded in and transcodes it to aac. it could be that your ffmpeg does not support transcoding to aac.

try to do the same on google colab or some other machine. if the ffmpeg can't transcode from x to aac, the installation does not support this codec.

are you sure your mp4 file is not .mkv?

MarcosRodrigoT commented 10 months ago

Thank you for the pointers!

I will try it on another machine/environment and see if another ffmpeg version supports the transcoding.

You are right in that the video was not a .mp4 file originally. The original video was downloaded with yt-dlp, which resulted in a .webm file. After discussing it with a more experienced colleague, I was told that it could be converted to an .mp4 container without issues. However, I remain unsure whether this could be the problem with the UNK tokens I was obtaining, as I was indeed able to extract a playable .wav file from it, and did not face any issue extracting i3d features.

I will work on your suggestions and let you know if they resolve the issue. Thank you very much for your time and consideration!

v-iashin commented 10 months ago

may i ask you to try transcoding my video into your format (vp9/opus etc.) and repeat your steps? do you get the same result?

if you are using youtube-dl, try to get a video with h264 and aac codecs and run on it

v-iashin commented 10 months ago

also, i realized that you use

https://github.com/v-iashin/video_features/blob/662ec51caf591e76724237f0454bdf7735a8dcb1/models/vggish/utils/utils.py#L28

which simply copies the codec (opus, instead of aac) for audio. can you specify aac there as suggested in https://github.com/v-iashin/BMT/issues/38?
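For reference, here is a sketch of what the two commands could look like with the audio re-encoded instead of copied. It assumes the suggestion amounts to replacing `-acodec copy` with `-acodec aac`; `build_audio_commands` is a hypothetical helper, and the commands are built as argument lists so that paths containing spaces survive.

```python
import os
from typing import List, Tuple


def build_audio_commands(ffmpeg: str, video_path: str, tmp_path: str) -> Tuple[List[str], List[str]]:
    '''Builds the .mp4 -> .aac and .aac -> .wav ffmpeg commands as argument lists.'''
    video_filename = os.path.splitext(os.path.basename(video_path))[0]
    audio_aac_path = os.path.join(tmp_path, f'{video_filename}.aac')
    audio_wav_path = os.path.join(tmp_path, f'{video_filename}.wav')

    common = [ffmpeg, '-hide_banner', '-loglevel', 'panic', '-y']
    # `-acodec aac` re-encodes the audio, so opus (or any other codec) ends up as real aac
    mp4_to_aac = common + ['-i', video_path, '-acodec', 'aac', audio_aac_path]
    aac_to_wav = common + ['-i', audio_aac_path, audio_wav_path]
    return mp4_to_aac, aac_to_wav
```

Each list can be passed to `subprocess.call` as is; checking that the return code is 0 would surface a failed transcoding immediately instead of leaving a broken .aac file behind.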

MarcosRodrigoT commented 10 months ago

I changed line 28 as suggested in https://github.com/v-iashin/BMT/issues/38 and that seems to resolve the issue with extracting the appropriate .aac.

[aac @ 0x55ca3de54ec0] Format aac detected only with low score of 1, misdetection possible!
my_video.aac: End of file
[aac @ 0x5570d77c2ec0] Estimating duration from bitrate, this may be inaccurate
Input #0, aac, from 'my_video.aac':
  Duration: 00:02:32.99, bitrate: 127 kb/s
  Stream #0:0: Audio: aac (LC), 48000 Hz, stereo, fltp, 127 kb/s

Unfortunately, I won't be able to try your other suggestions until Monday. I will update you once I do.

Have a great weekend!

v-iashin commented 10 months ago

sure, have a great weekend.

did you try to run the prediction script where you were getting unks?

MarcosRodrigoT commented 10 months ago

Unfortunately, I was not able to get that far today. I was able to download the video with an h264 video codec using yt-dlp -S vcodec:h264 <url> and extract vggish features from it. However, I was unable to download the video with an aac audio codec (only opus and mp4a are available when running yt-dlp -F --list-formats <url>).

On Monday I will run the prediction script and try to get it to work. Thank you very much once again for your time and consideration, Vladimir.

MarcosRodrigoT commented 10 months ago

Hello, Vladimir.

I tried converting my_video.webm to exactly the same format as women_long_jump.mp4 with the following command:

ffmpeg -i my_video.webm -c:v libx264 -c:a aac -b:a 160k -crf 20 -preset slow -vf format=yuv420p -movflags +faststart my_video.mp4

This resulted in a my_video.mp4 file with the same container and codecs (h264/aac) as women_long_jump.mp4.

> ffprobe my_video.mp4

Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'my_video.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2avc1mp41
    encoder         : Lavf58.76.100
  Duration: 00:02:28.10, start: 0.000000, bitrate: 5550 kb/s
  Stream #0:0(eng): Video: h264 (High) (avc1 / 0x31637661), yuv420p(tv, bt2020nc/bt2020/arib-std-b67), 1920x1080 [SAR 1:1 DAR 16:9], 5382 kb/s, 29.97 fps, 29.97 tbr, 30k tbn, 59.94 tbc (default)
    Metadata:
      handler_name    : VideoHandler
      vendor_id       : [0][0][0][0]
    Side data:
      Mastering Display Metadata, has_primaries:1 has_luminance:1 r(0.6800,0.3200) g(0.2650,0.6900) b(0.1500 0.0600) wp(0.3127, 0.3290) min_luminance=0.005000, max_luminance=1000.000000
  Stream #0:1(eng): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 160 kb/s (default)
    Metadata:
      handler_name    : SoundHandler
      vendor_id       : [0][0][0][0]

However, extracting vggish and i3d features from it and running BMT/sample/single_video_prediction.py still resulted in UNK tokens:

python ./sample/single_video_prediction.py --prop_generator_model_path ./sample/best_prop_model.pt --pretrained_cap_model_path ./sample/best_cap_model.pt --vggish_features_path ./sample/my_video_vggish.npy --rgb_features_path ./sample/my_video_rgb.npy --flow_features_path ./sample/my_video_flow.npy --duration_in_secs 148.121 --device_id 0 --max_prop_per_vid 100
Contructing caption_iterator for "train" phase
Using vanilla Generator
initialization: xavier
Glove emb of the same size as d_model_caps
Pretrained caption path:
 ./sample/best_cap_model.pt
[{'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   
unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk  
 unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk 
  unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   
unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk  
 unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk 
  unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' 
unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 
'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk 
  unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   
unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk  
 unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}, {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '}]

I then tried the opposite: converting women_long_jump.mp4 to the exact same format as my_video.webm with:

ffmpeg -i women_long_jump.mp4 -c:v libvpx-vp9 -c:a libopus -b:v 0 -crf 20 women_long_jump_transcoded.webm

After transcoding, I simply renamed the file from women_long_jump_transcoded.webm to women_long_jump_transcoded.mp4, because some assert statements in the code check for the .mp4 extension. The resulting file:

> ffprobe women_long_jump_transcoded.mp4

Input #0, matroska,webm, from 'women_long_jump_transcoded.mp4':
  Metadata:
    COMPATIBLE_BRANDS: isommp42
    MAJOR_BRAND     : mp42
    MINOR_VERSION   : 0
    ENCODER         : Lavf58.76.100
  Duration: 00:00:35.16, start: -0.007000, bitrate: 697 kb/s
  Stream #0:0: Video: vp9 (Profile 0), yuv420p(tv, progressive), 480x360, SAR 1:1 DAR 4:3, 24.83 fps, 24.83 tbr, 1k tbn, 1k tbc (default)
    Metadata:
      HANDLER_NAME    : ISO Media file produced by Google Inc. Created on: 05/06/2018.
      VENDOR_ID       : [0][0][0][0]
      ENCODER         : Lavc58.134.100 libvpx-vp9
      DURATION        : 00:00:35.086000000
  Stream #0:1: Audio: opus, 48000 Hz, stereo, fltp (default)
    Metadata:
      HANDLER_NAME    : ISO Media file produced by Google Inc. Created on: 05/06/2018.
      VENDOR_ID       : [0][0][0][0]
      ENCODER         : Lavc58.134.100 libopus
      DURATION        : 00:00:35.163000000

I extracted vggish and i3d features from it and ran BMT/sample/single_video_prediction.py, which returned:

python ./sample/single_video_prediction.py \
--prop_generator_model_path ./sample/best_prop_model.pt \
--pretrained_cap_model_path ./sample/best_cap_model.pt \
--vggish_features_path ./sample/women_long_jump_transcoded_vggish.npy \
--rgb_features_path ./sample/women_long_jump_transcoded_rgb.npy \
--flow_features_path ./sample/women_long_jump_transcoded_flow.npy \
--duration_in_secs 35.163 \
--device_id 0 \
--max_prop_per_vid 100
Contructing caption_iterator for "train" phase
Using vanilla Generator
initialization: xavier
Glove emb of the same size as d_model_caps
Pretrained caption path:
 ./sample/best_cap_model.pt
[{'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 
35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The 
man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk 
around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and 
down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, 
{'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}]

Seeing these results, I am more inclined to believe that my custom video is indeed outside the domain your networks were trained on. However, it shows people, a track-like floor, and other things that I would guess are similar to what the networks saw during training. It does not seem to be that different from women_long_jump.mp4.

Can you see any other reason why this might be?

EDIT

I noticed that the dense captions generated for women_long_jump_transcoded.mp4 were all identical (i.e., 'The man continues to walk around the area and down the area'), and different from the ones I got when running BMT/sample/single_video_prediction.py on the original women_long_jump.mp4 file. So maybe something else is going on besides the domain.
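
To make the comparison less eyeball-driven, here is a minimal sketch for summarizing a prediction list. It assumes the output has the shape printed above (a list of dicts with 'start', 'end', and 'sentence' keys); `summarize_predictions` is a helper name I made up and is not part of BMT. The two sample lists are shortened stand-ins for the outputs above, not the full results.

```python
from collections import Counter


def summarize_predictions(predictions):
    """Summarize a list of {'start', 'end', 'sentence'} dicts.

    Hypothetical helper (not part of BMT): counts distinct captions and
    distinct (start, end) segments, and measures how many tokens are 'unk'.
    """
    sentences = Counter(p['sentence'].strip() for p in predictions)
    segments = Counter((p['start'], p['end']) for p in predictions)
    tokens = [t for p in predictions for t in p['sentence'].lower().split()]
    unk_ratio = sum(t == 'unk' for t in tokens) / len(tokens) if tokens else 0.0
    return {
        'num_predictions': len(predictions),
        'distinct_sentences': len(sentences),
        'distinct_segments': len(segments),
        'unk_token_ratio': unk_ratio,
    }


# Shortened stand-ins for the two outputs above (assumed, for illustration).
transcoded = [{'start': 0.0, 'end': 35.2,
               'sentence': 'The man continues to walk around the area and down the area'}] * 3
custom = [{'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk '}] * 3

print(summarize_predictions(transcoded))
print(summarize_predictions(custom))
```

In both runs every proposal spans the whole video and carries the same caption, so the proposal stage looks degenerate too, not only the captioning stage.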