Open MarcosRodrigoT opened 10 months ago
Hi. Many thanks for such an elaborate issue description.
I noticed that it fails to install the correct packages with anaconda. Would it be possible for you to try it with miniconda?
Could you try not to skip the aac transcoding?
Most likely the issue is that the video is encoded with a different codec. I think you need to look into this, because it worked for the example video.
Start by checking if the audio you give to vggish is playable and you can hear the sound as you expect it.
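Beyond listening by ear, the file can also be inspected programmatically. A minimal sketch, assuming the .wav holds uncompressed PCM (which is what ffmpeg writes to .wav by default) and using only Python's standard wave module; describe_wav is a hypothetical helper, not part of BMT or vggish:

```python
import wave


def describe_wav(path: str) -> dict:
    """Summarize a PCM .wav file (hypothetical helper, not part of BMT or vggish)."""
    with wave.open(path, 'rb') as wav:
        n_frames = wav.getnframes()
        sample_rate = wav.getframerate()
        info = {
            'channels': wav.getnchannels(),
            'sample_width_bytes': wav.getsampwidth(),
            'sample_rate': sample_rate,
            'duration_s': n_frames / sample_rate,
        }
        # A file that decodes but contains only zero bytes is digital silence
        # (true for signed 16-bit PCM; 8-bit WAV is unsigned and centers on 128).
        info['silent'] = not any(wav.readframes(n_frames))
    return info


# Example (path is illustrative):
# print(describe_wav('/path/to/my_video.wav'))
```

A duration close to zero or silent=True would point at the extraction step rather than at the model.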
Hi, thank you very much for your prompt response.
I did create the environment using miniconda.
I created a minimal code snippet to extract the .wav files using my modified version and yours (both run in the conda env vggish).
My modified version, which extracts the .wav directly from the mp4:
import os
import subprocess
def which_ffmpeg() -> str:
    '''Determines the path to ffmpeg library
    Returns:
        str -- path to the library
    '''
    result = subprocess.run(['which', 'ffmpeg'], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    ffmpeg_path = result.stdout.decode('utf-8').replace('\n', '')
    return ffmpeg_path
def extract_wav_from_mp4(video_path: str, tmp_path: str) -> str:
    '''Extracts a .wav file directly from the .mp4, skipping the intermediate .aac
    Args:
        video_path (str): Path to a video
        tmp_path (str): Directory where the .wav file is saved
    Returns:
        str -- path to the .wav audio
    '''
    assert which_ffmpeg() != '', 'Is ffmpeg installed? Check if the conda environment is activated.'
    assert video_path.endswith('.mp4'), 'The file does not end with .mp4. Comment this if expected'
    # extract video filename from the video_path
    video_filename = os.path.split(video_path)[-1].replace('.mp4', '')
    # the temp files will be saved in `tmp_path` with the same name
    audio_wav_path = os.path.join(tmp_path, f'{video_filename}.wav')
    # constructing shell commands and calling them
    mp4_to_wav = f'{which_ffmpeg()} -hide_banner -loglevel panic -y -i {video_path} {audio_wav_path}'
    subprocess.call(mp4_to_wav.split())
    return audio_wav_path
# extract audio files from .mp4
extract_wav_from_mp4('/home/mrt/Projects/BMT/sample/women_long_jump.mp4', '/home/mrt/Projects/BMT/sample')
extract_wav_from_mp4('/home/mrt/Projects/BMT/sample/my_video.mp4', '/home/mrt/Projects/BMT/sample')
Running this code snippet does in fact create women_long_jump.wav and my_video.wav, and both audio files are playable and sound as expected.
Your original version (.mp4 -> .aac -> .wav):
import os
import subprocess
def which_ffmpeg() -> str:
    '''Determines the path to ffmpeg library
    Returns:
        str -- path to the library
    '''
    result = subprocess.run(['which', 'ffmpeg'], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    ffmpeg_path = result.stdout.decode('utf-8').replace('\n', '')
    return ffmpeg_path
def extract_wav_from_mp4(video_path: str, tmp_path: str) -> str:
    '''Extracts .wav file from .aac which is extracted from .mp4
    We cannot convert .mp4 to .wav directly. For this we do it in two stages: .mp4 -> .aac -> .wav
    Args:
        video_path (str): Path to a video
        tmp_path (str): Directory where the temporary audio files are saved
    Returns:
        [str, str] -- path to the .wav and .aac audio
    '''
    assert which_ffmpeg() != '', 'Is ffmpeg installed? Check if the conda environment is activated.'
    assert video_path.endswith('.mp4'), 'The file does not end with .mp4. Comment this if expected'
    # extract video filename from the video_path
    video_filename = os.path.split(video_path)[-1].replace('.mp4', '')
    # the temp files will be saved in `tmp_path` with the same name
    audio_aac_path = os.path.join(tmp_path, f'{video_filename}.aac')
    audio_wav_path = os.path.join(tmp_path, f'{video_filename}.wav')
    # constructing shell commands and calling them
    mp4_to_acc = f'{which_ffmpeg()} -hide_banner -loglevel panic -y -i {video_path} -acodec copy {audio_aac_path}'
    aac_to_wav = f'{which_ffmpeg()} -hide_banner -loglevel panic -y -i {audio_aac_path} {audio_wav_path}'
    subprocess.call(mp4_to_acc.split())
    subprocess.call(aac_to_wav.split())
    return audio_wav_path, audio_aac_path
# extract audio files from .mp4
audio_wav_path, audio_aac_path = extract_wav_from_mp4('/home/mrt/Projects/BMT/sample/women_long_jump.mp4', '/home/mrt/Projects/BMT/sample')
audio_wav_path, audio_aac_path = extract_wav_from_mp4('/home/mrt/Projects/BMT/sample/my_video.mp4', '/home/mrt/Projects/BMT/sample')
Running this code snippet produces women_long_jump.aac, women_long_jump.wav, and my_video.aac, but it does not create the expected my_video.wav.
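Part of what makes this hard to diagnose is that the snippet fails silently: -loglevel panic hides ffmpeg's messages and the return value of subprocess.call is discarded. A minimal sketch of a stricter runner, assuming nothing beyond the standard library; run_checked is a hypothetical helper, not part of the repository:

```python
import os
import subprocess
from typing import List


def run_checked(cmd: List[str], expected_output: str) -> None:
    """Run a command and fail loudly (hypothetical helper, not part of the repo)."""
    result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    if result.returncode != 0:
        stderr_text = result.stderr.decode('utf-8', errors='replace')
        raise RuntimeError(f'{cmd[0]} exited with code {result.returncode}:\n{stderr_text}')
    # A zero exit code is not enough: also require a non-empty output file.
    if not os.path.isfile(expected_output) or os.path.getsize(expected_output) == 0:
        raise RuntimeError(f'{expected_output} is missing or empty')


# Example (paths are illustrative; '-loglevel error' keeps real errors visible):
# run_checked(['ffmpeg', '-hide_banner', '-loglevel', 'error', '-y',
#              '-i', 'my_video.aac', 'my_video.wav'], 'my_video.wav')
```

With a runner like this, the failing second stage would raise with ffmpeg's own error text instead of leaving a missing .wav to be discovered later.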
The content returned by running ffprobe on each file is the following:
women_long_jump.mp4:
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'women_long_jump.mp4':
Metadata:
major_brand : mp42
minor_version : 0
compatible_brands: isommp42
creation_time : 2018-05-06T18:03:25.000000Z
Duration: 00:00:35.16, start: 0.000000, bitrate: 535 kb/s
Stream #0:0(und): Video: h264 (Constrained Baseline) (avc1 / 0x31637661), yuv420p, 480x360 [SAR 1:1 DAR 4:3], 437 kb/s, 24.83 fps, 24.83 tbr, 10900 tbn, 49.66 tbc (default)
Metadata:
creation_time : 2018-05-06T18:03:25.000000Z
handler_name : ISO Media file produced by Google Inc. Created on: 05/06/2018.
vendor_id : [0][0][0][0]
Stream #0:1(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 95 kb/s (default)
Metadata:
creation_time : 2018-05-06T18:03:25.000000Z
handler_name : ISO Media file produced by Google Inc. Created on: 05/06/2018.
vendor_id : [0][0][0][0]
my_video.mp4:
Input #0, matroska,webm, from 'my_video.mp4':
Metadata:
ENCODER : Lavf58.76.100
Duration: 00:02:28.12, start: -0.007000, bitrate: 4237 kb/s
Stream #0:0(eng): Video: vp9 (Profile 2), yuv420p10le(tv, bt2020nc/bt2020/arib-std-b67), 1920x1080, SAR 1:1 DAR 16:9, 29.97 fps, 29.97 tbr, 1k tbn, 1k tbc (default)
Metadata:
DURATION : 00:02:28.081000000
Side data:
Mastering Display Metadata, has_primaries:1 has_luminance:1 r(0.6800,0.3200) g(0.2650,0.6900) b(0.1500 0.0600) wp(0.3127, 0.3290) min_luminance=0.005000, max_luminance=1000.000000
Stream #0:1(eng): Audio: opus, 48000 Hz, stereo, fltp (default)
Metadata:
DURATION : 00:02:28.121000000
women_long_jump.aac:
[aac @ 0x56196a07dec0] Estimating duration from bitrate, this may be inaccurate
Input #0, aac, from 'women_long_jump.aac':
Duration: 00:00:34.78, bitrate: 99 kb/s
Stream #0:0: Audio: aac (LC), 44100 Hz, stereo, fltp, 99 kb/s
my_video.aac:
[aac @ 0x55ca3de54ec0] Format aac detected only with low score of 1, misdetection possible!
my_video.aac: End of file
If I understand it correctly, your line of code expects a video containing an audio stream that uses the aac codec, and if it doesn't, it fails (or rather, it creates an .aac file with gibberish inside). However, I still believe you can extract a .wav file directly from the .mp4 file. As stated above, this worked for me and created .wav files that were playable and sounded as expected (I could create a PR if you consider it appropriate).
I will try to convert my_video.mp4 to the exact same format as women_long_jump.mp4 and see if it works that way. But I do not see why your code would not work for the .wav files I extracted. Could it be something else? I see your point that it must be something with the raw data that is fed to the network, but with my limited knowledge I can't see a reason why it fails with my .wav files.
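The ffprobe listings above already show the mismatch: women_long_jump.mp4 carries aac audio, my_video.mp4 carries opus. A minimal sketch of checking this before extraction, assuming ffprobe is on PATH; the function names are hypothetical and not part of the repository:

```python
import subprocess
from typing import List


def probe_audio_codec_cmd(path: str) -> List[str]:
    """Build an ffprobe command that prints only the first audio stream's codec name."""
    return [
        'ffprobe', '-v', 'error',
        '-select_streams', 'a:0',
        '-show_entries', 'stream=codec_name',
        '-of', 'default=noprint_wrappers=1:nokey=1',
        path,
    ]


def audio_codec(path: str) -> str:
    """Return e.g. 'aac' or 'opus'; requires ffprobe on PATH."""
    result = subprocess.run(probe_audio_codec_cmd(path), stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, check=True)
    return result.stdout.decode('utf-8').strip()


def can_stream_copy_to_aac(codec_name: str) -> bool:
    """Stream copy into a raw .aac file is only meaningful if the audio is already AAC."""
    return codec_name == 'aac'
```

For my_video.mp4 this check would presumably return False, which would explain why copying the stream into an .aac file produces something ffprobe cannot read.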
i think, the problem is with the video you are trying to use and yes it should work for any wav file. maybe your video is out of the domain of training videos.
> If I understand it correctly, your line of code expects a video containing an audio stream that uses an aac codec
this line of code expects the video to be mp4, then it extracts whatever the audio is encoded in and transcodes it to aac. it could be that your ffmpeg does not support transcoding to aac.
try to do the same on google colab or some other machine. if the ffmpeg can't transcode from x to aac, the installation does not support this codec.
are you sure your mp4 file is not .mkv?
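The container question can be answered without trusting the file extension. A minimal sketch that looks at the first bytes of the file, assuming the usual signatures: Matroska/WebM files begin with the EBML magic bytes 1A 45 DF A3, and MP4 files normally have an ftyp box at byte offset 4. The function names are hypothetical:

```python
def classify_header(head: bytes) -> str:
    """Guess the container from the first bytes of a file."""
    if head[:4] == b'\x1a\x45\xdf\xa3':
        return 'matroska/webm'
    if head[4:8] == b'ftyp':
        return 'mp4 (ISO base media)'
    return 'unknown'


def sniff_container(path: str) -> str:
    """Read the first 12 bytes of `path` and classify them."""
    with open(path, 'rb') as f:
        return classify_header(f.read(12))


# Example (path is illustrative):
# print(sniff_container('/path/to/my_video.mp4'))
```

This is only a heuristic; ffprobe remains the more reliable answer, and its first line ("Input #0, matroska,webm, from ...") reports the same thing.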
Thank you for your indications!
I will try it on another machine/environment and see if another ffmpeg version supports the transcoding.
You are right that the video was not originally an .mp4 file. The original video was downloaded with yt-dlp, which resulted in a .webm file. After discussing it with a more experienced colleague, I was told that it could be converted to an .mp4 container without issues. However, I remain unsure whether this could be the cause of the UNK tokens I was obtaining, as I was indeed able to extract a playable .wav file from it, and did not face any issue extracting i3d features.
I will work on your suggestions and let you know if they resolve the issue. Thank you very much for your time and consideration!
may i ask you try to transcode my video into your format vp9/opus etc and repeat your steps? do you get the same result?
if you are using youtube-dl, try to get a video with h264 and aac codecs and run on it
also, i realized that you use -acodec copy, which simply copies the codec (opus, instead of aac) for audio. can you specify aac there as suggested in https://github.com/v-iashin/BMT/issues/38?
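For illustration, the two ffmpeg commands with the audio re-encoded to aac instead of stream-copied could be built like this. This is a sketch under assumptions, not the repository's code: build_audio_cmds is a hypothetical name, and the commands are built as argument lists so that paths containing spaces survive (the original splits a formatted string on whitespace):

```python
from pathlib import Path
from typing import List, Tuple


def build_audio_cmds(video_path: str, tmp_path: str,
                     ffmpeg: str = 'ffmpeg') -> Tuple[List[str], List[str]]:
    """Return (video -> .aac, .aac -> .wav) commands; hypothetical helper."""
    stem = Path(video_path).stem  # 'my_video' for '/x/my_video.mp4'
    aac_path = str(Path(tmp_path) / f'{stem}.aac')
    wav_path = str(Path(tmp_path) / f'{stem}.wav')
    common = [ffmpeg, '-hide_banner', '-loglevel', 'panic', '-y']
    # '-acodec aac' re-encodes whatever the source audio is (e.g. opus) to AAC,
    # whereas '-acodec copy' would write the original stream into the .aac file.
    video_to_aac = common + ['-i', video_path, '-acodec', 'aac', aac_path]
    aac_to_wav = common + ['-i', aac_path, wav_path]
    return video_to_aac, aac_to_wav


# Example (paths are illustrative):
# to_aac, to_wav = build_audio_cmds('/data/my_video.mp4', '/tmp')
# subprocess.call(to_aac); subprocess.call(to_wav)
```

Re-encoding costs a little time and one lossy generation, but the intermediate .aac is then valid regardless of the source codec.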
I changed line 28 as suggested in https://github.com/v-iashin/BMT/issues/38 and that seems to resolve the issue with extracting the appropriate .aac.
Before the change (my_video.aac):
[aac @ 0x55ca3de54ec0] Format aac detected only with low score of 1, misdetection possible!
my_video.aac: End of file
After the change (my_video.aac):
[aac @ 0x5570d77c2ec0] Estimating duration from bitrate, this may be inaccurate
Input #0, aac, from 'my_video.aac':
Duration: 00:02:32.99, bitrate: 127 kb/s
Stream #0:0: Audio: aac (LC), 48000 Hz, stereo, fltp, 127 kb/s
Unfortunately, I won't be able to try your other suggestions until Monday. I will update you once I do.
Have a great weekend!
sure, have a great weekend.
did you try to run the prediction script where you were getting unks?
I was not able to get that far today, unfortunately. I was able to download the video with an h264 video codec using yt-dlp -S vcodec:h264 <url> and extract vggish features from it. However, I was unable to download the video with an aac audio codec (only opus and mp4a are available when running yt-dlp -F --list-formats <url>).
On Monday I will run the prediction script and try to get it to work however I can. Thank you very much once again for your time and consideration, Vladimir.
Hello, Vladimir.
I tried converting my_video.webm to the exact same format as women_long_jump.mp4 by doing the following:
ffmpeg -i my_video.webm -c:v libx264 -c:a aac -b:a 160k -crf 20 -preset slow -vf format=yuv420p -movflags +faststart my_video.mp4
This resulted in a my_video.mp4 video with the exact same format as women_long_jump.mp4.
> ffprobe my_video.mp4
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'my_video.mp4':
Metadata:
major_brand : isom
minor_version : 512
compatible_brands: isomiso2avc1mp41
encoder : Lavf58.76.100
Duration: 00:02:28.10, start: 0.000000, bitrate: 5550 kb/s
Stream #0:0(eng): Video: h264 (High) (avc1 / 0x31637661), yuv420p(tv, bt2020nc/bt2020/arib-std-b67), 1920x1080 [SAR 1:1 DAR 16:9], 5382 kb/s, 29.97 fps, 29.97 tbr, 30k tbn, 59.94 tbc (default)
Metadata:
handler_name : VideoHandler
vendor_id : [0][0][0][0]
Side data:
Mastering Display Metadata, has_primaries:1 has_luminance:1 r(0.6800,0.3200) g(0.2650,0.6900) b(0.1500 0.0600) wp(0.3127, 0.3290) min_luminance=0.005000, max_luminance=1000.000000
Stream #0:1(eng): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 160 kb/s (default)
Metadata:
handler_name : SoundHandler
vendor_id : [0][0][0][0]
However, extracting vggish and i3d features from it and running BMT/sample/single_video_prediction.py still resulted in UNK tokens:
python ./sample/single_video_prediction.py --prop_generator_model_path ./sample/best_prop_model.pt --pretrained_cap_model_path ./sample/best_cap_model.pt --vggish_features_path ./sample/my_video_vggish.npy --rgb_features_path ./sample/my_video_rgb.npy --flow_features_path ./sample/my_video_flow.npy --duration_in_secs 148.121 --device_id 0 --max_prop_per_vid 100
Contructing caption_iterator for "train" phase
Using vanilla Generator
initialization: xavier
Glove emb of the same size as d_model_caps
Pretrained caption path:
./sample/best_cap_model.pt
[{'start': 0.0, 'end': 148.1, 'sentence': ' unk unk unk unk unk unk unk unk unk unk unk unk unk unk unk unk unk unk unk unk unk unk unk unk unk unk unk unk unk unk '}, ...]
(every remaining entry in the list is identical to the first: the same 0.0 to 148.1 span and the same all-unk sentence)
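The output above shows two symptoms at once: every caption consists only of unk tokens, and every proposal spans the whole video (0.0 to 148.1). A minimal sketch for quantifying both on the list that the script prints, assuming it is available as a Python list of dicts with 'start', 'end' and 'sentence' keys as shown; the helper names are hypothetical:

```python
from typing import Dict, List, Set, Tuple


def unk_fraction(predictions: List[Dict]) -> float:
    """Fraction of caption tokens that are 'unk' across all proposals."""
    tokens = [tok for p in predictions for tok in p['sentence'].split()]
    if not tokens:
        return 0.0
    return sum(tok == 'unk' for tok in tokens) / len(tokens)


def distinct_spans(predictions: List[Dict]) -> Set[Tuple[float, float]]:
    """Set of distinct (start, end) proposal spans."""
    return {(p['start'], p['end']) for p in predictions}


# Example with two entries shaped like the output above:
preds = [
    {'start': 0.0, 'end': 148.1, 'sentence': ' unk unk unk '},
    {'start': 0.0, 'end': 148.1, 'sentence': ' unk unk unk '},
]
print(unk_fraction(preds), distinct_spans(preds))
```

A single distinct span across all proposals may suggest that the proposal generator is also receiving unusable features, not only the captioning model, though that is a guess rather than something this thread establishes.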
I then tried the opposite, converting women_long_jump.mp4 to the exact same format as my_video.webm by doing:
ffmpeg -i women_long_jump.mp4 -c:v libvpx-vp9 -c:a libopus -b:v 0 -crf 20 women_long_jump_transcoded.webm
After the transcoding, I simply renamed the file from women_long_jump_transcoded.webm to women_long_jump_transcoded.mp4 because there are some assert statements in the code that check for .mp4 files. The resulting file:
> ffprobe women_long_jump_transcoded.mp4
Input #0, matroska,webm, from 'women_long_jump_transcoded.mp4':
Metadata:
COMPATIBLE_BRANDS: isommp42
MAJOR_BRAND : mp42
MINOR_VERSION : 0
ENCODER : Lavf58.76.100
Duration: 00:00:35.16, start: -0.007000, bitrate: 697 kb/s
Stream #0:0: Video: vp9 (Profile 0), yuv420p(tv, progressive), 480x360, SAR 1:1 DAR 4:3, 24.83 fps, 24.83 tbr, 1k tbn, 1k tbc (default)
Metadata:
HANDLER_NAME : ISO Media file produced by Google Inc. Created on: 05/06/2018.
VENDOR_ID : [0][0][0][0]
ENCODER : Lavc58.134.100 libvpx-vp9
DURATION : 00:00:35.086000000
Stream #0:1: Audio: opus, 48000 Hz, stereo, fltp (default)
Metadata:
HANDLER_NAME : ISO Media file produced by Google Inc. Created on: 05/06/2018.
VENDOR_ID : [0][0][0][0]
ENCODER : Lavc58.134.100 libopus
DURATION : 00:00:35.163000000
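Renaming a .webm to .mp4 changes only the file name; the ffprobe output above still reports matroska,webm as the container. For comparing two files field by field rather than by eye, ffprobe can emit JSON. A minimal sketch, assuming ffprobe is on PATH and that only the container name and per-stream codec are of interest; the helper names are hypothetical:

```python
import json
from typing import Dict, List


def ffprobe_json_cmd(path: str) -> List[str]:
    """Build an ffprobe command that prints container and stream info as JSON."""
    return ['ffprobe', '-v', 'error', '-print_format', 'json',
            '-show_format', '-show_streams', path]


def summarize_probe(probe_json: str) -> Dict:
    """Reduce ffprobe's JSON to the container name and (type, codec) per stream."""
    data = json.loads(probe_json)
    return {
        'container': data['format']['format_name'],
        'streams': [(s.get('codec_type'), s.get('codec_name')) for s in data['streams']],
    }


# Example (requires ffprobe; path is illustrative):
# import subprocess
# out = subprocess.run(ffprobe_json_cmd('my_video.mp4'), stdout=subprocess.PIPE, check=True)
# print(summarize_probe(out.stdout.decode('utf-8')))
```

Two files have matching formats in this narrow sense when their summaries are equal; profile, resolution and bitrate can still differ.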
I extracted vggish and i3d features from it and ran BMT/sample/single_video_prediction.py, which returned:
python ./sample/single_video_prediction.py --prop_generator_model_path ./sample/best_prop_model.pt --pretrained_cap_model_path ./sample/best_cap_model.pt --vggish_features_path ./sample/women_long_jump_transcoded_vggish.npy --rgb_features_path ./sample/women_long_jump_transcoded_rgb.npy --flow_features_path ./sample/women_long_jump_transcoded_flow.npy --duration_in_secs 35.163 --device_id 0 --max_prop_per_vid 100
Contructing caption_iterator for "train" phase
Using vanilla Generator
initialization: xavier
Glove emb of the same size as d_model_caps
Pretrained caption path:
./sample/best_cap_model.pt
[{'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 
35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The 
man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk 
around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and 
down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, 
{'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}, {'start': 0.0, 'end': 35.2, 'sentence': 'The man continues to walk around the area and down the area'}]
Seeing these results, I am more inclined to believe that this specific video is indeed outside the domain your networks were trained on. However, the video shows people, a track-like floor, and other things that I would guess are similar to what the networks saw during training. It does not seem to be that different from the women_long_jump.mp4 video.
Can you see any other reason why this might be?
EDIT
I noticed that the dense captions I generated for the women_long_jump_transcoded.mp4 video were all the same (i.e., 'The man continues to walk around the area and down the area'), and different from the ones I got when running BMT/sample/single_video_prediction.py on the original women_long_jump.mp4 file. So maybe there is something else going on besides the domain.
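To confirm the duplication without reading the whole list by eye, the captions can be counted. This is only a minimal sketch: caption_histogram and distinct_segments are helper names I made up for illustration, and the snippet assumes the predictions are a list of dicts with 'start', 'end', and 'sentence' keys, as in the output printed above. The two entries below are a short excerpt in that format, not the full output.

```python
from collections import Counter


def caption_histogram(predictions):
    """Count how often each caption text occurs in a list of proposal dicts."""
    return Counter(p['sentence'] for p in predictions)


def distinct_segments(predictions):
    """Return the set of distinct (start, end) pairs among the proposals."""
    return {(p['start'], p['end']) for p in predictions}


# Short excerpt in the same format as the printed output (assumed structure).
predictions = [
    {'start': 0.0, 'end': 35.2,
     'sentence': 'The man continues to walk around the area and down the area'},
    {'start': 0.0, 'end': 35.2,
     'sentence': 'The man continues to walk around the area and down the area'},
]

hist = caption_histogram(predictions)
segments = distinct_segments(predictions)
print(len(hist), 'distinct caption(s),', len(segments), 'distinct segment(s)')
for sentence, count in hist.most_common():
    print(count, 'x', sentence)
```

If both counts come out as 1 for the full list, every proposal covers the same segment with the same caption, which points at the proposals or the input features rather than at the captioning text alone.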
Hello, Vladimir.
First of all, congratulations on such a fantastic project. I was introduced to this work through the many papers that cite it and build upon it. I enjoyed your video presentation, and I think you are doing a very good job of keeping up with all the repo issues.
I ran the sample code single_video_prediction.py on the given example (women_long_jump.mp4) without major issues (I had to change the CUDA and PyTorch versions from the conda environment, as reported in https://github.com/v-iashin/BMT/issues/45).
However, when I tried the code on a custom video, let's call it my_video.mp4, I got some errors.
VGGish was unable to extract a .wav file from the audio because it had no aac codec (I checked with ffprobe my_video.mp4 and the audio used the opus codec instead of aac). So, I changed these 2 lines in BMT/submodules/video_features/models/vggish/utils/utils.py to the following, which resolved the issue:
After obtaining the i3d and vggish features, I tried running BMT on the video using the following command:
Obtaining:
Seeing that it was iterating over a 0-d tensor, I tried removing the NMS and ran it again with:
Obtaining a list of sentences with the token "UNK":
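As a side note on the codec check mentioned above, the ffprobe call can be scripted so that the codec of the first audio stream is known before VGGish runs. This is a minimal sketch, not part of the repository: build_ffprobe_cmd and audio_codec are hypothetical helper names, and it assumes ffprobe is on the PATH of the active conda environment.

```python
import subprocess


def build_ffprobe_cmd(video_path):
    """Build an ffprobe command that prints only the codec name of the
    first audio stream (hypothetical helper, not part of BMT)."""
    return [
        'ffprobe', '-v', 'error',
        '-select_streams', 'a:0',              # first audio stream only
        '-show_entries', 'stream=codec_name',  # report just the codec name
        '-of', 'default=noprint_wrappers=1:nokey=1',
        video_path,
    ]


def audio_codec(video_path):
    """Run ffprobe and return the audio codec name as a string.

    Raises subprocess.CalledProcessError if ffprobe fails, for example
    when the file does not exist.
    """
    result = subprocess.run(
        build_ffprobe_cmd(video_path),
        stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=True,
    )
    return result.stdout.decode('utf-8').strip()


# Usage (needs ffprobe installed and a real file):
# print(audio_codec('my_video.mp4'))
```

An empty string from audio_codec would suggest the file has no audio stream at all, which is worth ruling out before extracting VGGish features.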
I am a bit at a loss here, as I do not have much experience working with text and audio (only with image and video). Could you point me in the right direction? I am unsure of what the root cause might be. I suspect it could be one of the following:
- Using torch 1.4.0 instead of 1.2.0, as it was the closest version that could work with my GPU. I kept torchtext at version 0.3.1 (same as in yours). However, the code works for the example video you provide, so it seems unlikely that this is the root cause.
- Extracting the .wav file directly from the .mp4, skipping the intermediate step of obtaining an .aac file. I do not see any drawback in doing so; in fact, it seems like a more portable option. However, I remain unsure whether you did this for a specific reason I am unaware of.
Desktop (please complete the following information):
Your conda environment