v-iashin / BMT

Source code for "Bi-modal Transformer for Dense Video Captioning" (BMVC 2020)
https://v-iashin.github.io/bmt
MIT License

Multilingual Audio #8

Closed amil-rp-work closed 3 years ago

amil-rp-work commented 3 years ago

Hey @v-iashin, thanks for open-sourcing such awesome work!!! Kudos to you on this and MDVC. Since my videos are not in English but I need the captions in English, how do I go about utilizing your work here? Is there a way to ignore the audio features entirely and use only the image features?

v-iashin commented 3 years ago

Hi, thank you for the warm words!

  1. I don't think using videos with a different language of narration is a problem. The audio track carries not only speech but also the sounds of actions, background noise, etc. So I wouldn't worry about it too much without running an experiment first.
  2. To use visual features only, you may specify --model transformer (the default is av_transformer). Here is the script we used for the ablation study (treat it as guidance only – I don't expect it to run without errors, since I refactored the code a bit afterwards and some argument names may have changed, etc.)

Pre-training the captioning module

$env_python scripts/train_captioning_module.py \
    --train_meta_path ./data/train_meta.csv \
    --val_1_meta_path ./data/val_1_meta.csv \
    --val_2_meta_path ./data/val_2_meta.csv \
    --video_feature_name i3d \
    --video_features_path ./data/i3d_25fps_stack64step64_2stream_npy \
    --log_dir ./logs/ablation/V_only_cap \
    --d_vid 1024 \
    --d_model_video 1024 \
    --d_model_caps 300 \
    --d_model 1024 \
    --model transformer \
    --modality video \
    --optimizer adam \
    --dout_p 0.1 \
    --N 2 \
    --smoothing 0.7 \
    --lr 5e-5 \
    --B 32 \
    --one_by_one_starts_at 1 \
    --betas 0.9 0.999 \
    --H 4 \
    --word_emb_caps glove.840B.300d \
    --device_ids 0

Training proposal generator with pre-trained encoder from the captioning module

$env_python scripts/train_proposal_generator.py \
    --train_json_path ./data/train.json \
    --train_meta_path ./data/train_meta.csv \
    --val_1_meta_path ./data/val_1_meta.csv \
    --val_2_meta_path ./data/val_2_meta.csv \
    --log_dir ./logs/ablation/V_only_with_pretr_on_cap_V \
    --video_feature_name i3d \
    --video_features_path ./data/i3d_25fps_stack64step64_2stream_npy \
    --feature_timespan_in_fps 64 \
    --fps_at_extraction 25 \
    --modality video \
    --pretrained_cap_model_path ./logs/ablation/V_only_cap/best_model.pt \
    --loc_model transformer \
    --d_vid 1024 \
    --d_model_caps 1024 \
    --d_model 1024 \
    --dout_p 0.1 \
    --noobj_coeff 100 \
    --smoothing 0.7 \
    --epoch_num 70 \
    --obj_coeff 1 \
    --one_by_one_starts_at 5000 \
    --N 2 \
    --B 16 \
    --inf_B_coeff 2 \
    --lr 5e-5 \
    --pad_video_feats_up_to 300 \
    --max_prop_per_vid 100 \
    --optimizer adam \
    --betas 0.9 0.999 \
    --anchors_num_video 128 \
    --kernel_sizes_video 1 5 9 13 19 25 35 45 61 79 \
    --conv_layers_video 512 512 \
    --device_ids 0
amil-rp-work commented 3 years ago

Thanks for the detailed inputs @v-iashin! For your points:

v-iashin commented 3 years ago

You should mind the domain shift between TikTok videos and the videos in ActivityNet Captions. So I don't think the issue is related to non-English audio tracks. And, yes, it is not 100% accurate – but neither is any other model. That said, I have no evidence that dropping the audio stream would work any better in your case. At the same time, I would suggest looking for a captioned dataset that is closer to the one you want to apply the model to.

amil-rp-work commented 3 years ago

Got it @v-iashin, thanks a lot for your inputs!!! You have been really helpful! Are you aware of a similar dataset in this domain?

v-iashin commented 3 years ago

Nope, let me know if you find/collect one 🙂

Closing this for now.

amil-rp-work commented 3 years ago

Hey @v-iashin, as I explained earlier, I now have a dataset of about ~7000 videos with one caption sentence per video. I am planning to train the model on my dataset, for which I had the following doubts:

v-iashin commented 3 years ago

Hi,

Great!

amil-rp-work commented 3 years ago

I am currently running the I3D feature-extraction pipeline on a ~10k-video dataset. However, it seems to be extremely slow:

# command I used
python3 main.py --feature_type i3d --device_ids 0 --file_with_video_paths ../file_paths.txt

0%|                                                                               | 2/9970 [12:30<1186:19:09, 428.45s/it]

The following is what I get from nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000001:00:00.0 Off |                    0 |
| N/A   32C    P0   116W / 250W |   8471MiB / 16160MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Any tips on how I can speed this up?

v-iashin commented 3 years ago

Hm. Well, it could be slow but not this slow.

Can you try adding prints to the code to see what takes most of the time? Also, how many CPU cores do you have, and what is their load?
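
Something as rough as this would already tell a lot. The function names below are just placeholders for the corresponding steps in main.py, not the actual code – wrap the timer around whatever does the frame extraction and the I3D forward passes:

import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # Prints how long the wrapped block took.
    start = time.perf_counter()
    yield
    print(f'{label}: {time.perf_counter() - start:.1f} s')

# Usage inside the per-video loop (extract_frames / run_i3d are placeholders):
#   with timed('frame extraction (CPU)'):
#       frames = extract_frames(video_path)
#   with timed('I3D forward passes (GPU)'):
#       features = run_i3d(frames)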

How long does it take to run the examples from the repo (video_features)? For instance:

cd video_features
# make sure to be on a specific commit (4fa02bd5c5b8c34081dcfb609e2bcd5a973eaab2)
(i3d) python main.py --feature_type i3d --device_ids 0 --extraction_fps 25 --stack_size 24 --step_size 24 --pwc_path ./i3d/checkpoints/network-default.pytorch --video_paths ./sample/v_ZNVhz7ctTq0.mp4

How long does it take?

amil-rp-work commented 3 years ago

Some of my CPU stats:

CPU(s):              6
On-line CPU(s) list: 0-5
Thread(s) per core:  1
Socket(s):           1
NUMA node(s):        1
Model name:          Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
NUMA node0 CPU(s):   0-5

For the sample video, it took A LOT of time – at roughly a minute per video, ~9000 videos would take approx 9000 minutes, i.e. about 6 days?

python3 main.py --feature_type i3d --device_ids 0 --extraction_fps 25 --stack_size 24 --step_size 24 --video_paths ./sample/v_ZNVhz7ctTq0.mp4./output
  0%|                                                                                                | 0/1 [00:00<?, ?it/s]{'type': 'separately_rgb_flow', 'rgb': tensor([[0.0541, 0.4469, 0.2189,  ..., 0.0137, 0.0157, 0.3653],
        [0.0832, 0.4049, 0.1971,  ..., 0.0381, 0.0672, 0.0771],
        [0.0514, 0.5042, 0.3255,  ..., 0.1320, 0.1236, 0.1770],
        ...,
        [0.0697, 0.3148, 0.3344,  ..., 0.2917, 0.3177, 0.4807],
        [0.1460, 0.5235, 0.4219,  ..., 0.2835, 0.2133, 0.4215],
        [0.3040, 0.3673, 0.2725,  ..., 0.3369, 0.1261, 0.1997]],
       device='cuda:0'), 'flow': tensor([[0.1192, 0.0772, 0.0517,  ..., 0.2946, 0.0405, 0.0159],
        [0.2450, 0.1872, 0.0581,  ..., 0.0316, 0.2836, 0.0223],
        [0.3013, 0.0491, 0.0040,  ..., 0.1725, 0.2293, 0.2051],
        ...,
        [0.4705, 0.1198, 0.4183,  ..., 0.0238, 0.1565, 0.4061],
        [1.0578, 0.1883, 0.4074,  ..., 0.0052, 0.0598, 0.3931],
        [0.6895, 0.1655, 0.3079,  ..., 0.0628, 0.0267, 0.6614]],
       device='cuda:0')}
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [01:02<00:00, 62.90s/it]

I removed --pwc_path because it threw this error: FileNotFoundError: [Errno 2] No such file or directory: './i3d/checkpoints/network-default.pytorch'

I think I might have a clue: the videos are 540x960 at 60 fps, with an average length of about 60 seconds. Is that too much processing to handle?

v-iashin commented 3 years ago

In my case, it takes 7 seconds (32 threads, 2080 Ti).

Yep, I think the bottleneck is the CPU. Overall, if you are renting the machine, save some $$ on the GPU and get more CPU cores and more GPUs (2x P100 > 1x V100) – and check the GPU load (I think it is mostly idle).

Also, try to keep the temp directory (with the extracted frames) on a disk with fast I/O (HDD < SSD < NVMe) – or even the whole video collection.

Anyway, the resolution is not a problem, but the fps is. You will cut the time in half just by specifying --extraction_fps 25, which is the default fps for the original I3D model. So I would recommend using it.
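
For instance, on your file list it would be the same command you used before, just with the fps flag added (I haven't run this exact line, so treat it as a sketch):

python3 main.py --feature_type i3d --device_ids 0 --extraction_fps 25 --file_with_video_paths ../file_paths.txt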

Let me know how it goes.

As a side note, the path was wrong (thanks for the catch):

--pwc_path ./models/i3d/checkpoints/network-default.pytorch
amil-rp-work commented 3 years ago

Cool, got it. I will try to get the CPU and HDD situation sorted, and I will also keep the fps in mind. Just a note: since my videos are 60 fps, I presume lowering the extraction fps to, say, 30 will skip alternate frames.

Also, some newbie questions about the GPU load and the args:

Once I incorporate the CPU, SSD and other changes, I will update you with the results 👍

v-iashin commented 3 years ago

I presume lowering it to say 30 will skip alternate frames.

Yes, you can see how ffmpeg does it by:

ffmpeg -i {video_path} -filter:v fps=fps={extraction_fps} {new_path}

It will save the video at extraction_fps to new_path. Open the video and see for yourself.
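
For example, with made-up file names, re-encoding a 60 fps clip at 25 fps would be:

ffmpeg -i input_60fps.mp4 -filter:v fps=fps=25 input_25fps.mp4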

How did you come about the conclusion it's idle

Well, because a V100 > a 2080 Ti, in my case the load is not at 100%, and my run finishes in 7 seconds while yours takes 60. Therefore, I just made a guess.

How will increasing the number of GPUs reduce the idle time?

There are two parts to the process: extracting/loading frames (done on the CPU) and feature extraction (done on the GPU). If you have two GPUs, you might end up in a situation where one GPU is idle while the CPU works for it (extracting/loading frames), and the other GPU is busy while the CPU is only partially loaded (feature extraction). Since the two parts use different resources, you might benefit from having two GPUs. However, putting the price difference between a V100 and a P100 into a better CPU might actually be a better idea.
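
Just to illustrate the idea – this is a generic sketch, not how video_features is implemented: with two GPUs you can shard the video list and run one worker per device, so the CPU phase of one worker overlaps with the GPU phase of the other:

import multiprocessing as mp

def worker(device_id, video_paths):
    # Each worker alternates between CPU work (decoding frames) and GPU work
    # (running the model on its own device); with two workers, the CPU phase
    # of one can overlap with the GPU phase of the other.
    for path in video_paths:
        print(f'device {device_id}: would extract features for {path}')

if __name__ == '__main__':
    videos = [f'video_{i:04d}.mp4' for i in range(8)]  # dummy file names
    procs = [mp.Process(target=worker, args=(gpu, videos[gpu::2])) for gpu in (0, 1)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()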

You just need to profile your setup a bit and see what takes most of the time. On the video_features master branch I have an updated I3D pipeline where the re-encoded video is saved as an mp4 with a different fps instead of lots of frame files, which might be faster in your case. I do not guarantee the features will be the same as with the BMT commit, but since you are working on something other than a replication of BMT, I think you could try it instead.

What impact is created by reducing/increasing step & stack size?

Usually, reducing these slows extraction down, and increasing them speeds it up. Since you have ~60-second videos, I would go with the default parameters (and 25 fps instead of 60).

amil-rp-work commented 3 years ago

@v-iashin Thanks for such a detailed explanation!!! 👍 I have successfully incorporated your suggestions and extracted both VGGish and I3D features for my videos. Now that the features are extracted, I can't find instructions on the .csv and .json files I need to create. Since each video in my dataset has a single caption, how do I go about creating the training files required, apart from the extracted features?

v-iashin commented 3 years ago

Yep, there aren't, since I regard video_features and BMT as separate projects. Also, you may check how the train and val files are formed. I would start by replacing the start/end fields in the JSON files with something gibberish and seeing whether it throws an error along the way.
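
As a rough starting point, a single-caption JSON in the ActivityNet-Captions-like layout could be generated along these lines. The field names (duration, timestamps, sentences) and the single full-length segment are assumptions here – double-check them against the actual ./data/train.json before training:

import json

# Hypothetical input: one caption per video plus its duration in seconds.
my_videos = [
    {'video_id': 'v_000001', 'duration': 61.2, 'caption': 'A person does something.'},
    {'video_id': 'v_000002', 'duration': 58.7, 'caption': 'Another short description.'},
]

dataset = {}
for v in my_videos:
    dataset[v['video_id']] = {
        'duration': v['duration'],
        # One caption per video -> one segment spanning the whole clip.
        'timestamps': [[0.0, v['duration']]],
        'sentences': [v['caption']],
    }

with open('my_train.json', 'w') as f:
    json.dump(dataset, f)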

I am afraid I cannot provide support for your project, since it is about video captioning while this work is about dense video captioning. Please close the issue if you don't have any more questions on the topic of this issue.

amil-rp-work commented 3 years ago

@v-iashin Thanks a lot for all the help, and sorry for the inconvenience 😅 I will close the issue; I really appreciate your patience and all the help 👍 I hope you won't mind if I end up asking you some questions again!