v-iashin / BMT

Source code for "Bi-modal Transformer for Dense Video Captioning" (BMVC 2020)
https://v-iashin.github.io/bmt
MIT License

Multilingual Audio #8

Closed amil-rp-work closed 3 years ago

amil-rp-work commented 3 years ago

Hey @v-iashin, thanks for open-sourcing such awesome work!!! Kudos to you on this and MDVC. Since my videos are not in English but I need the captions in English, how do I go about utilizing your work here? Is there a way to ignore the audio features entirely and use only the image features?

v-iashin commented 3 years ago

Hi, thank you for the warm words!

  1. I don't think using videos with a different language of narration is a problem. The audio track carries not only speech but also the sounds of actions, background noise, etc. So I wouldn't worry about it too much without running an experiment first.
  2. To use visual features only, you may specify --model transformer (the default is av_transformer). Here is the script we used for the ablation study (treat it as guidance only – I don't expect it to run without errors, since I refactored the code a bit afterwards and some argument names may have changed, etc.)

Pre-training the captioning module

$env_python scripts/train_captioning_module.py \
    --train_meta_path ./data/train_meta.csv \
    --val_1_meta_path ./data/val_1_meta.csv \
    --val_2_meta_path ./data/val_2_meta.csv \
    --video_feature_name i3d \
    --video_features_path ./data/i3d_25fps_stack64step64_2stream_npy \
    --log_dir ./logs/ablation/V_only_cap \
    --d_vid 1024 \
    --d_model_video 1024 \
    --d_model_caps 300 \
    --d_model 1024 \
    --model transformer \
    --modality video \
    --optimizer adam \
    --dout_p 0.1 \
    --N 2 \
    --smoothing 0.7 \
    --lr 5e-5 \
    --B 32 \
    --one_by_one_starts_at 1 \
    --betas 0.9 0.999 \
    --H 4 \
    --word_emb_caps glove.840B.300d \
    --device_ids 0

Training proposal generator with pre-trained encoder from the captioning module

$env_python scripts/train_proposal_generator.py \
    --train_json_path ./data/train.json \
    --train_meta_path ./data/train_meta.csv \
    --val_1_meta_path ./data/val_1_meta.csv \
    --val_2_meta_path ./data/val_2_meta.csv \
    --log_dir ./logs/ablation/V_only_with_pretr_on_cap_V \
    --video_feature_name i3d \
    --video_features_path ./data/i3d_25fps_stack64step64_2stream_npy \
    --feature_timespan_in_fps 64 \
    --fps_at_extraction 25 \
    --modality video \
    --pretrained_cap_model_path ./logs/ablation/V_only_cap/best_model.pt \
    --loc_model transformer \
    --d_vid 1024 \
    --d_model_caps 1024 \
    --d_model 1024 \
    --dout_p 0.1 \
    --noobj_coeff 100 \
    --smoothing 0.7 \
    --epoch_num 70 \
    --obj_coeff 1 \
    --one_by_one_starts_at 5000 \
    --N 2 \
    --B 16 \
    --inf_B_coeff 2 \
    --lr 5e-5 \
    --pad_video_feats_up_to 300 \
    --max_prop_per_vid 100 \
    --optimizer adam \
    --betas 0.9 0.999 \
    --anchors_num_video 128 \
    --kernel_sizes_video 1 5 9 13 19 25 35 45 61 79 \
    --conv_layers_video 512 512 \
    --device_ids 0
amil-rp-work commented 3 years ago

Thanks for the detailed inputs @v-iashin! For your points:

v-iashin commented 3 years ago

You should mind the domain shift between TikTok videos and the videos in ActivityNet Captions. So I don't think the issue is related to non-English audio tracks. And, yes, it is not 100% accurate – but neither is any other model. That said, I have no evidence that dropping the audio stream would work any better in your case. At the same time, I would suggest looking for a captioned dataset that is closer to the one you want to apply the model to.

amil-rp-work commented 3 years ago

Got it @v-iashin, thanks a lot for your inputs!!! You have been really helpful! Are you aware of a similar dataset in this domain?

v-iashin commented 3 years ago

Nope, let me know if you find/collect one 🙂

Closing this for now.

amil-rp-work commented 3 years ago

Hey @v-iashin, as I explained earlier, I now have a dataset of about ~7000 videos with one caption sentence per video. I am planning to train the model on my dataset, for which I had the following doubts:

v-iashin commented 3 years ago

Hi,

Great!

amil-rp-work commented 3 years ago

I am currently running the I3D feature-extraction pipeline on a ~10k-video dataset. However, it seems to be extremely slow:

# command I used
python3 main.py --feature_type i3d --device_ids 0 --file_with_video_paths ../file_paths.txt

0%|                                                                               | 2/9970 [12:30<1186:19:09, 428.45s/it]

The following is what I get from nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000001:00:00.0 Off |                    0 |
| N/A   32C    P0   116W / 250W |   8471MiB / 16160MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Any tips on how I can speed this up?

v-iashin commented 3 years ago

Hm. Well, it could be slow but not this slow.

Can you try adding prints to the code to see what takes most of the time? Also, how many CPU cores do you have, and what is their load?
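
Something as rough as this would already tell a lot. The function names below are just placeholders for the corresponding steps in main.py, not the actual code – wrap the timer around whatever does the frame extraction and the I3D forward passes:

import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # Prints how long the wrapped block took.
    start = time.perf_counter()
    yield
    print(f'{label}: {time.perf_counter() - start:.1f} s')

# Usage inside the per-video loop (extract_frames / run_i3d are placeholders):
#   with timed('frame extraction (CPU)'):
#       frames = extract_frames(video_path)
#   with timed('I3D forward passes (GPU)'):
#       features = run_i3d(frames)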

How long does it take to run the examples from the repo (video_features)? For instance:

cd video_features
# make sure to be on a specific commit (4fa02bd5c5b8c34081dcfb609e2bcd5a973eaab2)
(i3d) python main.py --feature_type i3d --device_ids 0 --extraction_fps 25 --stack_size 24 --step_size 24 --pwc_path ./i3d/checkpoints/network-default.pytorch --video_paths ./sample/v_ZNVhz7ctTq0.mp4

How long does it take?

amil-rp-work commented 3 years ago

Some of my CPU stats:

CPU(s):              6
On-line CPU(s) list: 0-5
Thread(s) per core:  1
Socket(s):           1
NUMA node(s):        1
Model name:          Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
NUMA node0 CPU(s):   0-5

For the sample video, it took A LOT of time – at roughly a minute per video, ~9000 videos would take approx 9000 minutes, i.e. about 6 days?

python3 main.py --feature_type i3d --device_ids 0 --extraction_fps 25 --stack_size 24 --step_size 24 --video_paths ./sample/v_ZNVhz7ctTq0.mp4./output
  0%|                                                                                                | 0/1 [00:00<?, ?it/s]{'type': 'separately_rgb_flow', 'rgb': tensor([[0.0541, 0.4469, 0.2189,  ..., 0.0137, 0.0157, 0.3653],
        [0.0832, 0.4049, 0.1971,  ..., 0.0381, 0.0672, 0.0771],
        [0.0514, 0.5042, 0.3255,  ..., 0.1320, 0.1236, 0.1770],
        ...,
        [0.0697, 0.3148, 0.3344,  ..., 0.2917, 0.3177, 0.4807],
        [0.1460, 0.5235, 0.4219,  ..., 0.2835, 0.2133, 0.4215],
        [0.3040, 0.3673, 0.2725,  ..., 0.3369, 0.1261, 0.1997]],
       device='cuda:0'), 'flow': tensor([[0.1192, 0.0772, 0.0517,  ..., 0.2946, 0.0405, 0.0159],
        [0.2450, 0.1872, 0.0581,  ..., 0.0316, 0.2836, 0.0223],
        [0.3013, 0.0491, 0.0040,  ..., 0.1725, 0.2293, 0.2051],
        ...,
        [0.4705, 0.1198, 0.4183,  ..., 0.0238, 0.1565, 0.4061],
        [1.0578, 0.1883, 0.4074,  ..., 0.0052, 0.0598, 0.3931],
        [0.6895, 0.1655, 0.3079,  ..., 0.0628, 0.0267, 0.6614]],
       device='cuda:0')}
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [01:02<00:00, 62.90s/it]

I removed --pwc_path because it threw this error: FileNotFoundError: [Errno 2] No such file or directory: './i3d/checkpoints/network-default.pytorch'

I think I might have a clue: the videos are 540x960 at 60 fps, with an average length of about 60 seconds. Is that too much processing to handle?

v-iashin commented 3 years ago

In my case, it takes 7 seconds (32 threads, 2080 Ti).

Yep, I think the bottleneck is the CPU. Overall, if you are renting the machine, save some $$ on the GPU and get more CPU cores and more GPUs (2x P100 > 1x V100) – and check the GPU load (I think it is mostly idle).

Also, try to keep the temp directory (with the extracted frames) on a disk with fast I/O (HDD < SSD < NVMe) – or even the whole video collection.

Anyway, the resolution is not a problem, but the fps is. You will cut the time in half just by specifying --extraction_fps 25, which is the default fps for the original I3D model. So I would recommend using it.
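
For instance, on your file list it would be the same command you used before, just with the fps flag added (I haven't run this exact line, so treat it as a sketch):

python3 main.py --feature_type i3d --device_ids 0 --extraction_fps 25 --file_with_video_paths ../file_paths.txt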

Let me know how it goes.

As a side note, the path was wrong (thanks for the catch):

--pwc_path ./models/i3d/checkpoints/network-default.pytorch
amil-rp-work commented 3 years ago

Cool, got it. I will try to get the CPU and HDD situation sorted, and I will also keep the fps in mind. Just a note: since my videos are 60 fps, I presume lowering the extraction fps to, say, 30 will skip alternate frames.

Also, some newbie questions about the GPU load and the args:

Once I incorporate the CPU, SSD and other changes, I will update you with the results 👍

v-iashin commented 3 years ago

I presume lowering it to say 30 will skip alternate frames.

Yes, you can see how ffmpeg does it by:

ffmpeg -i {video_path} -filter:v fps=fps={extraction_fps} {new_path}

It will save the video at extraction_fps to new_path. Open the video and see for yourself.
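
For example, with made-up file names, re-encoding a 60 fps clip at 25 fps would be:

ffmpeg -i input_60fps.mp4 -filter:v fps=fps=25 input_25fps.mp4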

How did you come about the conclusion it's idle

Well, because a V100 > a 2080 Ti, in my case the load is not at 100%, and my run finishes in 7 seconds while yours takes 60. Therefore, I just made a guess.

How will increasing the number of GPUs reduce the idle time?

There are two parts to the process: extracting/loading frames (done on the CPU) and feature extraction (done on the GPU). If you have two GPUs, you might end up in a situation where one GPU is idle while the CPU works for it (extracting/loading frames), and the other GPU is busy while the CPU is only partially loaded (feature extraction). Since the two parts use different resources, you might benefit from having two GPUs. However, putting the price difference between a V100 and a P100 into a better CPU might actually be a better idea.
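
Just to illustrate the idea – this is a generic sketch, not how video_features is implemented: with two GPUs you can shard the video list and run one worker per device, so the CPU phase of one worker overlaps with the GPU phase of the other:

import multiprocessing as mp

def worker(device_id, video_paths):
    # Each worker alternates between CPU work (decoding frames) and GPU work
    # (running the model on its own device); with two workers, the CPU phase
    # of one can overlap with the GPU phase of the other.
    for path in video_paths:
        print(f'device {device_id}: would extract features for {path}')

if __name__ == '__main__':
    videos = [f'video_{i:04d}.mp4' for i in range(8)]  # dummy file names
    procs = [mp.Process(target=worker, args=(gpu, videos[gpu::2])) for gpu in (0, 1)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()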

You just need to profile your setup a bit and see what takes most of the time. On the video_features master branch I have an updated I3D pipeline where the re-encoded video is saved as an mp4 with a different fps instead of lots of frame files, which might be faster in your case. I do not guarantee the features will be the same as with the BMT commit, but since you are working on something other than a replication of BMT, I think you could try it instead.

What impact is created by reducing/increasing step & stack size?

Usually, reducing these slows extraction down, and increasing them speeds it up. Since you have ~60-second videos, I would go with the default parameters (and 25 fps instead of 60).

amil-rp-work commented 3 years ago

@v-iashin Thanks for such a detailed explanation!!! 👍 I have successfully incorporated your suggestions and extracted both VGGish and I3D features for my videos. Now that the features are extracted, I can't find instructions on the .csv and .json files I need to create. Since each video in my dataset has a single caption, how do I go about creating the training files required, apart from the extracted features?

v-iashin commented 3 years ago

Yep, there aren't, since I regard video_features and BMT as separate projects. Also, you may check how the train and val files are formed. I would start by replacing the start/end fields in the JSON files with something gibberish and seeing whether it throws an error along the way.
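
As a rough starting point, a single-caption JSON in the ActivityNet-Captions-like layout could be generated along these lines. The field names (duration, timestamps, sentences) and the single full-length segment are assumptions here – double-check them against the actual ./data/train.json before training:

import json

# Hypothetical input: one caption per video plus its duration in seconds.
my_videos = [
    {'video_id': 'v_000001', 'duration': 61.2, 'caption': 'A person does something.'},
    {'video_id': 'v_000002', 'duration': 58.7, 'caption': 'Another short description.'},
]

dataset = {}
for v in my_videos:
    dataset[v['video_id']] = {
        'duration': v['duration'],
        # One caption per video -> one segment spanning the whole clip.
        'timestamps': [[0.0, v['duration']]],
        'sentences': [v['caption']],
    }

with open('my_train.json', 'w') as f:
    json.dump(dataset, f)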

I am afraid I cannot provide support for your project, since it is about video captioning while this work is about dense video captioning. Please close the issue if you don't have any more questions on the topic of this issue.

amil-rp-work commented 3 years ago

@v-iashin Thanks a lot for all the help, and sorry for the inconvenience 😅 I will close the issue; I really appreciate your patience and all the help 👍 I hope you won't mind if I end up asking you some questions again!