Hi, thank you for the warm words!
--model transformer (which by default is av_transformer). Here is the script we used for the ablation study (use it as guidance because I don't expect it to work without some errors; I refactored the code a bit after that, so some arguments' names might be different, etc.).
Pre-training the captioning module
$env_python scripts/train_captioning_module.py \
--train_meta_path ./data/train_meta.csv \
--val_1_meta_path ./data/val_1_meta.csv \
--val_2_meta_path ./data/val_2_meta.csv \
--video_feature_name i3d \
--video_features_path ./data/i3d_25fps_stack64step64_2stream_npy \
--log_dir ./logs/ablation/V_only_cap \
--d_vid 1024 \
--d_model_video 1024 \
--d_model_caps 300 \
--d_model 1024 \
--model transformer \
--modality video \
--optimizer adam \
--dout_p 0.1 \
--N 2 \
--smoothing 0.7 \
--lr 5e-5 \
--B 32 \
--one_by_one_starts_at 1 \
--betas 0.9 0.999 \
--H 4 \
--word_emb_caps glove.840B.300d \
--device_ids 0
Training the proposal generator with the pre-trained encoder from the captioning module
$env_python scripts/train_proposal_generator.py \
--train_json_path ./data/train.json \
--train_meta_path ./data/train_meta.csv \
--val_1_meta_path ./data/val_1_meta.csv \
--val_2_meta_path ./data/val_2_meta.csv \
--log_dir ./logs/ablation/V_only_with_pretr_on_cap_V \
--video_feature_name i3d \
--video_features_path ./data/i3d_25fps_stack64step64_2stream_npy \
--feature_timespan_in_fps 64 \
--fps_at_extraction 25 \
--modality video \
--pretrained_cap_model_path ./logs/ablation/V_only_cap/best_model.pt \
--loc_model transformer \
--d_vid 1024 \
--d_model_caps 1024 \
--d_model 1024 \
--dout_p 0.1 \
--noobj_coeff 100 \
--smoothing 0.7 \
--epoch_num 70 \
--obj_coeff 1 \
--one_by_one_starts_at 5000 \
--N 2 \
--B 16 \
--inf_B_coeff 2 \
--lr 5e-5 \
--pad_video_feats_up_to 300 \
--max_prop_per_vid 100 \
--optimizer adam \
--betas 0.9 0.999 \
--anchors_num_video 128 \
--kernel_sizes_video 1 5 9 13 19 25 35 45 61 79 \
--conv_layers_video 512 512 \
--device_ids 0
Thanks for the detailed inputs @v-iashin! For your points:
You should mind the domain shift between TikTok videos and the videos from ActivityNet Captions. So, I don't think it is related to non-English audio tracks. And, yes, it is not 100% accurate, nor is any other model. With this being said, I have no evidence to believe that it would work any better without the audio stream in your case. At the same time, I would suggest looking for a captioned dataset that is closer to the one you want to apply the model to.
Got it @v-iashin, thanks a lot for your inputs! You have been really helpful! Are you aware of a similar dataset in this domain?
Nope, let me know if you find/collect one.
Closing this for now.
Hey @v-iashin, as I explained my problem earlier, I now have a dataset with about 7000 videos and a sentence for each video as its caption. I am planning to train the model on my dataset, for which I have the following doubts:
From the README.md, I observed we can separately train the captioning module and the proposal generator. In my dataset, the proposal is the entire video; how do I go about training it?
Hi,
Great!
I am currently using the i3d feature extraction pipeline on a ~10k-video dataset. However, it seems to be extremely slow.
# command I used
python3 main.py --feature_type i3d --device_ids 0 --file_with_video_paths ../file_paths.txt
0%| | 2/9970 [12:30<1186:19:09, 428.45s/it]
The following is what I get from nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06 Driver Version: 450.51.06 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000001:00:00.0 Off | 0 |
| N/A 32C P0 116W / 250W | 8471MiB / 16160MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Any tips on how I can speed this up?
Hm. Well, it could be slow but not this slow.
Can you try adding prints to the code to see what takes most of the time? How many CPU cores do you have, and what is their load?
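If it helps, here is a minimal timing sketch of the kind of prints I mean; extract_frames and run_i3d are placeholders for whatever the corresponding calls are in your copy of the extraction loop:
import time

def timed(label, fn, *args, **kwargs):
    # run fn, print how long it took, and return its result
    start = time.time()
    result = fn(*args, **kwargs)
    print(f'{label}: {time.time() - start:.1f} s')
    return result

# inside the per-video loop, something like:
# frames = timed('frame extraction (CPU/ffmpeg)', extract_frames, video_path)
# feats  = timed('I3D forward pass (GPU)', run_i3d, frames)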
How long does it take to run examples from the repo (video features)? For instance:
cd video_features
# make sure to be on a specific commit (4fa02bd5c5b8c34081dcfb609e2bcd5a973eaab2)
(i3d) python main.py --feature_type i3d --device_ids 0 --extraction_fps 25 --stack_size 24 --step_size 24 --pwc_path ./i3d/checkpoints/network-default.pytorch --video_paths ./sample/v_ZNVhz7ctTq0.mp4
How long does it take?
Some of my CPU stats:
CPU(s): 6
On-line CPU(s) list: 0-5
Thread(s) per core: 1
Socket(s): 1
NUMA node(s): 1
Model name: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
NUMA node0 CPU(s): 0-5
For the sample video, it took me A LOT of time; for about 9000 videos this is approx. 9000 minutes, ~6 days?
python3 main.py --feature_type i3d --device_ids 0 --extraction_fps 25 --stack_size 24 --step_size 24 --video_paths ./sample/v_ZNVhz7ctTq0.mp4./output
0%| | 0/1 [00:00<?, ?it/s]{'type': 'separately_rgb_flow', 'rgb': tensor([[0.0541, 0.4469, 0.2189, ..., 0.0137, 0.0157, 0.3653],
[0.0832, 0.4049, 0.1971, ..., 0.0381, 0.0672, 0.0771],
[0.0514, 0.5042, 0.3255, ..., 0.1320, 0.1236, 0.1770],
...,
[0.0697, 0.3148, 0.3344, ..., 0.2917, 0.3177, 0.4807],
[0.1460, 0.5235, 0.4219, ..., 0.2835, 0.2133, 0.4215],
[0.3040, 0.3673, 0.2725, ..., 0.3369, 0.1261, 0.1997]],
device='cuda:0'), 'flow': tensor([[0.1192, 0.0772, 0.0517, ..., 0.2946, 0.0405, 0.0159],
[0.2450, 0.1872, 0.0581, ..., 0.0316, 0.2836, 0.0223],
[0.3013, 0.0491, 0.0040, ..., 0.1725, 0.2293, 0.2051],
...,
[0.4705, 0.1198, 0.4183, ..., 0.0238, 0.1565, 0.4061],
[1.0578, 0.1883, 0.4074, ..., 0.0052, 0.0598, 0.3931],
[0.6895, 0.1655, 0.3079, ..., 0.0628, 0.0267, 0.6614]],
device='cuda:0')}
100%|██████████████████████████████████████████████████████████████████████████████████████| 1/1 [01:02<00:00, 62.90s/it]
I removed the --pwc_path
because it threw the error FileNotFoundError: [Errno 2] No such file or directory: './i3d/checkpoints/network-default.pytorch'
I think I might have a clue: the videos have a resolution of 540x960, are at 60 fps, and have an average length of 60 seconds. Is that too much processing to handle?
In my case, it takes 7 seconds (32 threads, 2080ti).
Yep, I think the bottleneck is the CPU. Overall, if you are renting the machine, save some $$ on the GPU and add more cores and more GPUs (2xP100 > 1xV100); check the GPU load (I think it is mostly idle).
Also, try to have the temp directory (with frames) on a disk with fast I/O (hdd < ssd < nvme), or even the whole video collection.
Anyway, the resolution is not a problem, but the fps is. You will cut the time in half just by specifying --extraction_fps 25, which is the default fps for the original I3D model. So, I would recommend using it.
Let me know how it goes.
As a side note, the path was wrong (thanks for the catch):
--pwc_path ./models/i3d/checkpoints/network-default.pytorch
Cool, got it. I will try to get my CPU and HDD sorted and will keep the fps in mind. Just a note: since my videos are 60 fps, I presume lowering the extraction fps to, say, 30 will skip alternate frames.
Also, some newbie questions about the GPU load and args:
Once I incorporate the CPU, SSD, and other changes, I will update you with the results.
I presume lowering it to say 30 will skip alternate frames.
Yes, you can see how ffmpeg does it by:
ffmpeg -i {video_path} -filter:v fps=fps={extraction_fps} {new_path}
It will save the video with the new extraction_fps into new_path. Open the video and see for yourself.
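If you prefer to check it programmatically instead of opening the file, here is a quick sketch (assuming ffprobe is installed; the file path is just an example):
import subprocess

def get_fps(video_path):
    # ask ffprobe for the video stream's frame rate, returned as e.g. '25/1'
    out = subprocess.run(
        ['ffprobe', '-v', 'error', '-select_streams', 'v:0',
         '-show_entries', 'stream=r_frame_rate',
         '-of', 'default=noprint_wrappers=1:nokey=1', video_path],
        capture_output=True, text=True, check=True).stdout.strip()
    num, den = map(int, out.split('/'))
    return num / den

print(get_fps('./sample/v_ZNVhz7ctTq0_25fps.mp4'))  # expect 25.0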
How did you come to the conclusion that it's idle?
Well, because a V100 > 2080Ti, and in my case the load is not at 100% and it executes in 7 seconds while yours takes 60. Therefore, I just made a guess.
How will increasing the number of GPUs reduce the idle time?
There are two parts to the process: extracting/loading frames (done on the CPU) and feature extraction (done on the GPU). If you have two GPUs, you might end up in situations where one GPU is idle while the CPU is working (extracting/loading frames) and the second GPU is working while the CPU is only partially loaded (feature extraction). So, since they do different parts, you might benefit from having two GPUs. However, reinvesting the price difference between a V100 and a P100 into a better CPU might actually be a better idea.
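For example, one rough way to use two GPUs is to split the list of video paths and launch one extraction process per device (flags as in your command above; the paths and GPU ids are only examples, and whether this helps depends on how loaded the CPU already is):
import subprocess

# split the list of videos into one chunk per GPU
with open('../file_paths.txt') as f:
    paths = [line.strip() for line in f if line.strip()]

gpu_ids = [0, 1]  # example: two GPUs
chunks = [paths[i::len(gpu_ids)] for i in range(len(gpu_ids))]

procs = []
for gpu, chunk in zip(gpu_ids, chunks):
    chunk_file = f'../file_paths_gpu{gpu}.txt'
    with open(chunk_file, 'w') as f:
        f.write('\n'.join(chunk))
    # one extraction process per GPU
    procs.append(subprocess.Popen(
        ['python3', 'main.py', '--feature_type', 'i3d',
         '--device_ids', str(gpu), '--extraction_fps', '25',
         '--file_with_video_paths', chunk_file]))

for p in procs:
    p.wait()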
You just need to profile your setup a bit and see what takes most of the time. On the video_features master branch, I have an updated I3D such that the re-encoded video is saved as an mp4 with a different fps instead of lots of frame files, which might be faster in your case. I do not guarantee the same features as with the commit pinned for BMT, but since you are working on something else rather than a replication of BMT, I think you could try it instead.
What impact does reducing/increasing the step & stack size have?
Usually, reducing these will reduce speed (more stacks per video) and vice versa. Since you have ~60-second videos, I would go with the default parameters (and 25 fps instead of 60).
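As a back-of-the-envelope illustration (my numbers, assuming stack_size = step_size = 64 as in the BMT features, and ignoring the decoding/optical-flow cost, which also grows with the frame count):
def num_stacks(duration_s, fps, stack_size=64, step_size=64):
    # number of I3D forward passes for one video
    frames = int(duration_s * fps)
    return max(0, (frames - stack_size) // step_size + 1)

print(num_stacks(60, 60))  # 56 stacks at the native 60 fps
print(num_stacks(60, 25))  # 23 stacks with --extraction_fps 25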
@v-iashin Thanks for such a detailed explanation!!!
I have successfully incorporated your suggestions and extracted both VGGish & I3D features for my videos.
Now that I have extracted the features, there aren't instructions about the .csv & .json files I need to create. Since each video in my dataset has a single caption, how do I go about creating the required training files, apart from the extracted features?
Yep, there aren't, since I regard video_features and BMT as separate projects. Also, you may check how the train and val files are formed. I would start by trying to replace the start/end fields in the JSON files with something gibberish and seeing whether it throws an error along the way.
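For instance, since your proposal is the whole video, here is a rough sketch of building a train JSON in the same spirit as ./data/train.json (ActivityNet-Captions-like fields; the exact field names are my assumption, so compare against the files shipped with the repo, and mirror ./data/train_meta.csv for the CSV part):
import json

# one caption per video, plus the video duration in seconds (example entries)
my_dataset = [
    {'video_id': 'v_abc123', 'duration': 61.4, 'caption': 'A person cooks pasta.'},
    # ...
]

train = {}
for item in my_dataset:
    train[item['video_id']] = {
        'duration': item['duration'],
        'timestamps': [[0.0, item['duration']]],  # the single event spans the whole video
        'sentences': [item['caption']],
    }

with open('./data/my_train.json', 'w') as f:
    json.dump(train, f)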
I am afraid I cannot provide support for your project because it is about plain video captioning, while this work is about dense video captioning. Please close the issue if you don't have any more questions on the topic of this issue.
@v-iashin Thanks a lot for all the help, and sorry for the inconvenience. I will close the issue; I really appreciate your patience and all the help. I hope you won't mind if I end up asking you some questions again!
Hey @v-iashin, thanks for open-sourcing such awesome work!!! Kudos to you for this and MDVC. I was wondering: since my videos are not in English but I require captions in English, how do I go about utilizing your work here? Is there a possibility to totally ignore the audio features and use just the visual features?