microsoft / UniVL

An official implementation for "UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation"
https://arxiv.org/abs/2002.06353
MIT License

Unable to run video captioning code #34

Open Davidyao99 opened 2 years ago

Davidyao99 commented 2 years ago

I followed the steps to download all the necessary dependencies and data needed to run the code. When running it, this error is thrown:

```
in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['<path to python executable>', '-u', 'main_task_caption.py', '--local_rank=3', '--do_train', '--num_thread_reader=4', '--epochs=5', '--batch_size=128', '--n_display=100', '--train_csv', 'data/msrvtt/MSRVTT_train.9k.csv', '--val_csv', 'data/msrvtt/MSRVTT_JSFUSION_test.csv', '--data_path', 'data/msrvtt/MSRVTT_data.json', '--features_path', 'data/msrvtt/msrvtt_videos_features.pickle', '--output_dir', 'ckpts/ckpt_msrvtt_caption', '--bert_model', 'bert-base-uncased', '--do_lower_case', '--lr', '3e-5', '--max_words', '48', '--max_frames', '48', '--batch_size_val', '32', '--visual_num_hidden_layers', '6', '--decoder_num_hidden_layers', '3', '--datatype', 'msrvtt', '--stage_two', '--init_model', 'weight/univl.pretrained.bin']' returned non-zero exit status 1.
```

My laptop has only 1 GPU, so I am not sure if that is causing the issue. I just wanted to try out the video captioning capability of this model. Thank you!

ArrowLuo commented 2 years ago

Hi @Davidyao99, I guess you should use `python -m torch.distributed.launch --nproc_per_node=1` for 1 GPU instead of `python -m torch.distributed.launch --nproc_per_node=4`. If it still does not work after that, posting more of the log here will help to solve the problem. Good luck~
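
For reference, the full single-GPU launch would look something like the sketch below, reconstructed from the flags visible in the error message (paths may differ on your machine):

```bash
python -m torch.distributed.launch --nproc_per_node=1 \
    main_task_caption.py \
    --do_train --num_thread_reader=4 --epochs=5 \
    --batch_size=128 --n_display=100 \
    --train_csv data/msrvtt/MSRVTT_train.9k.csv \
    --val_csv data/msrvtt/MSRVTT_JSFUSION_test.csv \
    --data_path data/msrvtt/MSRVTT_data.json \
    --features_path data/msrvtt/msrvtt_videos_features.pickle \
    --output_dir ckpts/ckpt_msrvtt_caption \
    --bert_model bert-base-uncased --do_lower_case \
    --lr 3e-5 --max_words 48 --max_frames 48 --batch_size_val 32 \
    --visual_num_hidden_layers 6 --decoder_num_hidden_layers 3 \
    --datatype msrvtt --stage_two \
    --init_model weight/univl.pretrained.bin
```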

Davidyao99 commented 2 years ago

Thank you for responding! I ran the command with --nproc_per_node=1 and received the following error:

```
Traceback (most recent call last):
  File "main_task_caption.py", line 24, in <module>
    torch.distributed.init_process_group(backend="nccl")
  File "/mnt/c/users/dyao/documents/research/UniVL/univl/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/mnt/c/users/dyao/documents/research/UniVL/univl/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled system error, NCCL version 2.7.8

Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/mnt/c/users/dyao/documents/research/UniVL/univl/lib/python3.8/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/mnt/c/users/dyao/documents/research/UniVL/univl/lib/python3.8/site-packages/torch/distributed/launch.py", line 255, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/mnt/c/users/dyao/documents/research/UniVL/univl/bin/python', '-u', 'main_task_caption.py', '--local_rank=0', '--do_train', '--num_thread_reader=4', '--epochs=5', '--batch_size=128', '--n_display=100', '--train_csv', 'data/msrvtt/MSRVTT_train.9k.csv', '--val_csv', 'data/msrvtt/MSRVTT_JSFUSION_test.csv', '--data_path', 'data/msrvtt/MSRVTT_data.json', '--features_path', 'data/msrvtt/msrvtt_videos_features.pickle', '--output_dir', '/ckpt_msrvtt_caption', '--bert_model', 'bert-base-uncased', '--do_lower_case', '--lr', '3e-5', '--max_words', '48', '--max_frames', '48', '--batch_size_val', '32', '--visual_num_hidden_layers', '6', '--decoder_num_hidden_layers', '3', '--datatype', 'msrvtt', '--stage_two', '--init_model', 'weight/univl.pretrained.bin']' returned non-zero exit status 1.
```
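
To narrow this down, the failing calls can be reproduced outside of UniVL with a minimal sketch like the one below (`check_dist.py` is a hypothetical file name; it assumes the same `torch.distributed.launch` workflow and only exercises the `init_process_group`/`barrier` calls that fail at line 24 of `main_task_caption.py`):

```python
# check_dist.py -- minimal distributed-init check (hypothetical helper, not part of UniVL)
import argparse

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank to each worker process
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)   # bind this process to its GPU
dist.init_process_group(backend="nccl")  # same call that fails in main_task_caption.py
dist.barrier()                           # same collective that raises the NCCL error
print(f"rank {dist.get_rank()} of {dist.get_world_size()} initialized OK")
dist.destroy_process_group()
```

Running it with `python -m torch.distributed.launch --nproc_per_node=1 check_dist.py`, and setting `NCCL_DEBUG=INFO` in the environment, may surface more detail than the generic "unhandled system error".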

Thank you for your time and help! Sorry, I am not familiar with PyTorch distributed training.

ArrowLuo commented 2 years ago

Hi @Davidyao99, what is your whole command?