Open Davidyao99 opened 2 years ago
Hi @Davidyao99, I guess you should use `python -m torch.distributed.launch --nproc_per_node=1` for 1 GPU instead of `python -m torch.distributed.launch --nproc_per_node=4`. If that does not fix it, posting more of the log here would help to solve the problem. Good luck~
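To illustrate the suggestion above, here is a minimal sketch that matches `--nproc_per_node` to the number of visible GPUs. The helper names `count_gpus` and `build_launch_command` are hypothetical and not part of UniVL; the script name and flag in the usage part are taken from the command in this thread.

```python
import sys


def count_gpus():
    """Return the number of visible CUDA devices, or 0 if PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


def build_launch_command(script, script_args, num_gpus):
    """Build a torch.distributed.launch command with one process per GPU.

    Hypothetical helper: falls back to a single process when no GPU is
    reported, so the launcher never receives --nproc_per_node=0.
    """
    nproc = max(1, num_gpus)
    return [
        sys.executable,
        "-m",
        "torch.distributed.launch",
        f"--nproc_per_node={nproc}",
        script,
        *script_args,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    cmd = build_launch_command("main_task_caption.py", ["--do_train"], count_gpus())
    print(" ".join(cmd))
```

On a laptop with a single GPU this should produce a command with `--nproc_per_node=1`, which avoids starting ranks 1 to 3 that have no device to run on.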
Thank you for responding! I ran the command with --nproc_per_node=1 and received the following error:
Traceback (most recent call last):
  File "main_task_caption.py", line 24, in <module>
    torch.distributed.init_process_group(backend="nccl")
  File "/mnt/c/users/dyao/documents/research/UniVL/univl/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/mnt/c/users/dyao/documents/research/UniVL/univl/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled system error, NCCL version 2.7.8
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/mnt/c/users/dyao/documents/research/UniVL/univl/lib/python3.8/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/mnt/c/users/dyao/documents/research/UniVL/univl/lib/python3.8/site-packages/torch/distributed/launch.py", line 255, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/mnt/c/users/dyao/documents/research/UniVL/univl/bin/python', '-u', 'main_task_caption.py', '--local_rank=0', '--do_train', '--num_thread_reader=4', '--epochs=5', '--batch_size=128', '--n_display=100', '--train_csv', 'data/msrvtt/MSRVTT_train.9k.csv', '--val_csv', 'data/msrvtt/MSRVTT_JSFUSION_test.csv', '--data_path', 'data/msrvtt/MSRVTT_data.json', '--features_path', 'data/msrvtt/msrvtt_videos_features.pickle', '--output_dir', '/ckpt_msrvtt_caption', '--bert_model', 'bert-base-uncased', '--do_lower_case', '--lr', '3e-5', '--max_words', '48', '--max_frames', '48', '--batch_size_val', '32', '--visual_num_hidden_layers', '6', '--decoder_num_hidden_layers', '3', '--datatype', 'msrvtt', '--stage_two', '--init_model', 'weight/univl.pretrained.bin']' returned non-zero exit status 1.
Thank you for your time and help! Sorry, I am not familiar with PyTorch distributed training.
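Since "unhandled system error" says little on its own, one way to get the extra logs asked for earlier is NCCL's `NCCL_DEBUG=INFO` environment variable. The following is only a sketch of how it could be passed to the launcher; the helper name `env_with_nccl_debug` is hypothetical and not part of UniVL or PyTorch.

```python
import os


def env_with_nccl_debug(base_env=None):
    """Return a copy of the environment with verbose NCCL logging enabled.

    Hypothetical helper: copies the given mapping (or os.environ) so the
    caller's environment is left unchanged.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"  # NCCL prints initialization and error details
    return env
```

The returned mapping can be passed as `env=` to `subprocess.run` together with the launch command, or the variable can simply be exported in the shell before running `python -m torch.distributed.launch`.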
Hi @Davidyao99, what is your whole command?
I followed the steps to download all the necessary dependencies and data needed to run the code. When I run the code, this error is thrown:
in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['<path to python executable>', '-u', 'main_task_caption.py', '--local_rank=3', '--do_train', '--num_thread_reader=4', '--epochs=5', '--batch_size=128', '--n_display=100', '--train_csv', 'data/msrvtt/MSRVTT_train.9k.csv', '--val_csv', 'data/msrvtt/MSRVTT_JSFUSION_test.csv', '--data_path', 'data/msrvtt/MSRVTT_data.json', '--features_path', 'data/msrvtt/msrvtt_videos_features.pickle', '--output_dir', 'ckpts/ckpt_msrvtt_caption', '--bert_model', 'bert-base-uncased', '--do_lower_case', '--lr', '3e-5', '--max_words', '48', '--max_frames', '48', '--batch_size_val', '32', '--visual_num_hidden_layers', '6', '--decoder_num_hidden_layers', '3', '--datatype', 'msrvtt', '--stage_two', '--init_model', 'weight/univl.pretrained.bin']' returned non-zero exit status 1.
There is only 1 GPU on my laptop, so I am not sure whether this is causing the issue. I just wanted to try out the video captioning capability of this model. Thank you!