Dear authors,
Thanks for your excellent work.
These days I have been trying to run your code from GitHub. I followed EXPERIMENTS.md and successfully ran the 2-GPU data-parallelism mode with nvidia-docker.
However, when I try to run the PipeDream mode (configuration file: vgg16_2pipedream.yml), errors occur. Both worker processes fail with the same traceback, followed by the launcher error:
Traceback (most recent call last):
  File "main_with_runtime.py", line 584, in <module>
    main()
  File "main_with_runtime.py", line 191, in main
    enable_recompute=args.recompute)
  File "../runtime.py", line 64, in __init__
    master_addr, rank, local_rank, num_ranks_in_server)
  File "../runtime.py", line 190, in initialize
    backend=self.distributed_backend)
  File "../communication.py", line 42, in __init__
    dist.init_process_group(backend, rank=rank, world_size=world_size)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 370, in init_process_group
    timeout=timeout)
RuntimeError: Resource temporarily unavailable

Traceback (most recent call last):
  File "main_with_runtime.py", line 584, in <module>
    main()
  File "main_with_runtime.py", line 191, in main
    enable_recompute=args.recompute)
  File "../runtime.py", line 64, in __init__
    master_addr, rank, local_rank, num_ranks_in_server)
  File "../runtime.py", line 190, in initialize
    backend=self.distributed_backend)
  File "../communication.py", line 42, in __init__
    dist.init_process_group(backend, rank=rank, world_size=world_size)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 370, in init_process_group
    timeout=timeout)
RuntimeError: Resource temporarily unavailable

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/my_data/pipedream/runtime/image_classification/launch.py", line 173, in <module>
    main()
  File "/my_data/pipedream/runtime/image_classification/launch.py", line 169, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', 'main_with_runtime.py', '--rank=1', '--local_rank=1', '--data_dir', '/my_data/pipedream/dataset/imagenet/', '--master_addr', 'localhost', '--module', 'models.vgg16.gpus=2', '--checkpoint_dir', '/my_data/pipedream/output/2021-01-04T21:12:20', '--distributed_backend', 'gloo', '-b', '8', '--lr', '0.010000', '--lr_policy', 'polynomial', '--weight-decay', '0.000500', '--epochs', '60', '--print-freq', '10', '--verbose', '10', '--num_ranks_in_server', '2', '--config_path', 'models/vgg16/gpus=2/hybrid_conf.json']' returned non-zero exit status 1.
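For reference, this is a minimal sketch of the initialization step that the traceback points to, distilled from communication.py line 42. The helper name and the master address/port defaults below are placeholders I chose for the sketch, not the repository's actual values:

import os
import torch.distributed as dist

def init_gloo_group(rank, world_size, master_addr="localhost", master_port=29500):
    # Rendezvous settings read by the default env:// init method.
    # launch.py passes --master_addr localhost in my run; the port here is only a placeholder.
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = str(master_port)
    # This is the call that raises "RuntimeError: Resource temporarily unavailable"
    # on both ranks when I launch the pipelined configuration.
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)

As far as I understand, launch.py starts one main_with_runtime.py process per rank (e.g. --rank=1 --local_rank=1 in the command above), so two such calls with world_size=2 run inside the same container.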
I also tried disabling the launch_single_container mode, but the "Resource temporarily unavailable" error occurred again.
Could you help me check whether I am doing anything wrong? Is any special hardware support required, such as NVLink?
Thanks for your help.