Open gyani91 opened 6 years ago
@pietern @apaszke I would really appreciate your help. Thanks a lot guys.
We are writing the new distributed backend for Caffe2 and PyTorch. We can make this one of the init_methods.
Two more points worth mentioning: I have applied all the changes from a diff by @pietern, including the changes to the Gloo header files, but that does not resolve the problem.
I have also changed the code in resnet50_trainer.py from:
    num_shards = args.num_shards
    shard_id = args.shard_id
    interfaces = args.distributed_interfaces.split(",")
    # Rendezvous using MPI when run with mpirun
    if os.getenv("OMPI_COMM_WORLD_SIZE") is not None:
        num_shards = int(os.getenv("OMPI_COMM_WORLD_SIZE", 1))
        shard_id = int(os.getenv("OMPI_COMM_WORLD_RANK", 0))
to:
    num_shards = args.num_shards
    shard_id = args.shard_id
    interfaces = args.distributed_interfaces.split(",")
    # Rendezvous using MPI when run with mpirun
    #if os.getenv("OMPI_COMM_WORLD_SIZE") is not None:
    if True:
        #num_shards = int(os.getenv("OMPI_COMM_WORLD_SIZE", 1))
        #shard_id = int(os.getenv("OMPI_COMM_WORLD_RANK", 0))
        shard_id = int(os.getenv("SLURM_PROCID", 0))
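For reference, the same idea can be written without commenting out the MPI branch. This is only a minimal sketch, not the trainer's actual code: the helper name `resolve_shard_info` is hypothetical, and it assumes the job step exports `SLURM_NTASKS` alongside `SLURM_PROCID` (Slurm normally sets both under srun, but this should be checked on the cluster). It is not claimed to fix the timeout reported in this issue.

```python
import os


def resolve_shard_info(default_num_shards=1, default_shard_id=0, env=None):
    """Return (num_shards, shard_id), preferring launcher-provided variables.

    `env` defaults to os.environ; passing a dict makes the helper easy to test.
    """
    env = os.environ if env is None else env

    # Open MPI (mpirun) exports the world size and rank for every process.
    if "OMPI_COMM_WORLD_SIZE" in env:
        return (int(env["OMPI_COMM_WORLD_SIZE"]),
                int(env.get("OMPI_COMM_WORLD_RANK", 0)))

    # Assumption: under srun, SLURM_NTASKS holds the task count and
    # SLURM_PROCID holds this task's rank within the job step.
    if "SLURM_NTASKS" in env:
        return (int(env["SLURM_NTASKS"]),
                int(env.get("SLURM_PROCID", 0)))

    # Neither launcher detected: fall back to the command-line values.
    return default_num_shards, default_shard_id
```

In the trainer this would be called as `num_shards, shard_id = resolve_shard_info(args.num_shards, args.shard_id)`, so that both values come from the same launcher instead of mixing `--num_shards` with `SLURM_PROCID`.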
Before I made this change to the rendezvous code, the error was:
E0817 11:20:18.277760 29415 init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0817 11:20:18.277838 29415 init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0817 11:20:18.277843 29415 init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
INFO:resnet50_trainer:Running on GPUs: [0]
E0817 11:20:18.278333 15794 init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0817 11:20:18.278412 15794 init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0817 11:20:18.278416 15794 init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
INFO:resnet50_trainer:Running on GPUs: [0]
INFO:resnet50_trainer:Using epoch size: 1000
INFO:resnet50_trainer:Using epoch size: 1000
INFO:data_parallel_model:Parallelizing model for devices: [0]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
INFO:data_parallel_model:Parallelizing model for devices: [0]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
INFO:data_parallel_model:Adding gradient operators
INFO:data_parallel_model:Adding gradient operators
E0817 11:20:18.365449 16188 init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0817 11:20:18.365527 16188 init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0817 11:20:18.365532 16188 init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
INFO:resnet50_trainer:Running on GPUs: [0]
INFO:resnet50_trainer:Using epoch size: 1000
E0817 11:20:18.377269 14509 init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0817 11:20:18.377346 14509 init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0817 11:20:18.377351 14509 init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
INFO:resnet50_trainer:Running on GPUs: [0]
INFO:resnet50_trainer:Using epoch size: 1000
INFO:data_parallel_model:Parallelizing model for devices: [0]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
INFO:data_parallel_model:Parallelizing model for devices: [0]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
INFO:data_parallel_model:Add gradient all-reduces for SyncSGD
INFO:data_parallel_model:Add gradient all-reduces for SyncSGD
WARNING:data_parallel_model:Distributed broadcast of computed params is not implemented yet
WARNING:data_parallel_model:Distributed broadcast of computed params is not implemented yet
INFO:data_parallel_model:Post-iteration operators for updating params
INFO:data_parallel_model:Post-iteration operators for updating params
INFO:data_parallel_model:Calling optimizer builder function
INFO:data_parallel_model:Calling optimizer builder function
INFO:data_parallel_model:Adding gradient operators
INFO:data_parallel_model:Adding gradient operators
INFO:data_parallel_model:Add gradient all-reduces for SyncSGD
WARNING:data_parallel_model:Distributed broadcast of computed params is not implemented yet
INFO:data_parallel_model:Add gradient all-reduces for SyncSGD
WARNING:data_parallel_model:Distributed broadcast of computed params is not implemented yet
INFO:data_parallel_model:Add initial parameter sync
INFO:data_parallel_model:Post-iteration operators for updating params
INFO:data_parallel_model:Calling optimizer builder function
INFO:data_parallel_model:Add initial parameter sync
INFO:data_parallel_model:Post-iteration operators for updating params
INFO:data_parallel_model:Calling optimizer builder function
INFO:data_parallel_model:Creating barrier net
INFO:data_parallel_model:Creating barrier net
E0817 11:20:18.563855 15794 operator.cc:496] Shape inference error: [enforce fail at conv_pool_op_base.h:626] in_size + *pad_head + *pad_tail >= dkernel. 2 vs 3
E0817 11:20:18.564237 15794 operator.cc:497] Operator: input: "gpu_0/conv1_spatbn_relu" output: "gpu_0/pool1" name: "" type: "MaxPool" arg { name: "order" s: "NCHW" } arg { name: "kernel" i: 3 } arg { name: "stride" i: 2 } arg { name: "ws_nbytes_limit" i: 67108864 } arg { name: "cudnn_exhaustive_search" i: 1 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "CUDNN"
E0817 11:20:18.564251 15794 operator.cc:498] Returning empty results.
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
E0817 11:20:18.567664 29415 operator.cc:496] Shape inference error: [enforce fail at conv_pool_op_base.h:626] in_size + *pad_head + *pad_tail >= dkernel. 2 vs 3
E0817 11:20:18.568056 29415 operator.cc:497] Operator: input: "gpu_0/conv1_spatbn_relu" output: "gpu_0/pool1" name: "" type: "MaxPool" arg { name: "order" s: "NCHW" } arg { name: "kernel" i: 3 } arg { name: "stride" i: 2 } arg { name: "ws_nbytes_limit" i: 67108864 } arg { name: "cudnn_exhaustive_search" i: 1 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "CUDNN"
E0817 11:20:18.568071 29415 operator.cc:498] Returning empty results.
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Memonger memory optimization took 0.0135490894318 secs
INFO:memonger:Memonger memory optimization took 0.0135049819946 secs
INFO:data_parallel_model:Add initial parameter sync
INFO:data_parallel_model:Add initial parameter sync
INFO:data_parallel_model:Creating barrier net
INFO:data_parallel_model:Creating barrier net
E0817 11:20:18.654834 16188 operator.cc:496] Shape inference error: [enforce fail at conv_pool_op_base.h:626] in_size + *pad_head + *pad_tail >= dkernel. 2 vs 3
E0817 11:20:18.655194 16188 operator.cc:497] Operator: input: "gpu_0/conv1_spatbn_relu" output: "gpu_0/pool1" name: "" type: "MaxPool" arg { name: "order" s: "NCHW" } arg { name: "kernel" i: 3 } arg { name: "stride" i: 2 } arg { name: "ws_nbytes_limit" i: 67108864 } arg { name: "cudnn_exhaustive_search" i: 1 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "CUDNN"
E0817 11:20:18.655208 16188 operator.cc:498] Returning empty results.
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
E0817 11:20:18.663950 14509 operator.cc:496] Shape inference error: [enforce fail at conv_pool_op_base.h:626] in_size + *pad_head + *pad_tail >= dkernel. 2 vs 3
E0817 11:20:18.664294 14509 operator.cc:497] Operator: input: "gpu_0/conv1_spatbn_relu" output: "gpu_0/pool1" name: "" type: "MaxPool" arg { name: "order" s: "NCHW" } arg { name: "kernel" i: 3 } arg { name: "stride" i: 2 } arg { name: "ws_nbytes_limit" i: 67108864 } arg { name: "cudnn_exhaustive_search" i: 1 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "CUDNN"
E0817 11:20:18.664320 14509 operator.cc:498] Returning empty results.
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Memonger memory optimization took 0.0132520198822 secs
INFO:memonger:Memonger memory optimization took 0.0132851600647 secs
E0817 11:20:51.136189 15794 common_world_ops.h:110] Caught store handler timeout exception: [/pytorch/caffe2/distributed/file_store_handler.cc:154] Wait timeout for name(s): allreduce_0_cw_op/1
WARNING:caffe2.python.workspace:Original python traceback for operator `268` in network `resnet50_init` in exception above (most recent call last):
WARNING:caffe2.python.workspace: File "resnet50_trainer.py", line 608, in <module>
WARNING:caffe2.python.workspace: File "resnet50_trainer.py", line 604, in main
WARNING:caffe2.python.workspace: File "resnet50_trainer.py", line 439, in Train
WARNING:caffe2.python.workspace: File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 296, in Parallelize
WARNING:caffe2.python.workspace: File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1215, in _AllReduceBlobs
WARNING:caffe2.python.workspace: File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1362, in _AllReduceBlobsDistributed
WARNING:caffe2.python.workspace: File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1346, in allreduce
WARNING:caffe2.python.workspace: File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1296, in get_control_and_context
WARNING:caffe2.python.workspace: File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1814, in _CreateOrCloneCommonWorld
Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.
Traceback (most recent call last):
File "resnet50_trainer.py", line 608, in <module>
main()
File "resnet50_trainer.py", line 604, in main
Train(args)
File "resnet50_trainer.py", line 444, in Train
workspace.RunNetOnce(train_model.param_init_net)
File "/usr/local/lib/python2.7/dist-packages/caffe2/python/workspace.py", line 201, in RunNetOnce
StringifyProto(net),
File "/usr/local/lib/python2.7/dist-packages/caffe2/python/workspace.py", line 180, in CallWithExceptionIntercept
return func(*args, **kwargs)
RuntimeError: [/pytorch/caffe2/distributed/file_store_handler.cc:154] Wait timeout for name(s): allreduce_0_cw_op/1
E0817 11:20:51.183814 29415 common_world_ops.h:110] Caught store handler timeout exception: [/pytorch/caffe2/distributed/file_store_handler.cc:154] Wait timeout for name(s): allreduce_0_cw_op/1
WARNING:caffe2.python.workspace:Original python traceback for operator `268` in network `resnet50_init` in exception above (most recent call last):
WARNING:caffe2.python.workspace: File "resnet50_trainer.py", line 608, in <module>
WARNING:caffe2.python.workspace: File "resnet50_trainer.py", line 604, in main
WARNING:caffe2.python.workspace: File "resnet50_trainer.py", line 439, in Train
WARNING:caffe2.python.workspace: File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 296, in Parallelize
WARNING:caffe2.python.workspace: File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1215, in _AllReduceBlobs
WARNING:caffe2.python.workspace: File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1362, in _AllReduceBlobsDistributed
WARNING:caffe2.python.workspace: File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1346, in allreduce
WARNING:caffe2.python.workspace: File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1296, in get_control_and_context
WARNING:caffe2.python.workspace: File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1814, in _CreateOrCloneCommonWorld
Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.
Traceback (most recent call last):
File "resnet50_trainer.py", line 608, in <module>
main()
File "resnet50_trainer.py", line 604, in main
Train(args)
File "resnet50_trainer.py", line 444, in Train
workspace.RunNetOnce(train_model.param_init_net)
File "/usr/local/lib/python2.7/dist-packages/caffe2/python/workspace.py", line 201, in RunNetOnce
StringifyProto(net),
File "/usr/local/lib/python2.7/dist-packages/caffe2/python/workspace.py", line 180, in CallWithExceptionIntercept
return func(*args, **kwargs)
RuntimeError: [/pytorch/caffe2/distributed/file_store_handler.cc:154] Wait timeout for name(s): allreduce_0_cw_op/1
E0817 11:20:51.454591 14509 common_world_ops.h:110] Caught store handler timeout exception: [/pytorch/caffe2/distributed/file_store_handler.cc:154] Wait timeout for name(s): allreduce_0_cw_op/1
WARNING:caffe2.python.workspace:Original python traceback for operator `268` in network `resnet50_init` in exception above (most recent call last):
WARNING:caffe2.python.workspace: File "resnet50_trainer.py", line 608, in <module>
WARNING:caffe2.python.workspace: File "resnet50_trainer.py", line 604, in main
WARNING:caffe2.python.workspace: File "resnet50_trainer.py", line 439, in Train
WARNING:caffe2.python.workspace: File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 296, in Parallelize
WARNING:caffe2.python.workspace: File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1215, in _AllReduceBlobs
WARNING:caffe2.python.workspace: File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1362, in _AllReduceBlobsDistributed
WARNING:caffe2.python.workspace: File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1346, in allreduce
WARNING:caffe2.python.workspace: File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1296, in get_control_and_context
WARNING:caffe2.python.workspace: File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1814, in _CreateOrCloneCommonWorld
Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.
Traceback (most recent call last):
File "resnet50_trainer.py", line 608, in <module>
main()
File "resnet50_trainer.py", line 604, in main
Train(args)
File "resnet50_trainer.py", line 444, in Train
E0817 11:20:51.455612 16188 common_world_ops.h:110] Caught store handler timeout exception: [/pytorch/caffe2/distributed/file_store_handler.cc:154] Wait timeout for name(s): allreduce_0_cw_op/1
workspace.RunNetOnce(train_model.param_init_net)
File "/usr/local/lib/python2.7/dist-packages/caffe2/python/workspace.py", line 201, in RunNetOnce
StringifyProto(net),
File "/usr/local/lib/python2.7/dist-packages/caffe2/python/workspace.py", line 180, in CallWithExceptionIntercept
return func(*args, **kwargs)
RuntimeError: [/pytorch/caffe2/distributed/file_store_handler.cc:154] Wait timeout for name(s): allreduce_0_cw_op/1
WARNING:caffe2.python.workspace:Original python traceback for operator `268` in network `resnet50_init` in exception above (most recent call last):
WARNING:caffe2.python.workspace: File "resnet50_trainer.py", line 608, in <module>
WARNING:caffe2.python.workspace: File "resnet50_trainer.py", line 604, in main
WARNING:caffe2.python.workspace: File "resnet50_trainer.py", line 439, in Train
WARNING:caffe2.python.workspace: File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 296, in Parallelize
WARNING:caffe2.python.workspace: File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1215, in _AllReduceBlobs
WARNING:caffe2.python.workspace: File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1362, in _AllReduceBlobsDistributed
WARNING:caffe2.python.workspace: File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1346, in allreduce
WARNING:caffe2.python.workspace: File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1296, in get_control_and_context
WARNING:caffe2.python.workspace: File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1814, in _CreateOrCloneCommonWorld
Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.
Traceback (most recent call last):
File "resnet50_trainer.py", line 608, in <module>
main()
File "resnet50_trainer.py", line 604, in main
Train(args)
File "resnet50_trainer.py", line 444, in Train
workspace.RunNetOnce(train_model.param_init_net)
File "/usr/local/lib/python2.7/dist-packages/caffe2/python/workspace.py", line 201, in RunNetOnce
StringifyProto(net),
File "/usr/local/lib/python2.7/dist-packages/caffe2/python/workspace.py", line 180, in CallWithExceptionIntercept
return func(*args, **kwargs)
RuntimeError: [/pytorch/caffe2/distributed/file_store_handler.cc:154] Wait timeout for name(s): allreduce_0_cw_op/1
srun: error: nid02966: task 3: Exited with exit code 1
srun: Terminating job step 9072007.0
srun: error: nid02963: task 0: Exited with exit code 1
srun: error: nid02965: task 2: Exited with exit code 1
srun: error: nid02964: task 1: Exited with exit code 1
Issue description
Unable to use MPI rendezvous in Caffe2.
I understand that this information may not be sufficient to diagnose the problem, so please tell me which steps I should run to gather more information.
I am grateful for your help.
Code example
Details: For reproducibility, I am using a container built from the following Dockerfile:
The command:
The output/error:
System Info