On a single node, training works with "python ../caffe2/python/examples/resnet50_trainer.py --train_data=/work/DL/dataset/cifar10/cifar10_train_lmdb/ --num_shards=1 --shard_id=0 --run_id=10 --file_store_path=/work/caffe2/". The /work directory is shared over NFS.
But when I run on two nodes, I get an error.
node1: python ../caffe2/python/examples/resnet50_trainer.py --train_data=/work/DL/dataset/cifar10/cifar10_train_lmdb/ --num_shards=2 --shard_id=0 --run_id=10 --file_store_path=/work/caffe2/
node2: python ../caffe2/python/examples/resnet50_trainer.py --train_data=/work/DL/dataset/cifar10/cifar10_train_lmdb/ --num_shards=2 --shard_id=1 --run_id=10 --file_store_path=/work/caffe2/
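
To make it clear what differs between the two nodes, here is a small sketch of how the per-node command lines are put together. The helper name build_command and the TRAINER and TRAIN_DATA constants are only for illustration and are not part of Caffe2; the flags and paths are the ones from the commands above. Only --shard_id changes from node to node, while --num_shards, --run_id and --file_store_path stay the same.

```python
# Illustrative helper (not part of Caffe2): builds the trainer command for one node.
TRAINER = "../caffe2/python/examples/resnet50_trainer.py"
TRAIN_DATA = "/work/DL/dataset/cifar10/cifar10_train_lmdb/"
FILE_STORE_PATH = "/work/caffe2/"  # on the NFS share, visible to every node


def build_command(num_shards, shard_id, run_id=10):
    """Return the command line for the node with the given shard_id."""
    if not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must be in the range [0, num_shards)")
    args = [
        "python",
        TRAINER,
        "--train_data={}".format(TRAIN_DATA),
        "--num_shards={}".format(num_shards),
        "--shard_id={}".format(shard_id),
        "--run_id={}".format(run_id),
        "--file_store_path={}".format(FILE_STORE_PATH),
    ]
    return " ".join(args)


if __name__ == "__main__":
    # Print the commands for a two-node run, one per node.
    for shard_id in range(2):
        print("node{}: {}".format(shard_id + 1, build_command(2, shard_id)))
```

This only generates the command strings. It does not launch anything and does not change the error below.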
The outputs are as follows:
RuntimeError: [enforce fail at no_default_engine_op.h:45] . The operator CreateCommonWorld does not have a default engine implementation. Please specify an engine explicitly for this operator. Error from operator:
input: "store_handler" output: "allreduce_0_cw" name: "allreduce_0_cw_op" type: "CreateCommonWorld" arg { name: "timeout_ms" i: 30000 } arg { name: "rank" i: 0 } arg { name: "interface" s: "" } arg { name: "status_blob" s: "create_allreduce_cw_0_status" } arg { name: "transport" s: "tcp" } arg { name: "size" i: 2 } device_option { device_type: 4 hip_gpu_id: 0 } engine: "GLOO"
Docker image: docker.io/rocm/caffe2:rocm1.7-miopen-dev-v1
Thank you.