rocmarchive / realcaffe2

The repo is obsolete. Use at your own risk.
https://github.com/pytorch/pytorch
Apache License 2.0
12 stars 2 forks

rocm-caffe2 distributed mode #140

Closed: learner321 closed this issue 6 years ago

learner321 commented 6 years ago

docker images: docker.io/rocm/caffe2:rocm1.7-miopen-dev-v1

Single-node training works fine with:

```shell
python ../caffe2/python/examples/resnet50_trainer.py \
    --train_data=/work/DL/dataset/cifar10/cifar10_train_lmdb/ \
    --num_shards=1 --shard_id=0 --run_id=10 \
    --file_store_path=/work/caffe2/
```

The work directory is NFS-shared. But when I run on two nodes I get errors.

node1:

```shell
python ../caffe2/python/examples/resnet50_trainer.py \
    --train_data=/work/DL/dataset/cifar10/cifar10_train_lmdb/ \
    --num_shards=2 --shard_id=0 --run_id=10 \
    --file_store_path=/work/caffe2/
```

node2:

```shell
python ../caffe2/python/examples/resnet50_trainer.py \
    --train_data=/work/DL/dataset/cifar10/cifar10_train_lmdb/ \
    --num_shards=2 --shard_id=1 --run_id=10 \
    --file_store_path=/work/caffe2/
```

The output is as follows:

```
RuntimeError: [enforce fail at no_default_engine_op.h:45] .
The operator CreateCommonWorld does not have a default engine implementation.
Please specify an engine explicitly for this operator.
Error from operator:
input: "store_handler"
output: "allreduce_0_cw"
name: "allreduce_0_cw_op"
type: "CreateCommonWorld"
arg { name: "timeout_ms" i: 30000 }
arg { name: "rank" i: 0 }
arg { name: "interface" s: "" }
arg { name: "status_blob" s: "create_allreduce_cw_0_status" }
arg { name: "transport" s: "tcp" }
arg { name: "size" i: 2 }
device_option { device_type: 4 hip_gpu_id: 0 }
engine: "GLOO"
```
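For context, the failing `CreateCommonWorld` operator is the rendezvous step: each shard uses the shared `--file_store_path` directory (NFS here) to discover its peers before Gloo builds the communication group. A toy sketch of that file-based rendezvous pattern, in plain Python (this is an illustration of the mechanism only, not the actual Caffe2/Gloo FileStore code; all names are made up):

```python
# Each shard writes a marker file into a shared store directory, then
# polls until markers for all num_shards shards are present, analogous
# to Gloo's file-store rendezvous with its timeout_ms deadline.
import os
import tempfile
import threading
import time

def rendezvous(store_path, run_id, shard_id, num_shards, timeout_s=5.0):
    run_dir = os.path.join(store_path, f"run_{run_id}")
    os.makedirs(run_dir, exist_ok=True)
    # Announce this shard to its peers.
    with open(os.path.join(run_dir, f"shard_{shard_id}"), "w") as f:
        f.write(str(shard_id))
    # Poll the shared directory until every shard has checked in.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        markers = [n for n in os.listdir(run_dir) if n.startswith("shard_")]
        if len(markers) == num_shards:
            return sorted(markers)
        time.sleep(0.05)
    raise TimeoutError(f"shard {shard_id}: peers never arrived")

store = tempfile.mkdtemp()
results = {}
# Simulate the two nodes from the issue (--num_shards=2) with threads.
workers = [
    threading.Thread(
        target=lambda i=i: results.update({i: rendezvous(store, "10", i, 2)})
    )
    for i in range(2)
]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(results[0])  # -> ['shard_0', 'shard_1']
```

Note that the error above is raised after this discovery phase, when Caffe2 tries to instantiate `CreateCommonWorld` with the `GLOO` engine, which the ROCm build does not register.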

Thank you.

ashishfarmer commented 6 years ago

Currently the ROCm port of Caffe2 supports only a single GPU. We will add multi-GPU support in the future.

learner321 commented 6 years ago

ok, got it. Thank you:)

petrex commented 6 years ago

https://github.com/pytorch/pytorch