msracver / Deformable-ConvNets

Deformable Convolutional Networks
MIT License

Distributed training (across multiple machines) #183

Open arunbuduri opened 6 years ago

arunbuduri commented 6 years ago

According to https://mxnet.incubator.apache.org/faq/multi_devices.html, MXNet supports training on a distributed cluster across several machines (with multiple GPUs per machine).

I'm looking to train using this repo in a distributed setup with the following assumptions.

  1. Each machine (with Ubuntu) has 4 K80 GPUs (2 physical cards)
  2. Set up a cluster of 8 or 10 such machines

Before I start, I wanted to check whether anyone has tried training this repo in such a distributed setup. If so, could you share any setup requirements, settings, etc.?

If you've not tried this repo specifically but have tried MXNet distributed training in general, could you share any "heads-up" issues, settings, requirements, etc. needed to get training working in a distributed environment?
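For reference, here is a rough sketch of how a launch for the cluster described above could be put together. It assumes MXNet's `tools/launch.py` helper with its `-n` (workers), `-s` (servers), `-H` (hostfile) and `--launcher` options, and one worker plus one parameter server per machine. The training command (`train.py` with `--kv-store` and `--gpus`) is a placeholder in the style of MXNet's example scripts, not this repo's actual entry point, and the `hosts` filename and the helper function names are made up for illustration.

```python
import shlex


def gpu_list(gpus_per_machine):
    """Return a comma-separated GPU id string such as "0,1,2,3"."""
    return ",".join(str(i) for i in range(gpus_per_machine))


def global_batch_size(num_machines, gpus_per_machine, images_per_gpu):
    """With synchronous updates, one step aggregates gradients from every GPU on every worker."""
    return num_machines * gpus_per_machine * images_per_gpu


def build_launch_command(num_machines, hostfile, train_cmd, launcher="ssh"):
    """Build an argv list for MXNet's tools/launch.py.

    Assumes one worker and one parameter server per machine.
    """
    if num_machines < 1:
        raise ValueError("num_machines must be at least 1")
    launch = [
        "python", "tools/launch.py",
        "-n", str(num_machines),  # number of worker processes
        "-s", str(num_machines),  # number of parameter-server processes
        "-H", hostfile,           # text file with one hostname or IP per line
        "--launcher", launcher,   # launch.py starts the remote processes this way
    ]
    # train_cmd is a placeholder; substitute the real training entry point.
    return launch + shlex.split(train_cmd)


# Cluster from the question: 8 machines, 4 K80 GPUs each.
train_cmd = "python train.py --kv-store dist_sync --gpus " + gpu_list(4)
print(" ".join(build_launch_command(8, "hosts", train_cmd)))
print("global batch size:", global_batch_size(8, 4, 1))
```

With the `ssh` launcher, each machine listed in the hostfile would need passwordless SSH access from the launching node and the same code and data paths. The learning rate usually needs retuning for the larger global batch size.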

YoWhatever commented 5 years ago

I'm working on this too. Have you got it working? Any advice? Thanks.

chinakook commented 5 years ago

The official MXNet developers have reproduced several RCNN models with SOTA results in https://github.com/dmlc/gluon-cv. I think you can migrate to Gluon-CV.