FasterRCNN Distributed Learning

jpedrofontes commented 6 years ago

Hello.

I have adapted the code from FasterRCNN_train.py to use distributed learning. This is what the learner creation looks like:

# Instantiate the learners and the trainer object
num_quantization_bits = 32
warm_up = 0

lr_schedule = learning_parameter_schedule_per_sample(lr_per_sample)
local_learner = momentum_sgd(others, lr_schedule, mm_schedule,
                             l2_regularization_weight=l2_reg_weight,
                             unit_gain=False, use_mean_gradient=True)
learner = cntk.distributed.data_parallel_distributed_learner(
    local_learner,
    num_quantization_bits=num_quantization_bits,
    distributed_after=warm_up)

bias_lr_per_sample = [v * bias_lr_mult for v in lr_per_sample]
bias_lr_schedule = learning_parameter_schedule_per_sample(bias_lr_per_sample)
bias_local_learner = momentum_sgd(biases, bias_lr_schedule, mm_schedule,
                                  l2_regularization_weight=l2_reg_weight,
                                  unit_gain=False, use_mean_gradient=True)
bias_learner = cntk.distributed.data_parallel_distributed_learner(
    bias_local_learner,
    num_quantization_bits=num_quantization_bits,
    distributed_after=warm_up)
trainer = Trainer(None, (loss, pred_error), [learner, bias_learner])

After running, the error that I get is this:

CUBLAS failure 14: CUBLAS_STATUS_INTERNAL_ERROR ; GPU=0 ; hostname=jfontes ; expr=cublasgemmHelper(cuHandle, transA, transB, m, n, k, &alpha, a.Data(), (int) a.m_numRows, b.Data(), (int) b.m_numRows, &beta, c.Data(), (int) c.m_numRows)
Please provide a detector name as the single argument. Usage:
    python DetectionDemo.py <detector_name>
Available detectors: ['FastRCNN', 'FasterRCNN']
Using default detector: FasterRCNN
training FasterRCNN
Using base model:   AlexNet
lr_per_sample:      [0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.0001, 0.0001, 0.0001, 0.0001, 0.0001, 0.0001, 0.0001, 0.0001, 0.0001, 0.0001, 1e-05]
Training model for 1 epochs.
Training 57513152 parameters in 27 parameter tensors.
Traceback (most recent call last):
  File "DetectionDemo.py", line 63, in <module>
    eval_model = od.train_object_detector(cfg)
  File "C:\Users\jfontes\Documents\AGATHA\CNTK\Examples\Image\Detection\utils\od_utils.py", line 21, in train_object_detector
    eval_model = train_faster_rcnn(cfg)
  File "C:\Users\jfontes\Documents\AGATHA\CNTK\Examples\Image\Detection\FasterRCNN\FasterRCNN_train.py", line 291, in train_faster_rcnn
    eval_model = train_faster_rcnn_e2e(cfg)
  File "C:\Users\jfontes\Documents\AGATHA\CNTK\Examples\Image\Detection\FasterRCNN\FasterRCNN_train.py", line 329, in train_faster_rcnn_e2e
    e2e_lr_per_sample_scaled, mm_schedule, cfg["CNTK"].L2_REG_WEIGHT, cfg["CNTK"].E2E_MAX_EPOCHS, cfg)
  File "C:\Users\jfontes\Documents\AGATHA\CNTK\Examples\Image\Detection\FasterRCNN\FasterRCNN_train.py", line 576, in train_model
    trainer.train_minibatch(data)                                    # update model with it
  File "C:\Users\jfontes\AppData\Local\Continuum\Anaconda3\envs\cntk-py35-gpu\lib\site-packages\cntk\train\trainer.py", line 181, in train_minibatch
    arguments, device)
  File "C:\Users\jfontes\AppData\Local\Continuum\Anaconda3\envs\cntk-py35-gpu\lib\site-packages\cntk\cntk_py.py", line 2975, in train_minibatch_overload_for_minibatchdata
    return _cntk_py.Trainer_train_minibatch_overload_for_minibatchdata(self, *args)
RuntimeError: CUBLAS failure 14: CUBLAS_STATUS_INTERNAL_ERROR ; GPU=0 ; hostname=CVIG-JF ; expr=cublasgemmHelper(cuHandle, transA, transB, m, n, k, &alpha, a.Data(), (int) a.m_numRows, b.Data(), (int) b.m_numRows, &beta, c.Data(), (int) c.m_numRows)

[CALL STACK]
    > Microsoft::MSR::CNTK::CudaTimer::  Stop
    - Microsoft::MSR::CNTK::CudaTimer::  Stop
    - Microsoft::MSR::CNTK::GPUMatrix<float>::  MultiplyAndWeightedAdd
    - Microsoft::MSR::CNTK::Matrix<float>::  MultiplyAndWeightedAdd
    - Microsoft::MSR::CNTK::TensorView<float>::  DoMatrixProductOf
    - Microsoft::MSR::CNTK::TensorView<float>::  AssignMatrixProductOf
    - std::enable_shared_from_this<Microsoft::MSR::CNTK::MatrixBase>::  shared_from_this (x2)
    - CNTK::Internal::  UseSparseGradientAggregationInDataParallelSGD
    - std::enable_shared_from_this<Microsoft::MSR::CNTK::MatrixBase>::  shared_from_this
    - CNTK::Internal::  UseSparseGradientAggregationInDataParallelSGD
    - CNTK::Function::  Forward
    - CNTK::  CreateTrainer
    - CNTK::Trainer::  TotalNumberOfUnitsSeen
    - CNTK::Trainer::  TrainMinibatch (x2)

Have I done anything wrong? Does the FasterRCNN model support distributed learning? I'm trying to run the training on a machine with 2x NVIDIA GeForce GTX 1080. Running on one GPU works, but with both I get the CUBLAS error. I have already run the ConvNet example with the CIFAR10 dataset on both GPUs, and it worked.

ke1337 commented 6 years ago

I think this is the problem. For distributed training, each process should use a different GPU ID.
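A minimal sketch of the idea, assuming one MPI worker per GPU and that the problem is a fixed GPU ID shared by all workers (the line the comment points to is not reproduced in this thread). The helper names `gpu_id_for_rank` and `bind_worker_to_gpu` are hypothetical and not part of the FasterRCNN example; `Communicator.rank()`, `cntk.device.gpu()` and `cntk.device.try_set_default_device()` are existing CNTK calls.

```python
def gpu_id_for_rank(rank, num_gpus):
    """Map an MPI worker rank to a local GPU index (round-robin)."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return rank % num_gpus


def bind_worker_to_gpu(num_gpus):
    """Pick a distinct default GPU for this worker (requires CNTK).

    Assumption: this runs before any network or learner is created.
    """
    # Imported lazily so gpu_id_for_rank stays usable without CNTK installed.
    import cntk
    from cntk.train.distributed import Communicator

    gpu_id = gpu_id_for_rank(Communicator.rank(), num_gpus)
    cntk.device.try_set_default_device(cntk.device.gpu(gpu_id))
    return gpu_id
```

With two workers started through mpiexec on a machine with two GPUs, rank 0 would map to GPU 0 and rank 1 to GPU 1.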

jpedrofontes commented 6 years ago

When running using mpiexec, the program outputs that GPU[0] and GPU[1] are selected. I think that the program is running on both 😕

But I'll try your solution. Do I just need to add that line to FasterRCNN_config.py, or remove it?

kyoro1 commented 6 years ago

Hi @KeDengMS, @spandantiwari. The other day, I submitted some modifications for distributed learning with Faster R-CNN here. Would they help resolve the issue?

jpedrofontes commented 6 years ago

@kyoro1 your implementation works. Thanks 😄

kyoro1 commented 6 years ago

@KeDengMS @spandantiwari: Given the above, how about merging my modification into the master branch?

jpedrofontes commented 6 years ago

I just noticed that @kyoro1's implementation is very slow. It takes 20 seconds or more to process 100 samples. There is probably still work to do.

kyoro1 commented 6 years ago

@jpedrofontes, in your trial, how long did training take with a single GPU, i.e. without the distributed setting? I just want to understand the situation.

jpedrofontes commented 6 years ago

I used a dataset of roughly 5000 800x800 images, and it took 37 seconds to train 100 samples, a rate of about 2.5 samples/s. A single epoch takes more than 50 minutes on a single GPU.
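For reference, the arithmetic behind these figures can be checked with a short script. This is only a rough estimate under the assumption that an epoch is one pass over 5000 images at a constant rate; it ignores data loading, evaluation and startup overhead, which may explain a longer wall-clock time.

```python
def samples_per_second(num_samples, seconds):
    """Throughput from a measured interval."""
    return num_samples / seconds


def epoch_minutes(dataset_size, rate):
    """Minutes for one pass over the dataset at a constant rate."""
    return dataset_size / rate / 60.0


rate = samples_per_second(100, 37)           # 100 samples in 37 seconds
print(round(rate, 2))                        # 2.7
print(round(epoch_minutes(5000, rate), 1))   # 30.8
```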

jpedrofontes commented 6 years ago

Hello @kyoro1

Here are the full details of the issue reported in the previous conversation:

E2E training is too slow on both single and multiple GPUs (about 2.5 samples/s). With a dataset of 5000 images, a single epoch takes more than 30 minutes;

4-stage training is quicker (about 25 samples/s), but it fails at the start of the second stage with the following error:


Traceback (most recent call last):
  File "DetectionDemo.py", line 59, in <module>
    eval_model = od.train_object_detector(cfg)
  File "C:\Users\jfontes\Documents\AGATHA\CNTK\Examples\Image\Detection\utils\od_utils.py", line 21, in train_object_detector
    eval_model = train_faster_rcnn(cfg)
  File "C:\Users\jfontes\Documents\AGATHA\CNTK\Examples\Image\Detection\FasterRCNN\FasterRCNN_train.py", line 293, in train_faster_rcnn
    eval_model = train_faster_rcnn_alternating(cfg)
  File "C:\Users\jfontes\Documents\AGATHA\CNTK\Examples\Image\Detection\FasterRCNN\FasterRCNN_train.py", line 447, in train_faster_rcnn_alternating
    rpn_rois_input=rpn_rois_input, buffered_rpn_proposals=buffered_proposals_s1)
  File "C:\Users\jfontes\Documents\AGATHA\CNTK\Examples\Image\Detection\FasterRCNN\FasterRCNN_train.py", line 541, in train_model
    distributed_after = cfg.WARM_UP)                     # no warm start as default
  File "C:\Users\jfontes\AppData\Local\Continuum\Anaconda3\envs\cntk-py35-gpu\lib\site-packages\cntk\internal\swig_helper.py", line 69, in wrapper
    result = f(*args, **kwds)
  File "C:\Users\jfontes\AppData\Local\Continuum\Anaconda3\envs\cntk-py35-gpu\lib\site-packages\cntk\train\distributed.py", line 143, in data_parallel_distributed_learner
    cntk_py.mpicommunicator(),
RuntimeError: MPIWrapperMpi: this is a singleton class that can only be instantiated once per process

[CALL STACK]
    > std::enable_shared_from_this:: operator=
    - std::enable_shared_from_this::enable_shared_from_this (x2)
    - CNTK:: QuantizedMPICommunicator
    - CNTK:: MPICommunicator
    - PyInit__cntk_py
    - PyCFunction_Call
    - PyEval_GetFuncDesc
    - PyEval_EvalFrameEx (x2)
    - PyFunction_SetAnnotations
    - PyObject_Call
    - PyEval_GetFuncDesc
    - PyEval_EvalFrameEx (x2)
    - PyEval_GetFuncDesc
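One possible reading of this error, not confirmed anywhere in this thread, is that the MPI communicator is torn down and then requested again between stages, for example if `Communicator.finalize()` runs at the end of every `train_model` call. Under that assumption, a sketch like the following defers finalization to process exit so that it happens only once. The helper `finalize_once_at_exit` is hypothetical; `cntk.train.distributed.Communicator.finalize` is the existing CNTK call it is meant to wrap.

```python
import atexit

# Callables already scheduled to run at interpreter exit.
_registered = set()


def finalize_once_at_exit(finalize):
    """Schedule `finalize` to run at process exit, at most once.

    Returns True if it was registered by this call, False if it had
    already been registered earlier in the same process.
    """
    if finalize in _registered:
        return False
    atexit.register(finalize)
    _registered.add(finalize)
    return True


# Hypothetical usage inside the training script (requires CNTK):
#   from cntk.train.distributed import Communicator
#   finalize_once_at_exit(Communicator.finalize)
```

Calling the helper from each training stage would then be harmless, because only the first call registers the finalizer.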

kyoro1 commented 6 years ago

@jpedrofontes Oh, really... So we should wait for the bugs to be resolved, as @KeDengMS said above.

jpedrofontes commented 6 years ago

I'll be waiting. If you need anything from me, just post here 😄

microsoft / CNTK

FasterRCNN Distributed Learning #3077