https://github.com/precedenceguo/mx-rcnn/blob/master/rcnn/symbol/proposal.py
The code for infer_shape does not output such an error message. Please check again.
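For reference, the quoted error is the kind raised by a per-device batch-size check in a proposal layer's shape inference. A minimal, hypothetical sketch of such a check as an MXNet CustomOp (not the repo's actual proposal.py; the argument names and post-NMS count of 2000 are assumptions, and the forward logic is omitted):

import mxnet as mx

# Hypothetical sketch, not mx-rcnn's actual code: a proposal operator
# whose shape inference rejects multi-image per-device batches.
@mx.operator.register('proposal_sketch')
class ProposalSketchProp(mx.operator.CustomOpProp):
    def __init__(self, rpn_post_nms_top_n='2000'):
        # CustomOp keyword arguments arrive as strings
        super(ProposalSketchProp, self).__init__(need_top_grad=False)
        self._rpn_post_nms_top_n = int(rpn_post_nms_top_n)

    def list_arguments(self):
        return ['cls_prob', 'bbox_pred', 'im_info']

    def list_outputs(self):
        return ['output']

    def infer_shape(self, in_shape):
        cls_prob_shape = in_shape[0]
        # proposals are decoded one image at a time, so the per-device
        # batch must contain a single image
        if cls_prob_shape[0] > 1:
            raise ValueError('Only single item batches are supported')
        # rois: (post_nms_top_n, 5) = batch index + 4 box coordinates
        output_shape = (self._rpn_post_nms_top_n, 5)
        return in_shape, [output_shape], []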
@precedenceguo Thanks. You have made an update.
Yes, updated before this issue.
@precedenceguo But I tried your MXNet version and RBG's Caffe version with the same hyperparameters on the same dataset, and the Caffe version is nearly 3 times faster than the MXNet version. I'm new to MXNet; is there something that can be optimized in your code?
That may be true. You did not use the same hyperparameters; some of them are not included here :). Why don't you elaborate on the speed comparison and see if we can make it faster while keeping it parallelizable (Caffe cannot :))?
@precedenceguo I tried them with one image per batch. I noticed that your NMS does not use the GPU; I will add gpu_nms. And what do you mean by some parameters not being included here? I changed rcnn/config.py and the anchor setup in the related functions.
gpu_nms is faster than the Python NMS. Looking forward to a comparison with gpu_nms on both sides.
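For reference, the Python NMS being compared is the classic greedy algorithm. A minimal NumPy sketch, following the common (N, 5) box layout of [x1, y1, x2, y2, score] (mx-rcnn's own implementation may differ in details):

import numpy as np

# Minimal sketch of greedy CPU NMS; thresh is the IoU suppression cutoff.
def py_nms(boxes, thresh):
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    scores = boxes[:, 4]
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = scores.argsort()[::-1]  # process highest-scoring boxes first

    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # intersection of the kept box with every remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        w = np.maximum(0.0, xx2 - xx1 + 1)
        h = np.maximum(0.0, yy2 - yy1 + 1)
        inter = w * h
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # suppress boxes whose overlap with the kept box exceeds the threshold
        order = order[np.where(iou <= thresh)[0] + 1]
    return keep

The O(N^2) pairwise IoU work inside this loop is what gpu_nms moves onto the device.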
@precedenceguo I added gpu_nms. With only one GPU the speed is almost the same, but GPU load is about 30% higher than the Caffe version. Two GPUs do not help the speed.
Thanks for this information.
Taking another look, two GPUs are useful in the training phase (1.5x-1.8x speedup). So did you mean speeding up the testing phase before?
@precedenceguo Because I am new to MXNet, I know little about how MXNet parallelization works. Can you give some clues for further improvements? Training on one GPU and on two GPUs is almost the same speed.
@precedenceguo Multi-GPU looks more like a sequential run. The speed with multiple GPUs is the same as with a single GPU.
I noticed you use the Module class for multi-GPU. But the code in DataParallelExecutorGroup is:

for exec_ in self.execs:
    exec_.forward(is_train=is_train)

I think this is not parallelization.
Try alternate training; there is a speedup. As to your question, try example/image-classification/train_cifar10.py, which also uses DataParallelExecutorGroup to execute.
With gpu_nms, I observed a 1.4x speedup with VGG on 2 GPUs. DataParallelExecutorGroup may look like a sequential run, but that is how the dependency engine works: each forward call only enqueues work asynchronously, so the devices still compute in parallel. The cost of synchronizing VGG is high, so stay tuned for ResNet.
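To illustrate the point (a minimal sketch, not code from this repo, assuming two GPUs are available): MXNet operations return as soon as they are queued with the dependency engine, so a sequential Python loop over devices still runs them concurrently.

import mxnet as mx
import time

# Sketch of asynchronous execution; assumes two GPUs are present.
ctxs = [mx.gpu(0), mx.gpu(1)]
mats = [mx.nd.ones((4096, 4096), ctx=c) for c in ctxs]

start = time.time()
# The Python loop is sequential, but each dot() only enqueues work on its
# device and returns immediately, so the two GPUs compute concurrently.
outs = [mx.nd.dot(m, m) for m in mats]
for out in outs:
    out.wait_to_read()  # block until that device finishes
print('elapsed: %.3f s' % (time.time() - start))

Timed against a single-GPU run of the same workload, the wall time should be close to halved, minus synchronization overhead.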
When I run a command like python train_end2end.py --gpu 0,1, an error occurs: ('Error in proposal.infer_shape: ', 'Only single item batches are supported'). I think it should support multi-GPU; what should I do?