tusen-ai / simpledet

A Simple and Versatile Framework for Object Detection and Instance Recognition
Apache License 2.0

Training with Single GPU #211

Closed shilpi2015 closed 5 years ago

shilpi2015 commented 5 years ago

While training with

$ python3 detection_train.py --config config/tridentnet_r50v2c4_c5_1x.py

the following error is generated:

RuntimeError: simple_bind error. Arguments:
data: (2, 3, 800, 1200)
im_info: (2, 3)
gt_bbox: (2, 100, 5)
valid_ranges: (2, 3, 2)
rpn_cls_label: (2, 3, 56250)
rpn_reg_target: (2, 3, 60, 50, 75)
rpn_reg_weight: (2, 3, 60, 50, 75)
[17:35:54] src/engine/./../common/cuda_utils.h:319: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: invalid device ordinal
Stack trace:
[bt] (0) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x32) [0x7f5790308222]
[bt] (1) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(mxnet::common::cuda::DeviceStore::SetDevice(int)+0xd8) [0x7f57928d8c98]
[bt] (2) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(mxnet::common::cuda::DeviceStore::DeviceStore(int, bool)+0x48) [0x7f57928d8d08]
[bt] (3) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(mxnet::storage::GPUPooledStorageManager::Alloc(mxnet::Storage::Handle*)+0x10e) [0x7f57928fc86e]
[bt] (4) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(mxnet::StorageImpl::Alloc(mxnet::Storage::Handle*)+0x57) [0x7f57928ff197]
[bt] (5) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(+0x312c109) [0x7f57921b1109]
[bt] (6) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(mxnet::NDArray::Chunk::Chunk(mxnet::TShape, mxnet::Context, bool, int)+0x198) [0x7f57921d6ca8]
[bt] (7) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(mxnet::NDArray::NDArray(mxnet::TShape const&, mxnet::Context, bool, int)+0x97) [0x7f57921d7757]
[bt] (8) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(mxnet::common::InitZeros(mxnet::NDArrayStorageType, mxnet::TShape const&, mxnet::Context const&, int)+0x58) [0x7f57921d7978]

terminate called without an active exception
terminate called recursively
Aborted (core dumped)

In the config file, I have used gpus = [0] instead of gpus = [0, 1, 2, 3, 4, 5, 6, 7], but I am still getting the above error. Please help.

RogerChern commented 5 years ago

Could you please tell me how you set up your environment?

shilpi2015 commented 5 years ago

Ubuntu 16.04, Python 3.6 in a virtual environment, CUDA 9.0, cuDNN 7. I have built MXNet from source, and set up mxnext and the Cython extensions according to the instructions in the README.

The following error occurs after merging stage4_unit3_bn3:

08-28 12:55:37 total iter 87942
08-28 12:55:37 lr 0.02, lr_iters [60000, 80000]
08-28 12:55:37 lr mode: step
08-28 12:55:37 warmup lr 0.0, warmup step 3000
Traceback (most recent call last):
  File "/home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/symbol/symbol.py", line 1675, in simple_bind
    ctypes.byref(exe_handle)))
  File "/home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/base.py", line 254, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [10:25:40] src/engine/./../common/cuda_utils.h:319: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: invalid device ordinal
Stack trace:
[bt] (0) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x32) [0x7f6fe4d6c222]
[bt] (1) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(mxnet::common::cuda::DeviceStore::SetDevice(int)+0xd8) [0x7f6fe733cc98]
[bt] (2) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(mxnet::common::cuda::DeviceStore::DeviceStore(int, bool)+0x48) [0x7f6fe733cd08]
[bt] (3) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(mxnet::storage::GPUPooledStorageManager::Alloc(mxnet::Storage::Handle*)+0x10e) [0x7f6fe736086e]
[bt] (4) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(mxnet::StorageImpl::Alloc(mxnet::Storage::Handle*)+0x57) [0x7f6fe7363197]
[bt] (5) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(+0x312c109) [0x7f6fe6c15109]
[bt] (6) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(mxnet::NDArray::Chunk::Chunk(mxnet::TShape, mxnet::Context, bool, int)+0x198) [0x7f6fe6c3aca8]
[bt] (7) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(mxnet::NDArray::NDArray(mxnet::TShape const&, mxnet::Context, bool, int)+0x97) [0x7f6fe6c3b757]
[bt] (8) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(mxnet::common::InitZeros(mxnet::NDArrayStorageType, mxnet::TShape const&, mxnet::Context const&, int)+0x58) [0x7f6fe6c3b978]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "detection_train.py", line 277, in <module>
    train_net(parse_args())
  File "detection_train.py", line 259, in train_net
    profile=profile
  File "/home/shilpi/shilpi/simpledet-master/core/detection_module.py", line 965, in fit
    for_training=True, force_rebind=force_rebind)
  File "/home/shilpi/shilpi/simpledet-master/core/detection_module.py", line 446, in bind
    state_names=self._state_names)
  File "/home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/module/executor_group.py", line 280, in __init__
    self.bind_exec(data_shapes, label_shapes, shared_group)
  File "/home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/module/executor_group.py", line 376, in bind_exec
    shared_group))
  File "/home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/module/executor_group.py", line 670, in _bind_ith_exec
    shared_buffer=shared_data_arrays, **input_shapes)
  File "/home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/symbol/symbol.py", line 1681, in simple_bind
    raise RuntimeError(error_msg)
RuntimeError: simple_bind error. Arguments:
data: (2, 3, 800, 1200)
im_info: (2, 3)
gt_bbox: (2, 100, 5)
valid_ranges: (2, 3, 2)
rpn_cls_label: (2, 3, 56250)
rpn_reg_target: (2, 3, 60, 50, 75)
rpn_reg_weight: (2, 3, 60, 50, 75)
[10:25:40] src/engine/./../common/cuda_utils.h:319: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: invalid device ordinal
Stack trace:
[bt] (0) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x32) [0x7f6fe4d6c222]
[bt] (1) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(mxnet::common::cuda::DeviceStore::SetDevice(int)+0xd8) [0x7f6fe733cc98]
[bt] (2) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(mxnet::common::cuda::DeviceStore::DeviceStore(int, bool)+0x48) [0x7f6fe733cd08]
[bt] (3) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(mxnet::storage::GPUPooledStorageManager::Alloc(mxnet::Storage::Handle*)+0x10e) [0x7f6fe736086e]
[bt] (4) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(mxnet::StorageImpl::Alloc(mxnet::Storage::Handle*)+0x57) [0x7f6fe7363197]
[bt] (5) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(+0x312c109) [0x7f6fe6c15109]
[bt] (6) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(mxnet::NDArray::Chunk::Chunk(mxnet::TShape, mxnet::Context, bool, int)+0x198) [0x7f6fe6c3aca8]
[bt] (7) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(mxnet::NDArray::NDArray(mxnet::TShape const&, mxnet::Context, bool, int)+0x97) [0x7f6fe6c3b757]
[bt] (8) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(mxnet::common::InitZeros(mxnet::NDArrayStorageType, mxnet::TShape const&, mxnet::Context const&, int)+0x58) [0x7f6fe6c3b978]

terminate called without an active exception
terminate called recursively
Aborted (core dumped)

RogerChern commented 5 years ago

It seems your change of GPUs has not taken effect, since the total iter and lr_iters are still set for 8 GPUs. If the change were successful, these iteration counts should be multiplied by 8:

08-28 12:55:37 total iter 87942
08-28 12:55:37 lr 0.02, lr_iters [60000, 80000]
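As a rough sketch of the arithmetic (the dataset size and epoch count below are placeholders for illustration, not values from this thread), the schedule length scales inversely with the number of GPUs when the per-GPU batch_image is fixed:

def total_iters(num_images, num_epochs, batch_image, num_gpus):
    # images consumed per optimizer step = batch_image * num_gpus
    return num_images * num_epochs // (batch_image * num_gpus)

# placeholder numbers, for illustration only
print(total_iters(117000, 12, 2, 8))  # roughly the 8-GPU schedule length
print(total_iters(117000, 12, 2, 1))  # 8x longer on a single GPU

So once the switch to a single GPU actually takes effect, the logged total iter should be about 8 times the 87942 shown above.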
shilpi2015 commented 5 years ago

To train on a single GPU, I have changed line 36 of the config file tridentnet_r50v1c4_c5_1x.py from gpus = [0, 1, 2, 3, 4, 5, 6, 7] to gpus = [0].

Can you please suggest what else I need to change for training on a single GPU?

RogerChern commented 5 years ago

@shilpi2015, changing that line should definitely work, but in your case it does not. Could you please try printing the gpus field in detection_train.py?
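For example, a minimal self-contained sketch of the check (the class below is a stand-in for the one in your config; in detection_train.py you would print the real KvstoreParam that the script loads):

import mxnet as mx

class KvstoreParam:            # stand-in for the class in the config file
    gpus = [0]

ctx = [mx.gpu(i) for i in KvstoreParam.gpus]
print(ctx)   # expect [gpu(0)]; seeing [gpu(0), ..., gpu(7)] means the edited config is not being read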

shilpi2015 commented 5 years ago

I have printed the gpus field. The output is [gpu(0), gpu(1), gpu(2), gpu(3), gpu(4), gpu(5), gpu(6), gpu(7)].

It seems the changes made in the config file are not being picked up.

class KvstoreParam:
    kvstore     = "local"
    batch_image = General.batch_image
    # gpus      = [0, 1, 2, 3, 4, 5, 6, 7]
    gpus        = [0]
    fp16        = General.fp16

The local kvstore is still picking up 8 GPUs.

RogerChern commented 5 years ago

This is quite strange. Could you please delete the .pyc files or __pycache__ directories and try again?
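A minimal cleanup sketch, assuming it is run from the SimpleDet source directory (deleting the stale bytecode caches by hand works just as well):

import pathlib, shutil

root = pathlib.Path(".")                       # the simpledet checkout
for pyc in list(root.rglob("*.pyc")):          # stale compiled bytecode
    pyc.unlink()
for cache in list(root.rglob("__pycache__")):  # Python 3 bytecode cache dirs
    shutil.rmtree(cache, ignore_errors=True)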

shilpi2015 commented 5 years ago

Thank you so much, Roger. Deleting the .pyc files solved the issue. Training is now running.