Closed shilpi2015 closed 5 years ago
Could you please tell me how you set up your environment?
Ubuntu 16.04, Python 3.6 in a virtual environment, CUDA version 9.0, CUDNN_MAJOR 7. I have built MXNet from scratch, and set up mxnext and Cython according to the instructions in the README file.
The following error occurs after "Merging stage4_unit3_bn3":
08-28 12:55:37 total iter 87942
08-28 12:55:37 lr 0.02, lr_iters [60000, 80000]
08-28 12:55:37 lr mode: step
08-28 12:55:37 warmup lr 0.0, warmup step 3000
Traceback (most recent call last):
  File "/home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/symbol/symbol.py", line 1675, in simple_bind
    ctypes.byref(exe_handle)))
  File "/home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/base.py", line 254, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [10:25:40] src/engine/./../common/cuda_utils.h:319: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: invalid device ordinal
Stack trace:
  [bt] (0) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x32) [0x7f6fe4d6c222]
  [bt] (1) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(mxnet::common::cuda::DeviceStore::SetDevice(int)+0xd8) [0x7f6fe733cc98]
  [bt] (2) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(mxnet::common::cuda::DeviceStore::DeviceStore(int, bool)+0x48) [0x7f6fe733cd08]
  [bt] (3) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(mxnet::storage::GPUPooledStorageManager::Alloc(mxnet::Storage::Handle)+0x10e) [0x7f6fe736086e]
  [bt] (4) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(mxnet::StorageImpl::Alloc(mxnet::Storage::Handle)+0x57) [0x7f6fe7363197]
  [bt] (5) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(+0x312c109) [0x7f6fe6c15109]
  [bt] (6) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(mxnet::NDArray::Chunk::Chunk(mxnet::TShape, mxnet::Context, bool, int)+0x198) [0x7f6fe6c3aca8]
  [bt] (7) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(mxnet::NDArray::NDArray(mxnet::TShape const&, mxnet::Context, bool, int)+0x97) [0x7f6fe6c3b757]
  [bt] (8) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(mxnet::common::InitZeros(mxnet::NDArrayStorageType, mxnet::TShape const&, mxnet::Context const&, int)+0x58) [0x7f6fe6c3b978]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "detection_train.py", line 277, in
terminate called without an active exception
terminate called recursively
Aborted (core dumped)
It seems your GPU change has not taken effect, since total iter and lr_iters are still the values for 8 GPUs. If the change had been applied, these iteration counts would be multiplied by 8.
08-28 12:55:37 total iter 87942
08-28 12:55:37 lr 0.02, lr_iters [60000, 80000]
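The check described above can be worked through numerically. The sketch below assumes, as the comment states, that iteration counts scale inversely with the number of GPUs (so going from 8 GPUs to 1 multiplies them by 8); the function name `scale_schedule` is hypothetical and is not part of SimpleDet.

```python
def scale_schedule(total_iter, lr_iters, old_gpus, new_gpus):
    """Rescale iteration counts when the GPU count changes.

    Assumption: the per-GPU batch size stays fixed, so fewer GPUs need
    proportionally more iterations to process the same number of images.
    """
    scaled_total = total_iter * old_gpus // new_gpus
    scaled_steps = [it * old_gpus // new_gpus for it in lr_iters]
    return scaled_total, scaled_steps


# Values taken from the log above (printed for the 8-GPU setting).
total, steps = scale_schedule(87942, [60000, 80000], old_gpus=8, new_gpus=1)
print(total, steps)  # 703536 [480000, 640000]
```

If the log still prints 87942 and [60000, 80000] after the edit, the single-GPU setting has most likely not been picked up.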
To train on a single GPU, I changed line 36 of the config file tridentnet_r50v1c4_c5_1x.py from gpus = [0, 1, 2, 3, 4, 5, 6, 7] to gpus = [0].
Can you please suggest what else I have to change to train on a single GPU?
@shilpi2015, changing that line should definitely work, but in your case it does not. Could you please try printing the gpu field in detection_train.py?
I have printed the gpu field. The output is [gpu(0), gpu(1), gpu(2), gpu(3), gpu(4), gpu(5), gpu(6), gpu(7)]
It seems the changes made in the config file are not being picked up.
class KvstoreParam:
    kvstore = "local"
    batch_image = General.batch_image
    gpus = [0]
    fp16 = General.fp16
The local kvstore is still using 8 GPUs.
This is quite strange. Could you please delete the .pyc files or the __pycache__ directory and try again?
Thank you so much, Roger. Deleting the .pyc files solved the issue. Training is now running.
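For anyone hitting the same stale-config symptom, the cleanup can be scripted. This is a minimal sketch using only the Python standard library; the helper name `clear_bytecode` is hypothetical, and it assumes the stale bytecode lives somewhere under the directory you pass in.

```python
import shutil
from pathlib import Path


def clear_bytecode(root):
    """Delete __pycache__ directories and stray .pyc files under root.

    Returns the list of removed paths so the caller can see what changed.
    """
    root = Path(root)
    removed = []
    # Materialize the listing first so deletion does not disturb the walk.
    for cache_dir in list(root.rglob("__pycache__")):
        if cache_dir.is_dir():
            shutil.rmtree(cache_dir)
            removed.append(cache_dir)
    # Python 2 style .pyc files sit next to their sources, outside __pycache__.
    for pyc in list(root.rglob("*.pyc")):
        pyc.unlink()
        removed.append(pyc)
    return removed
```

Calling `clear_bytecode("config")` from the repository root would limit the cleanup to the config directory. Pointing it at the whole repository also works, but it will clear the caches of any virtual environment stored inside it; Python regenerates those on the next import.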
While training with the command below, the following error is generated:
$ python3 detection_train.py --config config/tridentnet_r50v2c4_c5_1x.py
RuntimeError: simple_bind error. Arguments:
data: (2, 3, 800, 1200)
im_info: (2, 3)
gt_bbox: (2, 100, 5)
valid_ranges: (2, 3, 2)
rpn_cls_label: (2, 3, 56250)
rpn_reg_target: (2, 3, 60, 50, 75)
rpn_reg_weight: (2, 3, 60, 50, 75)
[17:35:54] src/engine/./../common/cuda_utils.h:319: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: invalid device ordinal
Stack trace:
  [bt] (0) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x32) [0x7f5790308222]
  [bt] (1) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(mxnet::common::cuda::DeviceStore::SetDevice(int)+0xd8) [0x7f57928d8c98]
  [bt] (2) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(mxnet::common::cuda::DeviceStore::DeviceStore(int, bool)+0x48) [0x7f57928d8d08]
  [bt] (3) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(mxnet::storage::GPUPooledStorageManager::Alloc(mxnet::Storage::Handle)+0x10e) [0x7f57928fc86e]
  [bt] (4) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(mxnet::StorageImpl::Alloc(mxnet::Storage::Handle)+0x57) [0x7f57928ff197]
  [bt] (5) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(+0x312c109) [0x7f57921b1109]
  [bt] (6) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(mxnet::NDArray::Chunk::Chunk(mxnet::TShape, mxnet::Context, bool, int)+0x198) [0x7f57921d6ca8]
  [bt] (7) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(mxnet::NDArray::NDArray(mxnet::TShape const&, mxnet::Context, bool, int)+0x97) [0x7f57921d7757]
  [bt] (8) /home/shilpi/shilpi/simpledet-master/work3.6/lib/python3.6/site-packages/mxnet-1.6.0-py3.6.egg/mxnet/libmxnet.so(mxnet::common::InitZeros(mxnet::NDArrayStorageType, mxnet::TShape const&, mxnet::Context const&, int)+0x58) [0x7f57921d7978]
terminate called without an active exception
terminate called recursively
Aborted (core dumped)
In the config file, I have used gpus = [0] instead of gpus = [0, 1, 2, 3, 4, 5, 6, 7], but I am still getting the above error. Please help.
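"CUDA: invalid device ordinal" generally means a GPU index was requested that the CUDA runtime cannot see. A quick sanity check is to compare the requested GPU list against the number of visible devices. The sketch below is a plain-Python illustration, not SimpleDet code; `invalid_ordinals` is a hypothetical helper, and the visible device count is passed in by hand. Note also that the command above names tridentnet_r50v2c4_c5_1x.py, while the earlier edit was made to tridentnet_r50v1c4_c5_1x.py; the thread does not say whether both files were changed.

```python
def invalid_ordinals(requested, visible_count):
    """Return the requested GPU ids that fall outside 0..visible_count-1."""
    return [gpu for gpu in requested if not 0 <= gpu < visible_count]


# Assumption: with MXNet installed, the count could be obtained from
# mx.context.num_gpus(); here it is hard-coded so the sketch runs anywhere.
print(invalid_ordinals([0, 1, 2, 3, 4, 5, 6, 7], visible_count=1))  # [1, 2, 3, 4, 5, 6, 7]
print(invalid_ordinals([0], visible_count=1))  # []
```

A non-empty result for the gpu list actually printed by detection_train.py would point to the same stale or mismatched configuration seen earlier in the thread.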