tusen-ai / simpledet

A Simple and Versatile Framework for Object Detection and Instance Recognition
Apache License 2.0

MXNetError when resume training from checkpoint #329

Closed SimpleXP closed 4 years ago

SimpleXP commented 4 years ago

I trained my TridentNet for 3 epochs and then stopped. When I tried to resume training from the third-epoch checkpoint, I got the following error:

05-26 11:28:46 total iter 131120
05-26 11:28:46 lr 0.01, lr_iters [254440, 361106]
05-26 11:28:46 lr mode: step
05-26 11:29:17 MEM usage: 5960 MiB
Traceback (most recent call last):
  File "detection_train.py", line 313, in
  File "detection_train.py", line 295, in train_net
  File "/home/xpxu/Project/simpledet/core/detection_module.py", line 977, in fit
    allow_missing=allow_missing, force_init=force_init)
  File "/home/xpxu/Project/simpledet/core/detection_module.py", line 329, in init_params
    _impl(desc, arr, arg_params)
  File "/home/xpxu/Project/simpledet/core/detection_module.py", line 317, in _impl
    cache_arr.copyto(arr)
  File "/home/xpxu/.conda/envs/simpledet/lib/python3.7/site-packages/mxnet/ndarray/ndarray.py", line 2646, in copyto
    return _internal._copyto(self, out=other)
  File "", line 27, in _copyto
  File "/home/xpxu/.conda/envs/simpledet/lib/python3.7/site-packages/mxnet/_ctypes/ndarray.py", line 107, in _imperative_invoke
    ctypes.byref(out_stypes)))
  File "/home/xpxu/.conda/envs/simpledet/lib/python3.7/site-packages/mxnet/base.py", line 278, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [11:29:17] src/operator/numpy/linalg/./../../tensor/../elemwise_op_common.h:135: Check failed: assign(&dattr, vec.at(i)): Incompatible attr in node at 0-th output: expected [1,64,1,1], got [64]

Segmentation fault: 11

Segmentation fault (core dumped)

It reports an MXNetError: Incompatible attr in node at 0-th output: expected [1,64,1,1], got [64]

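According to the traceback, the error is raised inside the mod.fit() call in detection_train.py, while the checkpoint parameters are being copied into the network (init_params).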

Does anyone have a solution? I did not have this problem with an older version of simpledet. Is this an MXNet problem or a simpledet problem?

Thanks in advance.

SimpleXP commented 4 years ago

P.S. I use the default tridentnet_r101v2c4_c5_1x.py configuration.

SimpleXP commented 4 years ago

I checked the log file. No layer has the shape [1, 64, 1, 1], but the gamma, beta, mean and var parameters have the shape [64].
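
One way to narrow this down is to compare the shapes stored in the checkpoint against the shapes the network expects, and list every parameter that differs. The following is a minimal sketch in plain Python, not simpledet or MXNet code: the function name and the parameter names (bn0_gamma and so on) are hypothetical, and only the shapes [64] and [1, 64, 1, 1] come from the error message above.

```python
from math import prod


def find_shape_mismatches(saved_shapes, expected_shapes):
    """Return (name, saved, expected, reshapable) for parameters whose shapes differ.

    Both arguments map parameter names to shape tuples. `reshapable` is True
    when the two shapes hold the same number of elements, so a plain reshape
    could in principle reconcile them.
    """
    mismatches = []
    for name, expected in expected_shapes.items():
        saved = saved_shapes.get(name)
        if saved is None or tuple(saved) == tuple(expected):
            continue  # parameter missing from the checkpoint, or shapes already agree
        reshapable = prod(saved) == prod(expected)
        mismatches.append((name, tuple(saved), tuple(expected), reshapable))
    return mismatches


# Hypothetical parameter names; the shapes mirror the ones in the error message.
saved = {
    "bn0_gamma": (64,),
    "bn0_beta": (64,),
    "conv0_weight": (64, 3, 7, 7),
}
expected = {
    "bn0_gamma": (1, 64, 1, 1),
    "bn0_beta": (1, 64, 1, 1),
    "conv0_weight": (64, 3, 7, 7),
}

for name, got, want, ok in find_shape_mismatches(saved, expected):
    print(f"{name}: checkpoint {got} vs. network {want}, same element count: {ok}")
```

In a real run, the saved shapes would presumably come from the checkpoint file (for example via mx.nd.load and each array's .shape) and the expected shapes from the bound module. If every mismatch has the same element count, as with [64] versus [1, 64, 1, 1], the difference is only in layout, which would point to a change in how the normalization parameters are declared rather than to a corrupted checkpoint.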