lyuweiwang opened this issue 5 years ago
I cannot reproduce this bug. Could you please provide more information about your software and hardware? Does this config run on the COCO dataset?
I tried all NAS-FPN config files and only the 640 variant worked. I used both the COCO dataset and my own dataset. My hardware and software are listed below:
GPU: 4 * P40
CUDA 9, cuDNN 7.0.5
Setup from scratch
CentOS 7
Python 3.7.3
MXNet 1.6.0
I changed gpus and image_set in the config file. When I use my own dataset, I also change max_num_gt to 500.
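Roughly, the edits look like this (a sketch following the layout of the stock simpledet configs; the exact class and field names may differ in retina_r50v1b_nasfpn_1280_7@384_25epoch.py):

```python
# Sketch of the config edits; class/field names follow the stock simpledet
# configs (KvstoreParam / DatasetParam / PadParam) and may differ in yours.

class KvstoreParam:
    kvstore = "nccl"
    gpus = [0, 1, 2, 3]                # 4 x P40 instead of the stock 8-GPU setting

class DatasetParam:
    image_set = ("my_dataset_train",)  # placeholder name for my own roidb

class PadParam:
    short = 800
    long = 1333
    max_num_gt = 500                   # raised so all ground-truth boxes fit
```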
I have also found another similar problem: when I run TridentNet, the process gets stuck at epoch 0, batch 20340. I don't know whether there is any relation between these two problems.
If you can only run the 640 variant, syncbn may not be functioning correctly.
Try the pre-built Python wheel to make sure you get the right operator.
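For reference, the normalizer is usually selected in the config along these lines (a rough sketch only; the normalizer_factory import and its arguments are assumptions based on the stock configs, so check the actual file):

```python
# Sketch of where the normalizer is chosen in a simpledet config.
# The helper and import path are assumptions; verify them in your config.
from mxnext.complicate import normalizer_factory

class NormalizeParam:
    # cross-GPU synchronized BN; relies on the operator shipped with the
    # pre-built wheel
    normalizer = normalizer_factory(type="syncbn", ndev=4)

    # per-GPU BN, useful to isolate whether syncbn is the problem:
    # normalizer = normalizer_factory(type="local")
```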
I guess you mean "mxnet_cu100-1.6.0b20190820-py2.py3-none-manylinux1_x86_64.whl". Is there a CUDA 9 version? Also, how can I solve the problem of the process getting stuck at epoch 0, batch 20340 when I run TridentNet?
Here is the wheel for CUDA 9: https://github.com/TuSimple/simpledet/blob/master/doc/INSTALL.md#setup-locally-with-pre-built-wheel
The hang may be caused by some image not being read in correctly. Is this happening near the end of an epoch?
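A standalone check like the following (not part of simpledet; the image directory is a placeholder) can rule out unreadable images:

```python
# Sanity check: try to decode every training image and report any failures,
# since a bad or missing image could stall the data loader.
import os
import cv2  # assumes OpenCV is installed

image_dir = "data/my_dataset/images"  # placeholder, point this at your images

unreadable = []
for name in sorted(os.listdir(image_dir)):
    path = os.path.join(image_dir, name)
    if cv2.imread(path) is None:
        unreadable.append(path)

print("unreadable images:", len(unreadable))
for path in unreadable:
    print(path)
```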
I have cleaned my environment and switched to the CUDA 9 wheel, but nothing changed. When I change syncbn to local BN in retina_r50v1b_nasfpn_1280_7@384_25epoch.py, a new error occurs:
Traceback (most recent call last):
File "detection_train.py", line 278, in <module>
train_net(parse_args())
File "detection_train.py", line 259, in train_net
profile=profile
File "/nfs/project/lyuwei/project/simpledet/core/detection_module.py", line 1010, in fit
self.update_metric(eval_metric, data_batch.label)
File "/nfs/project/lyuwei/project/simpledet/core/detection_module.py", line 789, in update_metric
self._exec_group.update_metric(eval_metric, labels, pre_sliced)
File "/tmp-data/lyuwei/miniconda3/lib/python3.7/site-packages/mxnet/module/executor_group.py", line 640, in update_metric
eval_metric.update_dict(labels_, preds)
File "/tmp-data/lyuwei/miniconda3/lib/python3.7/site-packages/mxnet/metric.py", line 350, in update_dict
metric.update_dict(labels, preds)
File "/tmp-data/lyuwei/miniconda3/lib/python3.7/site-packages/mxnet/metric.py", line 133, in update_dict
self.update(label, pred)
File "/nfs/project/lyuwei/project/simpledet/models/retinanet/metric.py", line 34, in update
pred_label = pred_label.asnumpy().astype('int32')
File "/tmp-data/lyuwei/miniconda3/lib/python3.7/site-packages/mxnet/ndarray/ndarray.py", line 2392, in asnumpy
ctypes.c_size_t(data.size)))
File "/tmp-data/lyuwei/miniconda3/lib/python3.7/site-packages/mxnet/base.py", line 254, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [17:08:44] src/ndarray/ndarray_function.cu:45: Check failed: to->type_flag_ == from.type_flag_ (0 vs. 6) : Source and target must have the same data type when copying across devices.
> The hang may be caused by some image not being read in correctly. Is this happening near the end of an epoch?
I have tested all the images; there are no corrupted or missing images. The hang does seem to happen near the end of an epoch. I have 320,000 images and use 6 images per batch on 4 GPUs, so as the code indicates, one epoch should end around batch 13,333. However, the epoch does not end until about 20,000 batches.
We pre-flip images in the roidb, so each epoch in simpledet counts as 2 epochs. I can confirm the NASFPN issue and will look into it soon.
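With the pre-flipped copies counted in, the batch count per simpledet epoch works out roughly like this (numbers taken from your report):

```python
# Back-of-the-envelope check of batches per simpledet epoch, using the
# reported numbers (320,000 images, 6 images per GPU, 4 GPUs).
num_images = 320_000
images_per_gpu = 6
num_gpus = 4

effective_batch = images_per_gpu * num_gpus   # 24 images per iteration
roidb_size = num_images * 2                   # originals + pre-flipped copies
print(roidb_size // effective_batch)          # ~26,666 batches per epoch
```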
> We pre-flip images in the roidb, so each epoch in simpledet counts as 2 epochs. I can confirm the NASFPN issue and will look into it soon.
The problem of getting stuck at around 20,000 batches happens with tridentnet_r101v2c4_c5_multiscale_addminival_3x_fp16.py. Following your explanation, the hang really does happen near the end of an epoch. Is there a solution?
When I run "python detection_train.py --config config/NASFPN/retina_r50v1b_nasfpn_1280_7\@384_25epoch.py", the process gets stuck. It may be hanging while initializing the model parameters, because the last output is the following:
09-03 19:35:26 Initialized cls_conv4_bn_s32_moving_var as ["one", {}]: 1.0
09-03 19:35:26 Initialized cls_conv4_bn_s64_moving_mean as ["zero", {}]: 0.0
09-03 19:35:26 Initialized cls_conv4_bn_s64_moving_var as ["one", {}]: 1.0
09-03 19:35:26 Initialized cls_conv4_bn_s8_moving_mean as ["zero", {}]: 0.0
09-03 19:35:26 Initialized cls_conv4_bn_s8_moving_var as ["one", {}]: 1.0
Also, I am using my own data.