tusen-ai / simpledet

A Simple and Versatile Framework for Object Detection and Instance Recognition
Apache License 2.0

Training process is stuck #220

Open lyuweiwang opened 5 years ago

lyuweiwang commented 5 years ago

When I run `python detection_train.py --config config/NASFPN/retina_r50v1b_nasfpn_1280_7@384_25epoch.py`, the process gets stuck. Maybe it is stuck while initializing the model parameters, because the last output is the following:

```
09-03 19:35:26 Initialized cls_conv4_bn_s32_moving_var as ["one", {}]: 1.0
09-03 19:35:26 Initialized cls_conv4_bn_s64_moving_mean as ["zero", {}]: 0.0
09-03 19:35:26 Initialized cls_conv4_bn_s64_moving_var as ["one", {}]: 1.0
09-03 19:35:26 Initialized cls_conv4_bn_s8_moving_mean as ["zero", {}]: 0.0
09-03 19:35:26 Initialized cls_conv4_bn_s8_moving_var as ["one", {}]: 1.0
```

Besides, I use my own data.

RogerChern commented 5 years ago

I cannot reproduce this bug. Could you please provide more information about your software and hardware? Does this config run on the COCO dataset?

lyuweiwang commented 5 years ago

I tried all NAS-FPN config files and only the 640 one worked. I use both the COCO dataset and my own dataset. My hardware and software are listed below:

- GPU: 4 * P40
- CUDA 9
- cuDNN 7.0.5
- Setup from scratch
- CentOS 7
- Python 3.7.3
- mxnet 1.6.0

lyuweiwang commented 5 years ago

I changed gpus and image_set in the config file. When I use my own dataset, I also change max_num_gt to 500.

lyuweiwang commented 5 years ago

I just found another similar problem. When I run TridentNet, the process gets stuck at epoch 0, batch 20340. However, I don't know whether the two problems are related.

RogerChern commented 5 years ago

If you can only run the 640 variant, maybe syncbn is not functioning correctly.

RogerChern commented 5 years ago

Try the pre-built Python wheel to ensure you get the right operator.
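
As a quick sanity check (a rough sketch; it assumes the syncbn used here is registered as the contrib SyncBatchNorm operator in your MXNet build), you can verify the operator is present in the wheel you are running:

```python
# Sanity check: does this MXNet build expose the SyncBatchNorm operator?
# Assumption: syncbn maps to the contrib SyncBatchNorm operator in this build.
import mxnet as mx

print("mxnet version:", mx.__version__)
print("SyncBatchNorm available:", hasattr(mx.sym.contrib, "SyncBatchNorm"))
```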

lyuweiwang commented 5 years ago

I guess you mean `mxnet_cu100-1.6.0b20190820-py2.py3-none-manylinux1_x86_64.whl`. Is there a CUDA 9 version? Besides, how can I solve the problem that the process gets stuck at epoch 0, batch 20340 when I run TridentNet?

RogerChern commented 5 years ago

Here is the wheel for CUDA 9: https://github.com/TuSimple/simpledet/blob/master/doc/INSTALL.md#setup-locally-with-pre-built-wheel

RogerChern commented 5 years ago

The hang may be caused by some image not being read in correctly. Is this happening near the end of an epoch?
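
If you want to rule that out, a quick scan over the roidb is enough (a rough sketch; it assumes each record stores the image path under an "image_url" key and uses a hypothetical cache path, so adjust both to your dataset layout):

```python
# Rough sketch: find images in the roidb that cannot be decoded.
# Assumptions: the cache path below is hypothetical and the path key is
# "image_url"; adjust both to your own dataset layout.
import pickle
import cv2

with open("data/cache/your_dataset.roidb", "rb") as f:
    roidb = pickle.load(f)

unreadable = [rec["image_url"] for rec in roidb if cv2.imread(rec["image_url"]) is None]
print("unreadable images:", len(unreadable))
```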

lyuweiwang commented 5 years ago

I have cleaned my environment and switched to the CUDA 9 wheel; however, nothing changed. When I change syncbn to local BN in retina_r50v1b_nasfpn_1280_7@384_25epoch.py, a new error occurs:

```
Traceback (most recent call last):
  File "detection_train.py", line 278, in <module>
    train_net(parse_args())
  File "detection_train.py", line 259, in train_net
    profile=profile
  File "/nfs/project/lyuwei/project/simpledet/core/detection_module.py", line 1010, in fit
    self.update_metric(eval_metric, data_batch.label)
  File "/nfs/project/lyuwei/project/simpledet/core/detection_module.py", line 789, in update_metric
    self._exec_group.update_metric(eval_metric, labels, pre_sliced)
  File "/tmp-data/lyuwei/miniconda3/lib/python3.7/site-packages/mxnet/module/executor_group.py", line 640, in update_metric
    eval_metric.update_dict(labels_, preds)
  File "/tmp-data/lyuwei/miniconda3/lib/python3.7/site-packages/mxnet/metric.py", line 350, in update_dict
    metric.update_dict(labels, preds)
  File "/tmp-data/lyuwei/miniconda3/lib/python3.7/site-packages/mxnet/metric.py", line 133, in update_dict
    self.update(label, pred)
  File "/nfs/project/lyuwei/project/simpledet/models/retinanet/metric.py", line 34, in update
    pred_label = pred_label.asnumpy().astype('int32')
  File "/tmp-data/lyuwei/miniconda3/lib/python3.7/site-packages/mxnet/ndarray/ndarray.py", line 2392, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/tmp-data/lyuwei/miniconda3/lib/python3.7/site-packages/mxnet/base.py", line 254, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [17:08:44] src/ndarray/ndarray_function.cu:45: Check failed: to->type_flag_ == from.type_flag_ (0 vs. 6) : Source and target must have the same data type when copying across devices.
```
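
For reference, the failing line is the asnumpy() call in models/retinanet/metric.py (line 34 in the traceback). Below is a self-contained sketch of the kind of on-device cast that might be worth trying as a workaround (just a guess, not verified; `pred_label` here is only a stand-in for the real prediction array):

```python
import mxnet as mx

# Stand-in for the prediction NDArray handed to the metric's update()
# in models/retinanet/metric.py (line 34 in the traceback above).
pred_label = mx.nd.random.uniform(shape=(4, 100), ctx=mx.gpu(0))

# The line that fails per the traceback:
#     pred_label = pred_label.asnumpy().astype('int32')
# Unverified workaround idea: cast on the device first so the device-to-host
# copy sees matching source/target dtypes, then convert to numpy.
pred_label = pred_label.astype('float32').asnumpy().astype('int32')
```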

lyuweiwang commented 5 years ago

> The hang may be caused by some image not being read in correctly. Is this happening near the end of an epoch?

I have tested all the images; there are no corrupted or missing images. The hang does seem to happen near the end of an epoch. I have 320,000 images and use 6 images per batch on 4 GPUs, so as the code indicates, one epoch should end at batch 13,333. However, the epoch does not end until around batch 20,000.

RogerChern commented 5 years ago

We pre-flip images in the roidb, so each epoch in simpledet counts as 2 epochs. I can confirm the NASFPN issue and will look into it soon.
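
With your numbers (320,000 images, 6 images per GPU, 4 GPUs), the epoch length works out roughly like this:

```python
# Epoch-length arithmetic using the numbers from this thread.
images = 320_000
images_per_gpu = 6
gpus = 4

effective_batch = images_per_gpu * gpus              # 24 images per iteration
batches_unflipped = images / effective_batch         # ~13,333
batches_with_preflip = images * 2 / effective_batch  # ~26,667, since flipped copies are added to the roidb
print(batches_unflipped, batches_with_preflip)
```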

lyuweiwang commented 5 years ago

> We pre-flip images in the roidb, so each epoch in simpledet counts as 2 epochs. I can confirm the NASFPN issue and will look into it soon.

The problem of getting stuck at around batch 20,000 happens with tridentnet_r101v2c4_c5_multiscale_addminival_3x_fp16.py. Given your explanation, the hang really does happen near the end of an epoch. Is there any solution?