Training stop after 600epoch

williamhyin commented 2 years ago

Hi,

I am trying to train the distance_semantic_detection_motion model with the default params.yaml, but the training process always stalls after 1 epoch. No error is reported; the process simply stops making progress and GPU utilization drops to 0% on all GPUs. I am using DataParallel (DP) multi-GPU training because a single V100 GPU cannot fit a batch size of 22.

Problem:

epoch 0 | batch 0 | current lr 0.0001 | examples/s: 0.6 | loss: 215.47714 | time elapsed: 00h00m51s | time| CPU/GPU time: 8.8s/35.0s
epoch 0 | batch 300 | current lr 0.0001 | examples/s: 12.8 | loss: 20.83787 | time elapsed: 00h09m49s | time CPU/GPU time: 0.1s/529.1s
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
epoch 0 | Semantic IoU: 0.349
=> Saving semantic segmentation model weights with mean_iou of 0.349 at step 300 on 0 epoch.
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
=> Saving detection model weights with mean_AP of 0.010 at step 300 on 0 epoch.
=> meanAP per class in order: [0.03, 0.0, 0.0]
=> Detection val mAP 0.010
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
epoch 0 | Motion IoU: 0.524
=> Saving motion model weights with mean_iou of 0.524 at step 300 on 0 epoch.

williamhyin commented 2 years ago

When I press Ctrl+C to stop the process, I get the following traceback:

^CTraceback (most recent call last):
  File "./main.py", line 109, in <module>
    main()
  File "./main.py", line 93, in main
    model.distance_semantic_detection_motion_train()
  File "/workspace/WoodScape/omnidet/train_distance_semantic_detection_motion.py", line 77, in distance_semantic_detection_motion_train
    self.save_best_detection_weights()
  File "/workspace/WoodScape/omnidet/train_distance_semantic_detection.py", line 134, in save_best_detection_weights
    self.args.input_height])
  File "/root/miniconda3/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/WoodScape/omnidet/train_detection.py", line 178, in detection_val
    outputs = non_max_suppression(outputs, conf_thres=conf_thres, nms_thres=nms_thres)
  File "/workspace/WoodScape/omnidet/train_utils/detection_utils.py", line 241, in non_max_suppression
    large_overlap = bbox_iou(detections[0, :4].unsqueeze(0), detections[:, :4]) > nms_thres
  File "/workspace/WoodScape/omnidet/train_utils/detection_utils.py", line 205, in bbox_iou
    b2_area = (b2_x2 - b2_x1 + 1) * (b2_y2 - b2_y1 + 1)

Is there something wrong?
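The traceback ends inside `non_max_suppression`, during detection validation. One possible explanation, offered as an assumption and not a confirmed diagnosis, is that the NMS loop is either very slow (many low-confidence boxes early in training) or is not shrinking its candidate list, for example when a box contains NaN coordinates: a NaN box has a NaN IoU even with itself, so it never passes an `iou > nms_thres` test. The sketch below is not the WoodScape implementation. It is a plain-Python greedy NMS whose function names and default threshold values are hypothetical, written to show two guards: dropping non-finite boxes up front, and always removing the top box so that each iteration makes progress.

```python
import math


def iou(a, b):
    """IoU of two boxes (x1, y1, x2, y2), using the +1 pixel convention seen in the traceback."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1 + 1, 0) * max(iy2 - iy1 + 1, 0)
    area_a = (a[2] - a[0] + 1) * (a[3] - a[1] + 1)
    area_b = (b[2] - b[0] + 1) * (b[3] - b[1] + 1)
    return inter / (area_a + area_b - inter + 1e-16)


def greedy_nms(dets, conf_thres=0.5, nms_thres=0.4):
    """Greedy per-class NMS over detections (x1, y1, x2, y2, score, cls).

    Threshold defaults are illustrative, not taken from params.yaml.
    """
    # Guard 1: discard low-confidence detections and any with NaN/inf values.
    dets = [
        d for d in dets
        if all(math.isfinite(v) for v in d) and d[4] >= conf_thres
    ]
    dets.sort(key=lambda d: d[4], reverse=True)

    keep = []
    while dets:
        # Guard 2: the top box is always removed, so the list shrinks
        # on every iteration and the loop is guaranteed to terminate.
        top, rest = dets[0], dets[1:]
        keep.append(top)
        dets = [
            d for d in rest
            if not (d[5] == top[5] and iou(top[:4], d[:4]) > nms_thres)
        ]
    return keep


if __name__ == "__main__":
    nan = float("nan")
    sample = [
        (0, 0, 10, 10, 0.90, 0),
        (1, 1, 10, 10, 0.80, 0),    # overlaps the first box, same class
        (50, 50, 60, 60, 0.70, 1),
        (nan, 0, 10, 10, 0.95, 0),  # non-finite box, filtered out
    ]
    for det in greedy_nms(sample):
        print(det)
```

To check whether this applies to the actual run, it may help to log the number of candidate boxes entering `non_max_suppression` and whether any of them are non-finite before the validation step.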

Dorablank commented 2 years ago

Hi, were you able to solve this issue?

valeoai / WoodScape

Training stop after 600epoch #79