google-research / ssl_detection

Semi-supervised learning for object detection
Apache License 2.0

Training on a single GPU (Losses keep fluctuating and do not converge) #31

Open nuschandra opened 3 years ago

nuschandra commented 3 years ago

Hi,

I am training the Faster RCNN model on 10% of the labelled COCO data. While training with 1 GPU, the losses do not converge. Based on an earlier issue (https://github.com/google-research/ssl_detection/issues/12), I understand that with 1 GPU the batch size is limited to 1 due to tensorpack constraints, and that this may be too small for the network to train and converge. If that is the case, what are the alternatives? Is the only option to move away from tensorpack in order to use a larger batch size?
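
A common mitigation (not specific to this repo) is the linear learning-rate scaling rule: when the effective batch size drops by a factor k relative to the setting the learning rate was tuned for, divide the learning rate by roughly k and consider a longer warmup. Below is a minimal sketch with illustrative values only; the reference batch size and base learning rate are assumptions, not this repo's defaults.

```python
# Linear learning-rate scaling rule (sketch, illustrative values only).
# If the learning rate was tuned for an effective batch of 8 images
# (e.g. 8 GPUs x 1 image per GPU), a single-GPU batch of 1 should use
# roughly 1/8 of that rate.
BASE_LR = 0.01        # assumed base learning rate for the reference setup
BASE_BATCH = 8        # assumed reference effective batch size
single_gpu_batch = 1  # 1 GPU, 1 image per step under tensorpack

scaled_lr = BASE_LR * single_gpu_batch / BASE_BATCH
print(f"suggested single-GPU learning rate: {scaled_lr}")  # 0.00125
```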

Any inputs/suggestions are more than welcome as I am a bit stuck at the moment and do not have access to more than 1 GPU.

Regards, Chandra

Shuixin-Li commented 1 year ago

I have the same question, and my losses even go to NaN. What is happening? (Actually, I am not sure how to check the number of GPUs, but when I queried the GPU the machine only listed one device, so I assume I only have one GPU.)

@nuschandra, have you solved this problem? Any comments or advice are welcome (TAT).
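
As a generic diagnostic (not part of this repo's own tooling), TensorFlow's `tf.debugging.check_numerics` can be wrapped around suspect loss tensors so training fails with a named error at the first NaN/Inf instead of silently propagating it into `total_cost`. A minimal sketch; the tensor names in the comments are placeholders:

```python
import numpy as np
import tensorflow as tf

def guard(tensor, name):
    """Return `tensor`, but fail with a named error if it contains NaN or Inf."""
    return tf.debugging.check_numerics(tensor, message="NaN/Inf in " + name)

# Hypothetical usage: wrap whichever loss tensors you suspect, e.g.
#   box_loss = guard(box_loss, "fastrcnn/box_loss")
# In TF1 graph mode (as used by tensorpack) the check fires when the graph
# is executed; in TF2 eager mode it raises immediately, as demonstrated below.
if tf.executing_eagerly():
    try:
        guard(tf.constant([1.0, np.nan]), "demo_loss")
    except tf.errors.InvalidArgumentError as err:
        print("caught:", err.message)
```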

Here is part of the log:

[0621 21:52:52 @base.py:283] Epoch 142 (global_step 71000) finished, time:47.1 seconds.
[0621 21:52:52 @misc.py:109] Estimated Time Left: 1 day 12 hours 11 minutes 51 seconds
[0621 21:52:52 @monitor.py:474] GPUUtil/0: 75.174
[0621 21:52:52 @monitor.py:474] HostFreeMemory (GB): 233.03
[0621 21:52:52 @monitor.py:474] PeakMemory(MB)/gpu:0: 2020.6
[0621 21:52:52 @monitor.py:474] QueueInput/queue_size: 50
[0621 21:52:52 @monitor.py:474] Throughput (samples/sec): 10.624
[0621 21:52:52 @monitor.py:474] fastrcnn_losses/box_loss: 0.045659
[0621 21:52:52 @monitor.py:474] fastrcnn_losses/box_loss_debug: 0.045659
[0621 21:52:52 @monitor.py:474] fastrcnn_losses/box_loss_unormalized: 2.9222
[0621 21:52:52 @monitor.py:474] fastrcnn_losses/detect_empty_labels: 0
[0621 21:52:52 @monitor.py:474] fastrcnn_losses/label_loss: 0.45146
[0621 21:52:52 @monitor.py:474] fastrcnn_losses/label_metrics/accuracy: 0.92609
[0621 21:52:52 @monitor.py:474] fastrcnn_losses/label_metrics/false_negative: 1
[0621 21:52:52 @monitor.py:474] fastrcnn_losses/label_metrics/fg_accuracy: 2.3493e-37
[0621 21:52:52 @monitor.py:474] fastrcnn_losses/num_fg_label: 4.73
[0621 21:52:52 @monitor.py:474] learning_rate: 0.01
[0621 21:52:52 @monitor.py:474] mean_gt_box_area: 26681
[0621 21:52:52 @monitor.py:474] multilevel_roi_align/fpn_map_rois_to_levels/num_roi_level2: 63.574
[0621 21:52:52 @monitor.py:474] multilevel_roi_align/fpn_map_rois_to_levels/num_roi_level3: 0.1091
[0621 21:52:52 @monitor.py:474] multilevel_roi_align/fpn_map_rois_to_levels/num_roi_level4: 0.31203
[0621 21:52:52 @monitor.py:474] multilevel_roi_align/fpn_map_rois_to_levels/num_roi_level5: 0.0048982
[0621 21:52:52 @monitor.py:474] rpn_losses/box_loss: 0.033246
[0621 21:52:52 @monitor.py:474] rpn_losses/label_loss: 0.21476
[0621 21:52:52 @monitor.py:474] rpn_losses/level2/box_loss: 0.023036
[0621 21:52:52 @monitor.py:474] rpn_losses/level2/label_loss: 0.15323
[0621 21:52:52 @monitor.py:474] rpn_losses/level2/label_metrics/precision_th0.1: 0.5
[0621 21:52:52 @monitor.py:474] rpn_losses/level2/label_metrics/precision_th0.2: 0.5
[0621 21:52:52 @monitor.py:474] rpn_losses/level2/label_metrics/precision_th0.5: 0.5
[0621 21:52:52 @monitor.py:474] rpn_losses/level2/label_metrics/recall_th0.1: 0.29284
[0621 21:52:52 @monitor.py:474] rpn_losses/level2/label_metrics/recall_th0.2: 0.29284
[0621 21:52:52 @monitor.py:474] rpn_losses/level2/label_metrics/recall_th0.5: 0.29284
[0621 21:52:52 @monitor.py:474] rpn_losses/level2/num_pos_anchor: 10.744
[0621 21:52:52 @monitor.py:474] rpn_losses/level2/num_valid_anchor: 198.3
[0621 21:52:52 @monitor.py:474] rpn_losses/level3/box_loss: 0.0064434
[0621 21:52:52 @monitor.py:474] rpn_losses/level3/label_loss: 0.041088
[0621 21:52:52 @monitor.py:474] rpn_losses/level3/label_metrics/precision_th0.1: 0.5
[0621 21:52:52 @monitor.py:474] rpn_losses/level3/label_metrics/precision_th0.2: 0.5
[0621 21:52:52 @monitor.py:474] rpn_losses/level3/label_metrics/precision_th0.5: 0.5
[0621 21:52:52 @monitor.py:474] rpn_losses/level3/label_metrics/recall_th0.1: 0.39949
[0621 21:52:52 @monitor.py:474] rpn_losses/level3/label_metrics/recall_th0.2: 0.39949
[0621 21:52:52 @monitor.py:474] rpn_losses/level3/label_metrics/recall_th0.5: 0.39949
[0621 21:52:52 @monitor.py:474] rpn_losses/level3/num_pos_anchor: 3.0853
[0621 21:52:52 @monitor.py:474] rpn_losses/level3/num_valid_anchor: 46.819
[0621 21:52:52 @monitor.py:474] rpn_losses/level4/box_loss: 0.0014308
[0621 21:52:52 @monitor.py:474] rpn_losses/level4/label_loss: 0.0070757
[0621 21:52:52 @monitor.py:474] rpn_losses/level4/label_metrics/precision_th0.1: 0.5
[0621 21:52:52 @monitor.py:474] rpn_losses/level4/label_metrics/precision_th0.2: 0.5
[0621 21:52:52 @monitor.py:474] rpn_losses/level4/label_metrics/precision_th0.5: 0.5
[0621 21:52:52 @monitor.py:474] rpn_losses/level4/label_metrics/recall_th0.1: 0.44505
[0621 21:52:52 @monitor.py:474] rpn_losses/level4/label_metrics/recall_th0.2: 0.44505
[0621 21:52:52 @monitor.py:474] rpn_losses/level4/label_metrics/recall_th0.5: 0.44505
[0621 21:52:52 @monitor.py:474] rpn_losses/level4/num_pos_anchor: 0.51146
[0621 21:52:52 @monitor.py:474] rpn_losses/level4/num_valid_anchor: 8.6195
[0621 21:52:52 @monitor.py:474] rpn_losses/level5/box_loss: 0.0020379
[0621 21:52:52 @monitor.py:474] rpn_losses/level5/label_loss: 0.01154
[0621 21:52:52 @monitor.py:474] rpn_losses/level5/label_metrics/precision_th0.1: 0.5
[0621 21:52:52 @monitor.py:474] rpn_losses/level5/label_metrics/precision_th0.2: 0.5
[0621 21:52:52 @monitor.py:474] rpn_losses/level5/label_metrics/precision_th0.5: 0.5
[0621 21:52:52 @monitor.py:474] rpn_losses/level5/label_metrics/recall_th0.1: 0.38672
[0621 21:52:52 @monitor.py:474] rpn_losses/level5/label_metrics/recall_th0.2: 0.38672
[0621 21:52:52 @monitor.py:474] rpn_losses/level5/label_metrics/recall_th0.5: 0.38672
[0621 21:52:52 @monitor.py:474] rpn_losses/level5/num_pos_anchor: 1.0808
[0621 21:52:52 @monitor.py:474] rpn_losses/level5/num_valid_anchor: 2.0708
[0621 21:52:52 @monitor.py:474] rpn_losses/level6/box_loss: 0.00029753
[0621 21:52:52 @monitor.py:474] rpn_losses/level6/label_loss: 0.0018269
[0621 21:52:52 @monitor.py:474] rpn_losses/level6/label_metrics/precision_th0.1: 0.5
[0621 21:52:52 @monitor.py:474] rpn_losses/level6/label_metrics/precision_th0.2: 0.5
[0621 21:52:52 @monitor.py:474] rpn_losses/level6/label_metrics/precision_th0.5: 0.5
[0621 21:52:52 @monitor.py:474] rpn_losses/level6/label_metrics/recall_th0.1: 0.46507
[0621 21:52:52 @monitor.py:474] rpn_losses/level6/label_metrics/recall_th0.2: 0.46507
[0621 21:52:52 @monitor.py:474] rpn_losses/level6/label_metrics/recall_th0.5: 0.46507
[0621 21:52:52 @monitor.py:474] rpn_losses/level6/num_pos_anchor: 0.18924
[0621 21:52:52 @monitor.py:474] rpn_losses/level6/num_valid_anchor: 0.19345
[0621 21:52:52 @monitor.py:474] sample_fast_rcnn_targets/num_bg: 59.27
[0621 21:52:52 @monitor.py:474] sample_fast_rcnn_targets/num_fg: 4.73
[0621 21:52:52 @monitor.py:474] sample_fast_rcnn_targets/proposal_metrics/best_iou_per_gt/Merge: 0.11147
[0621 21:52:52 @monitor.py:474] sample_fast_rcnn_targets/proposal_metrics/recall_iou0.3: 0.12741
[0621 21:52:52 @monitor.py:474] sample_fast_rcnn_targets/proposal_metrics/recall_iou0.5: 0.05331
[0621 21:52:52 @monitor.py:474] total_cost: nan
[0621 21:52:52 @monitor.py:474] wd_cost: nan
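
One thing stands out in this log: every individual RPN and Fast R-CNN loss is finite, yet `wd_cost` (presumably the weight-decay regularisation term computed from the weights themselves) and hence `total_cost` are NaN. That suggests some model weights already contain NaN/Inf from an earlier step. A generic way to confirm this is to scan a saved checkpoint for non-finite variables; the sketch below assumes a standard TensorFlow checkpoint, and the path is a placeholder.

```python
import numpy as np
import tensorflow as tf

def find_nonfinite_variables(ckpt_path):
    """List checkpoint variables that contain NaN or Inf values."""
    reader = tf.train.load_checkpoint(ckpt_path)  # works for TF1-style checkpoints
    bad = []
    for name in reader.get_variable_to_shape_map():
        values = reader.get_tensor(name)
        if np.issubdtype(values.dtype, np.floating) and not np.isfinite(values).all():
            bad.append(name)
    return bad

# Placeholder path; point this at a checkpoint written around the step
# where total_cost first became NaN.
if __name__ == "__main__":
    for name in find_nonfinite_variables("/path/to/train_log/model-71000"):
        print("non-finite variable:", name)
```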

Shuixin-Li commented 1 year ago

@zizhaozhang Thank you for your hard work on this. Could you please help with this problem? Or is it simply not possible to run this on custom data with one GPU?