ultralytics / yolov3

YOLOv3 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

WARNING: non-finite loss, ending training tensor([nan, nan, 0., nan], device='cuda:0') #1539

Closed: zxin8218 closed this issue 3 years ago

zxin8218 commented 4 years ago

❔Question

I changed the network slightly, so I cannot use the pre-trained weights. How can I start training a model from scratch? I used --weights '', but after training for 2 epochs I get the error: WARNING: non-finite loss, ending training tensor([nan, nan, 0., nan], device='cuda:0'). What can I do? Thank you for your reply!

Additional context

WARNING: non-finite loss, ending training tensor([nan, nan, 0., nan], device='cuda:0')
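
For context on where this message comes from: below is a minimal sketch of the kind of finiteness guard that prints it, assuming the four reported values are the per-batch (box, obj, cls, total) loss items. The variable names and values are illustrative placeholders, not the repository's exact code.

```python
import torch

# Illustrative per-batch loss items in the order (box/GIoU, obj, cls, total),
# mirroring the values in the warning: tensor([nan, nan, 0., nan])
loss_items = torch.tensor([float('nan'), float('nan'), 0.0, float('nan')])
loss = loss_items[-1]  # total loss used for the backward pass

# Guard of the kind that produces the warning: as soon as the total loss is
# NaN or Inf, training is aborted, so the root cause (often a loss term that
# explodes early in from-scratch training) has to be addressed through the
# hyperparameters or learning-rate schedule rather than by simply retrying.
if not torch.isfinite(loss):
    print('WARNING: non-finite loss, ending training', loss_items)
```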

glenn-jocher commented 4 years ago

Ultralytics has open-sourced YOLOv5 at https://github.com/ultralytics/yolov5, featuring faster, lighter and more accurate object detection. YOLOv5 is recommended for all new projects.




[Figure: GPU speed vs COCO AP comparison of YOLOv5 models against EfficientDet.]

GPU Speed measures end-to-end time per image averaged over 5000 COCO val2017 images using a V100 GPU with batch size 32, and includes image preprocessing, PyTorch FP16 inference, postprocessing and NMS. EfficientDet data from google/automl at batch size 8.

Pretrained Checkpoints

| Model | AP val | AP test | AP50 | Speed GPU | FPS GPU | params | FLOPS |
| --- | --- | --- | --- | --- | --- | --- | --- |
| YOLOv5s | 37.0 | 37.0 | 56.2 | 2.4ms | 416 | 7.5M | 13.2B |
| YOLOv5m | 44.3 | 44.3 | 63.2 | 3.4ms | 294 | 21.8M | 39.4B |
| YOLOv5l | 47.7 | 47.7 | 66.5 | 4.4ms | 227 | 47.8M | 88.1B |
| YOLOv5x | 49.2 | 49.2 | 67.7 | 6.9ms | 145 | 89.0M | 166.4B |
| YOLOv5x + TTA | 50.8 | 50.8 | 68.9 | 25.5ms | 39 | 89.0M | 354.3B |
| YOLOv3-SPP | 45.6 | 45.5 | 65.2 | 4.5ms | 222 | 63.0M | 118.0B |

AP test denotes COCO test-dev2017 server results; all other AP results in the table denote val2017 accuracy.
All AP numbers are for single-model single-scale, without ensemble or test-time augmentation. Reproduce with: python test.py --data coco.yaml --img 640 --conf 0.001
Speed GPU measures end-to-end time per image averaged over 5000 COCO val2017 images using a GCP n1-standard-16 instance with one V100 GPU, and includes image preprocessing, PyTorch FP16 inference at --batch-size 32 --img-size 640, postprocessing and NMS. Average NMS time included here is 1-2 ms/img. Reproduce with: python test.py --data coco.yaml --img 640 --conf 0.1
All checkpoints are trained to 300 epochs with default settings and hyperparameters (no autoaugmentation). Test Time Augmentation (TTA) runs at 3 image sizes. Reproduce with: python test.py --data coco.yaml --img 832 --augment

For more information and to get started with YOLOv5 please visit https://github.com/ultralytics/yolov5. Thank you!

XxxhCU commented 3 years ago

I also encountered the same problem as you. How did you solve it in the end? Thank you!

zxin8218 commented 3 years ago

> I also encountered the same problem as you. How did you solve it in the end? Thank you!

I solved this problem by setting the hyperparameter 'giou' to 1.0 in train.py:

hyp = {'giou': 1.0,  # giou loss gain
       'cls': 37.4,  # cls loss gain
       'cls_pw': 1.0,  # cls BCELoss positive_weight
       'obj': 64.3,  # obj loss gain (*=img_size/320 if img_size != 320)
       'obj_pw': 1.0,  # obj BCELoss positive_weight
       'iou_t': 0.20,  # iou training threshold
       'lr0': 0.01,  # initial learning rate (SGD=5E-3, Adam=5E-4)
       'lrf': 0.0005,  # final learning rate (with cos scheduler)
       'momentum': 0.937,  # SGD momentum
       'weight_decay': 0.0005,  # optimizer weight decay
       'fl_gamma': 0.0,  # focal loss gamma (efficientDet default is gamma=1.5)
       'hsv_h': 0.0138,  # image HSV-Hue augmentation (fraction)
       'hsv_s': 0.678,  # image HSV-Saturation augmentation (fraction)
       'hsv_v': 0.36,  # image HSV-Value augmentation (fraction)
       'degrees': 1.98 * 0,  # image rotation (+/- deg)
       'translate': 0.05 * 0,  # image translation (+/- fraction)
       'scale': 0.05 * 0,  # image scale (+/- gain)
       'shear': 0.641 * 0}  # image shear (+/- deg)
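
For anyone else landing here: the sketch below shows, under the assumption that the trainer multiplies each raw loss term by its gain from this hyp dict (the function name and the raw per-batch values are made up for illustration), why reducing the giou gain shrinks the box-regression contribution and makes the total loss less likely to go non-finite early in from-scratch training.

```python
import torch

def scale_losses(lbox, lobj, lcls, hyp):
    # Each raw loss term is weighted by its gain from the hyp dict before the
    # terms are summed into the total loss used for backprop.
    lbox = lbox * hyp['giou']  # box / GIoU regression loss
    lobj = lobj * hyp['obj']   # objectness loss
    lcls = lcls * hyp['cls']   # classification loss
    loss = lbox + lobj + lcls
    # Same order as the warning prints: (box, obj, cls, total)
    return loss, torch.stack((lbox, lobj, lcls, loss)).detach()

# Hypothetical raw per-batch values with the reduced giou gain from above:
hyp = {'giou': 1.0, 'obj': 64.3, 'cls': 37.4}
loss, loss_items = scale_losses(torch.tensor(0.08), torch.tensor(0.02),
                                torch.tensor(0.01), hyp)
print(loss_items)  # tensor([0.0800, 1.2860, 0.3740, 1.7400])
```

The gains scale the loss terms linearly, so lowering 'giou' is a blunt but simple way to damp an exploding box term; whether 1.0 is the right value for a modified network will still depend on the data and the rest of the schedule.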
glenn-jocher commented 11 months ago

@zxin8218 setting the giou hyperparameter to 1.0 in the train.py file is a valid approach to resolving the non-finite loss issue. This adjustment can help stabilize the training process. Thank you for sharing your solution with the community! If you have any further questions or need additional assistance, feel free to ask.