meituan / YOLOv6

YOLOv6: a single-stage object detection framework dedicated to industrial applications.
GNU General Public License v3.0

Training error after modifying lr0 and lrf #547

Closed · 587687525 closed this issue 1 year ago

587687525 commented 1 year ago

Before Asking

Search before asking

Question

I have 7,500 images in my dataset. I set batch_size to 16, workers to 2, and conf_file to ../config/yolov6m_finetune.py. After training for 400 epochs, mAP@0.5 only reached 0.45, so I changed lr0 to 0.12 and lrf to 0.0032 in configs, leaving the other parameters unchanged. Once training reached epoch 6, the error below occurred consistently. It is worth mentioning that I wanted fast convergence at the start of training, which is why I swapped the values of lr0 and lrf; the error appears in the run started after that swap.
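For reference, the change described here amounts to editing the solver section of the finetune config roughly as below; the stock values lr0=0.0032 and lrf=0.12 are assumed from the default yolov6m_finetune.py and may differ in a local copy, so treat this as an illustrative sketch rather than the exact file:

```python
# configs/yolov6m_finetune.py -- solver section (illustrative sketch)
solver = dict(
    optim='SGD',
    lr_scheduler='Cosine',
    lr0=0.12,     # swapped in; the stock finetune value is assumed to be 0.0032
    lrf=0.0032,   # swapped in; the stock finetune value is assumed to be 0.12
    # ... remaining solver fields (momentum, weight_decay, warmup_*) left unchanged
)
```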

===================================================

```
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Loss.cu:129: block: [91,0,0], thread: [0,0,0] Assertion `input_val >= zero && input_val <= one` failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Loss.cu:129: block: [91,0,0], thread: [1,0,0] Assertion `input_val >= zero && input_val <= one` failed.
[... the same assertion repeats for threads [2,0,0] through [31,0,0] ...]

   6/799     2.335     1.791     11.35:  60%|█████▉    | 224/375 [02:31<01:42,
ERROR in training steps.
ERROR in training loop or eval/save model.
Traceback (most recent call last):
  File "G:\Project#Fusion\LightAC\trainer\YOLOv6\yolov6\core\engine.py", line 99, in train
    self.train_in_loop(self.epoch)
  File "G:\Project#Fusion\LightAC\trainer\YOLOv6\yolov6\core\engine.py", line 113, in train_in_loop
    self.train_in_steps(epoch_num, self.step)
  File "G:\Project#Fusion\LightAC\trainer\YOLOv6\yolov6\core\engine.py", line 142, in train_in_steps
    total_loss, loss_items = self.compute_loss(preds, targets, epoch_num, step_num)
  File "G:\Project#Fusion\LightAC\trainer\YOLOv6\yolov6\models\loss.py", line 155, in __call__
    loss_cls = self.varifocal_loss(pred_scores, target_scores, one_hot_label)
  File "G:\Environment\Anaconda\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "G:\Project#Fusion\LightAC\trainer\YOLOv6\yolov6\models\loss.py", line 198, in forward
    loss = (F.binary_cross_entropy(pred_score.float(), gt_score.float(), reduction='none') * weight).sum()
  File "G:\Environment\Anaconda\lib\site-packages\torch\nn\functional.py", line 3083, in binary_cross_entropy
    return torch._C._nn.binary_cross_entropy(input, target, weight, reduction_enum)
RuntimeError: CUDA error: device-side assert triggered

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "G:\Project#Fusion\LightAC\trainer\YOLOv6\tools\train.py", line 128, in <module>
    main(args)
  File "G:\Project#Fusion\LightAC\trainer\YOLOv6\tools\train.py", line 118, in main
    trainer.train()
  File "G:\Project#Fusion\LightAC\trainer\YOLOv6\yolov6\core\engine.py", line 106, in train
    self.train_after_loop()
  File "G:\Project#Fusion\LightAC\trainer\YOLOv6\yolov6\core\engine.py", line 297, in train_after_loop
    torch.cuda.empty_cache()
  File "G:\Environment\Anaconda\lib\site-packages\torch\cuda\memory.py", line 121, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: device-side assert triggered

Process finished with exit code -1073740791 (0xC0000409)
```
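For context, Loss.cu:129 is the input-range check inside binary_cross_entropy: every predicted score must lie in [0, 1], and a NaN produced by a diverging model fails that check, which then surfaces on CUDA as the device-side assert above. A minimal standalone sketch (not code from this repository) of the same failure mode:

```python
import torch
import torch.nn.functional as F

# A prediction outside [0, 1] (or a NaN, as a diverged model would produce)
# violates binary_cross_entropy's input check. On CPU the error message is
# explicit; on CUDA the same check fires as the device-side assert in Loss.cu.
pred = torch.tensor([1.5, 0.3, 0.7])    # 1.5 is outside the valid range
target = torch.tensor([1.0, 0.0, 1.0])

try:
    F.binary_cross_entropy(pred, target, reduction="none")
except (RuntimeError, ValueError) as e:
    print(type(e).__name__, "-", e)      # input values must lie in [0, 1]
```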

Additional

No response

587687525 commented 1 year ago

I am a Chinese developer. If replying in English is inconvenient, feel free to reply directly in Chinese. Many thanks to the Meituan team.

mtjhl commented 1 year ago

About the learning rate: lr0 is the initial learning rate, and the final learning rate is lr0 * lrf. Swapping the two makes the initial learning rate far too high. As for the error itself, we have also been working on locating its cause recently; once we find it we will share the result here.
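For readers following along, the relationship described above (start at lr0, decay to lr0 * lrf) is what the Cosine lr_scheduler produces. The lambda below is a sketch of the usual YOLOv5-style cosine form that YOLOv6's scheduler follows, not a verbatim copy of the repository code:

```python
import math

def cosine_lr(epoch: int, epochs: int, lr0: float, lrf: float) -> float:
    """Cosine decay from lr0 at epoch 0 down to lr0 * lrf at the final epoch."""
    lf = ((1 - math.cos(epoch * math.pi / epochs)) / 2) * (lrf - 1) + 1
    return lr0 * lf

# Stock finetune values: starts at 0.0032, ends at 0.0032 * 0.12
print(cosine_lr(0, 400, 0.0032, 0.12))    # 0.0032
print(cosine_lr(400, 400, 0.0032, 0.12))  # ~0.000384
# Swapped values: training starts at 0.12, i.e. 37.5x the stock initial LR
print(cosine_lr(0, 400, 0.12, 0.0032))    # 0.12
```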

587687525 commented 1 year ago

Understood. I felt mAP was climbing very slowly during training, so I tried raising the learning rate to speed things up.

Chilicyy commented 1 year ago

@587687525 A learning rate that is too high can easily make training unstable; we recommend lowering it.
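In practice that means keeping lr0 at, or below, the stock finetune value rather than raising it. An illustrative edit (values assumed from the default finetune config):

```python
# configs/yolov6m_finetune.py -- conservative learning-rate settings (illustrative)
solver = dict(
    optim='SGD',
    lr_scheduler='Cosine',
    lr0=0.0032,   # stock finetune initial LR; try e.g. 0.0016 if training is still unstable
    lrf=0.12,     # final LR = lr0 * lrf = 0.000384
    # ... remaining solver fields left unchanged
)
```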