meituan / YOLOv6

YOLOv6: a single-stage object detection framework dedicated to industrial applications.
GNU General Public License v3.0

Training error after modifying lr0 and lrf #547

Closed · 587687525 closed this issue 1 year ago

587687525 commented 1 year ago

Before Asking

Search before asking

Question

I have 7,500 images in my dataset. I set batch_size to 16, workers to 2, and conf_file to ../config/yolov6m_finetune.py. After training for 400 epochs, mAP@0.5 only reached 0.45, so I changed lr0 to 0.12 and lrf to 0.0032 in configs, leaving the other parameters unchanged. Once training reached epoch 6, the error below occurred consistently. It is worth mentioning that I wanted fast convergence at the start of training, which is why I swapped the values of lr0 and lrf; the error appears in the run started after that swap.
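For reference, the change described here amounts to editing the solver section of the finetune config roughly as below; the stock values lr0=0.0032 and lrf=0.12 are assumed from the default yolov6m_finetune.py and may differ in a local copy, so treat this as an illustrative sketch rather than the exact file:

```python
# configs/yolov6m_finetune.py -- solver section (illustrative sketch)
solver = dict(
    optim='SGD',
    lr_scheduler='Cosine',
    lr0=0.12,     # swapped in; the stock finetune value is assumed to be 0.0032
    lrf=0.0032,   # swapped in; the stock finetune value is assumed to be 0.12
    # ... remaining solver fields (momentum, weight_decay, warmup_*) left unchanged
)
```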

===================================================

```
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Loss.cu:129: block: [91,0,0], thread: [0,0,0] Assertion `input_val >= zero && input_val <= one` failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Loss.cu:129: block: [91,0,0], thread: [1,0,0] Assertion `input_val >= zero && input_val <= one` failed.
[... the same assertion repeats for threads [2,0,0] through [31,0,0] ...]

   6/799     2.335     1.791     11.35:  60%|█████▉    | 224/375 [02:31<01:42,
ERROR in training steps.
ERROR in training loop or eval/save model.
Traceback (most recent call last):
  File "G:\Project#Fusion\LightAC\trainer\YOLOv6\yolov6\core\engine.py", line 99, in train
    self.train_in_loop(self.epoch)
  File "G:\Project#Fusion\LightAC\trainer\YOLOv6\yolov6\core\engine.py", line 113, in train_in_loop
    self.train_in_steps(epoch_num, self.step)
  File "G:\Project#Fusion\LightAC\trainer\YOLOv6\yolov6\core\engine.py", line 142, in train_in_steps
    total_loss, loss_items = self.compute_loss(preds, targets, epoch_num, step_num)
  File "G:\Project#Fusion\LightAC\trainer\YOLOv6\yolov6\models\loss.py", line 155, in __call__
    loss_cls = self.varifocal_loss(pred_scores, target_scores, one_hot_label)
  File "G:\Environment\Anaconda\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "G:\Project#Fusion\LightAC\trainer\YOLOv6\yolov6\models\loss.py", line 198, in forward
    loss = (F.binary_cross_entropy(pred_score.float(), gt_score.float(), reduction='none') * weight).sum()
  File "G:\Environment\Anaconda\lib\site-packages\torch\nn\functional.py", line 3083, in binary_cross_entropy
    return torch._C._nn.binary_cross_entropy(input, target, weight, reduction_enum)
RuntimeError: CUDA error: device-side assert triggered

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "G:\Project#Fusion\LightAC\trainer\YOLOv6\tools\train.py", line 128, in <module>
    main(args)
  File "G:\Project#Fusion\LightAC\trainer\YOLOv6\tools\train.py", line 118, in main
    trainer.train()
  File "G:\Project#Fusion\LightAC\trainer\YOLOv6\yolov6\core\engine.py", line 106, in train
    self.train_after_loop()
  File "G:\Project#Fusion\LightAC\trainer\YOLOv6\yolov6\core\engine.py", line 297, in train_after_loop
    torch.cuda.empty_cache()
  File "G:\Environment\Anaconda\lib\site-packages\torch\cuda\memory.py", line 121, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: device-side assert triggered

Process finished with exit code -1073740791 (0xC0000409)
```
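For context, Loss.cu:129 is the input-range check inside binary_cross_entropy: every predicted score must lie in [0, 1], and a NaN produced by a diverging model fails that check, which then surfaces on CUDA as the device-side assert above. A minimal standalone sketch (not code from this repository) of the same failure mode:

```python
import torch
import torch.nn.functional as F

# A prediction outside [0, 1] (or a NaN, as a diverged model would produce)
# violates binary_cross_entropy's input check. On CPU the error message is
# explicit; on CUDA the same check fires as the device-side assert in Loss.cu.
pred = torch.tensor([1.5, 0.3, 0.7])    # 1.5 is outside the valid range
target = torch.tensor([1.0, 0.0, 1.0])

try:
    F.binary_cross_entropy(pred, target, reduction="none")
except (RuntimeError, ValueError) as e:
    print(type(e).__name__, "-", e)      # input values must lie in [0, 1]
```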

Additional

No response

587687525 commented 1 year ago

I am a Chinese developer. If replying in English is inconvenient, feel free to reply directly in Chinese. Many thanks to the Meituan team.

mtjhl commented 1 year ago

About the learning rate: lr0 is the initial learning rate, and the final learning rate is lr0 * lrf. Swapping the two makes the initial learning rate far too high. As for the error itself, we have also been working on locating its cause recently; once we find it we will share the result here.
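For readers following along, the relationship described above (start at lr0, decay to lr0 * lrf) is what the Cosine lr_scheduler produces. The lambda below is a sketch of the usual YOLOv5-style cosine form that YOLOv6's scheduler follows, not a verbatim copy of the repository code:

```python
import math

def cosine_lr(epoch: int, epochs: int, lr0: float, lrf: float) -> float:
    """Cosine decay from lr0 at epoch 0 down to lr0 * lrf at the final epoch."""
    lf = ((1 - math.cos(epoch * math.pi / epochs)) / 2) * (lrf - 1) + 1
    return lr0 * lf

# Stock finetune values: starts at 0.0032, ends at 0.0032 * 0.12
print(cosine_lr(0, 400, 0.0032, 0.12))    # 0.0032
print(cosine_lr(400, 400, 0.0032, 0.12))  # ~0.000384
# Swapped values: training starts at 0.12, i.e. 37.5x the stock initial LR
print(cosine_lr(0, 400, 0.12, 0.0032))    # 0.12
```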

587687525 commented 1 year ago

Understood. I felt mAP was climbing very slowly during training, so I tried raising the learning rate to speed things up.

Chilicyy commented 1 year ago

@587687525 A learning rate that is too high can easily make training unstable; we recommend lowering it.
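In practice that means keeping lr0 at, or below, the stock finetune value rather than raising it. An illustrative edit (values assumed from the default finetune config):

```python
# configs/yolov6m_finetune.py -- conservative learning-rate settings (illustrative)
solver = dict(
    optim='SGD',
    lr_scheduler='Cosine',
    lr0=0.0032,   # stock finetune initial LR; try e.g. 0.0016 if training is still unstable
    lrf=0.12,     # final LR = lr0 * lrf = 0.000384
    # ... remaining solver fields left unchanged
)
```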