meituan / YOLOv6

YOLOv6: a single-stage object detection framework dedicated to industrial applications.
GNU General Public License v3.0

Training Failure #535

Open MedImam opened 1 year ago

MedImam commented 1 year ago

Question

When I run training for 10 epochs, I get this error at epoch 5:

Epoch iou_loss dfl_loss cls_loss

0%| | 0/1566 [00:00<?, ?it/s]
     5/9    0.5002         0     1.839:   0%| | 0/1566 [00:00<?, ?it/s]
     5/9    0.5002         0     1.839:   0%| | 1/1566 [00:00<10:14,
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Loss.cu:115: block: [29,0,0], thread: [0,0,0] Assertion `input_val >= zero && input_val <= one` failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Loss.cu:115: block: [29,0,0], thread: [1,0,0] Assertion `input_val >= zero && input_val <= one` failed.

     5/9    0.5002         0     1.839:   0%| | 1/1566 [00:01<26:59,
ERROR in training steps.
ERROR in training loop or eval/save model.
Traceback (most recent call last):
  File "C:\Users\MohamedIMAM\notebooks\Yolov6_test\YOLOv6\yolov6\core\engine.py", line 99, in train
    self.train_in_loop(self.epoch)
  File "C:\Users\MohamedIMAM\notebooks\Yolov6_test\YOLOv6\yolov6\core\engine.py", line 113, in train_in_loop
    self.train_in_steps(epoch_num, self.step)
  File "C:\Users\MohamedIMAM\notebooks\Yolov6_test\YOLOv6\yolov6\core\engine.py", line 142, in train_in_steps
    total_loss, loss_items = self.compute_loss(preds, targets, epoch_num, step_num)
  File "C:\Users\MohamedIMAM\notebooks\Yolov6_test\YOLOv6\yolov6\models\loss.py", line 112, in __call__
    loss_iou, loss_dfl = self.bbox_loss(pred_distri, pred_bboxes, anchor_points_s, target_bboxes,
  File "C:\Users\MohamedIMAM\anaconda3\envs\pytorch_env\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\MohamedIMAM\notebooks\Yolov6_test\YOLOv6\yolov6\models\loss.py", line 167, in forward
    if num_pos > 0:
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\MohamedIMAM\notebooks\Yolov6_test\YOLOv6\tools\train.py", line 126, in <module>
    main(args)
  File "C:\Users\MohamedIMAM\notebooks\Yolov6_test\YOLOv6\tools\train.py", line 116, in main
    trainer.train()
  File "C:\Users\MohamedIMAM\notebooks\Yolov6_test\YOLOv6\yolov6\core\engine.py", line 106, in train
    self.train_after_loop()
  File "C:\Users\MohamedIMAM\notebooks\Yolov6_test\YOLOv6\yolov6\core\engine.py", line 297, in train_after_loop
    torch.cuda.empty_cache()
  File "C:\Users\MohamedIMAM\anaconda3\envs\pytorch_env\lib\site-packages\torch\cuda\memory.py", line 114, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Additional

No response

mtjhl commented 1 year ago

Sorry for the inconvenience, we will try to reproduce this problem and fix it.

guoshuhong commented 1 year ago

I have the same problem:

   (act): SiLU(inplace=True)
  )
)
(cls_preds): ModuleList(
  (0): Conv2d(32, 80, kernel_size=(1, 1), stride=(1, 1))
  (1): Conv2d(64, 80, kernel_size=(1, 1), stride=(1, 1))
  (2): Conv2d(128, 80, kernel_size=(1, 1), stride=(1, 1))
)
(reg_preds): ModuleList(
  (0): Conv2d(32, 4, kernel_size=(1, 1), stride=(1, 1))
  (1): Conv2d(64, 4, kernel_size=(1, 1), stride=(1, 1))
  (2): Conv2d(128, 4, kernel_size=(1, 1), stride=(1, 1))
)

  )
)

Training start...

 Epoch  iou_loss  dfl_loss  cls_loss

  0%|          | 0/14786 [00:00<?, ?it/s]
/home/shuhong/miniconda3/envs/torch112/lib/python3.8/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2894.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
../aten/src/ATen/native/cuda/Loss.cu:118: operator(): block: [4,0,0], thread: [32,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:118: operator(): block: [4,0,0], thread: [33,0,0] Assertion `input_val >= zero && input_val <= one` failed.
[... the same assertion repeats for threads 34 through 63 ...]
  0%|          | 0/14786 [00:01<?, ?it/s]

guoshuhong commented 1 year ago

ERROR in training steps.
ERROR in training loop or eval/save model.
Traceback (most recent call last):
  File "/home/shuhong/work/YOLOv6/yolov6/core/engine.py", line 99, in train
    self.train_in_loop(self.epoch)
  File "/home/shuhong/work/YOLOv6/yolov6/core/engine.py", line 113, in train_in_loop
    self.train_in_steps(epoch_num, self.step)
  File "/home/shuhong/work/YOLOv6/yolov6/core/engine.py", line 142, in train_in_steps
    total_loss, loss_items = self.compute_loss(preds, targets, epoch_num, step_num)
  File "/home/shuhong/work/YOLOv6/yolov6/models/loss.py", line 106, in __call__
    loss_cls = self.varifocal_loss(pred_scores, target_scores, one_hot_label)
  File "/home/shuhong/miniconda3/envs/torch112/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/shuhong/work/YOLOv6/yolov6/models/loss.py", line 149, in forward
    loss = (F.binary_cross_entropy(pred_score.float(), gt_score.float(), reduction='none') * weight).sum()
  File "/home/shuhong/miniconda3/envs/torch112/lib/python3.8/site-packages/torch/nn/functional.py", line 3083, in binary_cross_entropy
    return torch._C._nn.binary_cross_entropy(input, target, weight, reduction_enum)
RuntimeError: CUDA error: device-side assert triggered

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tools/train.py", line 126, in <module>
    main(args)
  File "tools/train.py", line 116, in main
    trainer.train()
  File "/home/shuhong/work/YOLOv6/yolov6/core/engine.py", line 106, in train
    self.train_after_loop()
  File "/home/shuhong/work/YOLOv6/yolov6/core/engine.py", line 297, in train_after_loop
    torch.cuda.empty_cache()
  File "/home/shuhong/miniconda3/envs/torch112/lib/python3.8/site-packages/torch/cuda/memory.py", line 121, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: device-side assert triggered
Exception raised from createEvent at ../aten/src/ATen/cuda/CUDAEvent.h:166 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7f846762920e in /home/shuhong/miniconda3/envs/torch112/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14675d (0x7f84a937d75d in /home/shuhong/miniconda3/envs/torch112/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: <unknown function> + 0x149a9e (0x7f84a9380a9e in /home/shuhong/miniconda3/envs/torch112/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: <unknown function> + 0x4669f8 (0x7f84b695a9f8 in /home/shuhong/miniconda3/envs/torch112/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f84676107a5 in /home/shuhong/miniconda3/envs/torch112/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #5: <unknown function> + 0x3628c5 (0x7f84b68568c5 in /home/shuhong/miniconda3/envs/torch112/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x67ca08 (0x7f84b6b70a08 in /home/shuhong/miniconda3/envs/torch112/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x2d5 (0x7f84b6b70dd5 in /home/shuhong/miniconda3/envs/torch112/lib/python3.8/site-packages/torch/lib/libtorch_python.so)

Chilicyy commented 1 year ago

@MedImam @guoshuhong Hi, can you provide us with your training commands? Also, if you git pull the latest code from the master branch, do the same error messages still happen?

guoshuhong commented 1 year ago

python tools/train.py --batch 16 --conf configs/yolov6n.py --data data/coco.yaml --epoch 400 --name yolov6n_coco

@MedImam @guoshuhong Hi, can you provide us with your training commands? Also, if you git pull the latest code from the master branch, do the same error messages still happen?

I just tried to train COCO with a single GPU:

python tools/train.py --batch 16 --conf configs/yolov6n.py --data data/coco.yaml --epoch 400 --name yolov6n_coco

guoshuhong commented 1 year ago

@MedImam @guoshuhong Hi, can you provide us with your training commands? Also, if you git pull the latest code from the master branch, do the same error messages still happen?

Suddenly, this problem disappeared; I don't know why.

MedImam commented 1 year ago

@MedImam @guoshuhong Hi, can you provide us with your training commands? Also, if you git pull the latest code from the master branch, do the same error messages still happen?

!python tools/train.py --batch 1 --epochs 10 --conf configs/yolov6n_finetune.py --data data/dataset.yaml --device 0

MedImam commented 1 year ago

This is the training command that I'm using:

!python tools/train.py --batch 1 --epochs 10 --conf configs/yolov6n_finetune.py --data data/dataset.yaml --device 0

Chilicyy commented 1 year ago

@MedImam It may be because the batch size you are using is so small that it doesn't match the learning rate. Can you try increasing your batch size as much as possible and training again with the latest code from the main branch?
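
As a rough illustration of why a batch size of 1 can clash with a learning rate tuned for larger batches, here is the common linear-scaling rule of thumb; this is only a generic heuristic with made-up numbers, not necessarily how YOLOv6 adjusts lr0 internally:

```python
def scaled_lr(base_lr: float, base_batch_size: int, batch_size: int) -> float:
    """Linear learning-rate scaling: halve the batch size, halve the learning rate."""
    return base_lr * batch_size / base_batch_size

# Hypothetical numbers: a config tuned with lr0=0.02 at batch size 32,
# run at batch size 1, would need a much smaller learning rate to stay stable.
print(scaled_lr(0.02, 32, 1))  # 0.000625
```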

587687525 commented 1 year ago

I have 7,500 images as the dataset. I set the batch size to 16, workers to 2, and the conf file to ../config/yolov6m_finetune.py. After training 400 epochs, I found that mAP@0.5 only reached 0.45, so I changed lr0 to 0.12 and lrf to 0.0032 in the config. As before, when the epoch reached 6, the above error occurred consistently. It is worth mentioning that I wanted the model to converge quickly at the beginning of training, so I swapped the values of lr0 and lrf, and this error occurred again after retraining.

haritsahm commented 1 year ago

This error also happens when I start the QAT distill training.

Command

CUDA_LAUNCH_BLOCKING=1 PYTHONWARNINGS="ignore" python tools/train.py --data data/custom-data.yaml --name yolov6s-repopt-custom-data-qat --conf configs/repopt/yolov6s_opt_qat-custom-data.py --quant --distill --distill_feat --batch 32 --workers 14 --epochs 10 --teacher_model_path runs/train/yolov6s-repopt-custom-data/weights/best_ckpt.pt --device 0 --check-images --check-labels

Log

  0%|          | 0/2003 [00:00<?, ?it/s]                                                                                                                                                                    /opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:104: operator(): block: [138,0,0], thread: [0,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:104: operator(): block: [138,0,0], thread: [1,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:104: operator(): block: [138,0,0], thread: [2,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:104: operator(): block: [138,0,0], thread: [3,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:104: operator(): block: [138,0,0], thread: [4,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:104: operator(): block: [138,0,0], thread: [5,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:104: operator(): block: [138,0,0], thread: [6,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:104: operator(): block: [138,0,0], thread: [7,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:104: operator(): block: [138,0,0], thread: [8,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:104: operator(): block: [138,0,0], thread: [9,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:104: operator(): block: [138,0,0], thread: [10,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:104: operator(): block: [138,0,0], thread: [11,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:104: operator(): block: [138,0,0], thread: [12,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:104: operator(): block: [138,0,0], thread: [13,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:104: operator(): block: [138,0,0], thread: [14,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:104: operator(): block: [138,0,0], thread: [15,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:104: operator(): block: [138,0,0], thread: [16,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:104: operator(): block: [138,0,0], thread: [17,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:104: operator(): block: [138,0,0], thread: [18,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:104: operator(): block: [138,0,0], thread: [19,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:104: operator(): block: [138,0,0], thread: [20,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:104: operator(): block: [138,0,0], thread: [21,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:104: operator(): block: [138,0,0], thread: [22,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:104: operator(): block: [138,0,0], thread: [23,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:104: operator(): block: [138,0,0], thread: [24,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:104: operator(): block: [138,0,0], thread: [25,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:104: operator(): block: [138,0,0], thread: [26,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:104: operator(): block: [138,0,0], thread: [27,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:104: operator(): block: [138,0,0], thread: [28,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:104: operator(): block: [138,0,0], thread: [29,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:104: operator(): block: [138,0,0], thread: [30,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:104: operator(): block: [138,0,0], thread: [31,0,0] Assertion `input_val >= zero && input_val <= one` failed.
  0%|          | 0/2003 [00:01<?, ?it/s]                                                                                                                                                                    
WARNING: Logging before flag parsing goes to stderr.
E1020 02:01:00.629523 139757667850048 engine.py:116] ERROR in training steps.
E1020 02:01:00.629653 139757667850048 engine.py:103] ERROR in training loop or eval/save model.
Traceback (most recent call last):
  File "/mnt/raid1/haritsah/projects/custom-object/object_detection/yolov6-trainer/yolov6/core/engine.py", line 99, in train
    self.train_in_loop(self.epoch)
  File "/mnt/raid1/haritsah/projects/custom-object/object_detection/yolov6-trainer/yolov6/core/engine.py", line 113, in train_in_loop
    self.train_in_steps(epoch_num, self.step)
  File "/mnt/raid1/haritsah/projects/custom-object/object_detection/yolov6-trainer/yolov6/core/engine.py", line 139, in train_in_steps
    total_loss, loss_items = self.compute_loss_distill(preds, t_preds, s_featmaps, t_featmaps, targets, \
  File "/mnt/raid1/haritsah/projects/custom-object/object_detection/yolov6-trainer/yolov6/models/loss_distill.py", line 124, in __call__
    loss_cls = self.varifocal_loss(pred_scores, target_scores, one_hot_label)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1111, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/raid1/haritsah/projects/custom-object/object_detection/yolov6-trainer/yolov6/models/loss_distill.py", line 219, in forward
    loss = (F.binary_cross_entropy(pred_score.float(), gt_score.float(), reduction='none') * weight).sum()
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 3030, in binary_cross_entropy
    return torch._C._nn.binary_cross_entropy(input, target, weight, reduction_enum)
RuntimeError: CUDA error: device-side assert triggered

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tools/train.py", line 126, in <module>
    main(args)
  File "tools/train.py", line 116, in main
    trainer.train()
  File "/mnt/raid1/haritsah/projects/custom-object/object_detection/yolov6-trainer/yolov6/core/engine.py", line 106, in train
    self.train_after_loop()
  File "/mnt/raid1/haritsah/projects/custom-object/object_detection/yolov6-trainer/yolov6/core/engine.py", line 297, in train_after_loop
    torch.cuda.empty_cache()
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/memory.py", line 114, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: device-side assert triggered
Exception raised from createEvent at /opt/pytorch/pytorch/aten/src/ATen/cuda/CUDAEvent.h:174 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7f1b7319e1dc in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe379dd (0x7f1b740649dd in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xe3b426 (0x7f1b74068426 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0x43d67c (0x7f1ba7d1367c in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f1b73187035 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #5: <unknown function> + 0x33a6c9 (0x7f1ba7c106c9 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x646f92 (0x7f1ba7f1cf92 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x2f5 (0x7f1ba7f1d315 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #20: __libc_start_main + 0xf3 (0x7f1bde2800b3 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

haritsahm commented 1 year ago

Quick update: the preds output from preds, s_featmaps = self.model(images) is a tensor of NaNs. Any idea how this might happen?
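
To narrow down where the NaNs first show up, one option is a small finiteness check right after the forward pass. A minimal sketch: check_finite is a hypothetical helper, and the usage lines only assume the names from the comment above.

```python
import torch

def check_finite(tensors, where=""):
    """Raise as soon as any tensor contains NaN or Inf values."""
    if isinstance(tensors, torch.Tensor):
        tensors = [tensors]
    for i, t in enumerate(tensors):
        if not torch.isfinite(t).all():
            raise RuntimeError(f"non-finite values in tensor {i} {where}")

# Usage inside the training step (names follow the comment above):
#   preds, s_featmaps = self.model(images)
#   check_finite(preds, "in student preds")
```

torch.autograd.set_detect_anomaly(True) can also help trace the first operation that produces NaNs during backward, at a noticeable speed cost.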

AhmedShahhatAl commented 1 year ago

Did anyone solve this problem? I am getting the same error, which seems to come from the loss function. Running with CUDA_LAUNCH_BLOCKING=1 showed me where the error comes from:

return torch._C._nn.binary_cross_entropy(input, target, weight, reduction_enum)
RuntimeError: CUDA error: device-side assert triggered

The weird thing is that if I use the CPU I do not get this error, but with the GPU I do. My training command is:

python train.py --batch 8 --conf configs/yolov6s_finetune.py --data-path data/dataset.yaml --device 0 --epochs 2 --eval-interval 2

Note that I have tried different batch sizes and the error is still there.
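
For what it's worth, the CPU/GPU difference can be reproduced outside YOLOv6: F.binary_cross_entropy requires its input to lie in [0, 1], and out-of-range values produce a readable RuntimeError on CPU but the device-side assert above on CUDA. A minimal standalone sketch (not YOLOv6 code):

```python
import torch
import torch.nn.functional as F

target = torch.tensor([0.0, 1.0, 0.5])
bad_input = torch.tensor([0.2, 1.3, -0.1])  # predictions outside [0, 1]

# On CPU the out-of-range input is rejected with an informative RuntimeError.
try:
    F.binary_cross_entropy(bad_input, target)
except RuntimeError as e:
    print("CPU error:", e)

# On CUDA the same call trips the kernel assertion
# `input_val >= zero && input_val <= one` seen in the logs above; running with
# CUDA_LAUNCH_BLOCKING=1 makes the failure surface at the actual offending call.
if torch.cuda.is_available():
    F.binary_cross_entropy(bad_input.cuda(), target.cuda())
```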

haritsahm commented 1 year ago

@AhmedShahhatAl I'm not sure about it, but I was able to solve it by adding an extra class. My model should detect 1 object, but I added an extra class to make it a 2-class object detection problem (see my related issue above). How many classes do you have?

AhmedShahhatAl commented 1 year ago

@haritsahm I have 2 classes. Did you just modify the number of classes without adding a label for it, or how exactly did you do that?

haritsahm commented 1 year ago

@AhmedShahhatAl I didn't modify the code. I use real labels, since that will affect the detection result.

altefwan commented 11 months ago

I'm running a dual-gpu setup. I encountered this issue on my 1650 super but was able to successfully train on my 1070.

I'm following this guide: https://github.com/meituan/YOLOv6/blob/main/docs/tutorial_repopt.md

CuriousTank commented 7 months ago

Maybe you set the wrong number of classes in dataset.yaml, or the length of the class-names list does not equal the number of classes. After fixing that, my training script ran successfully.
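
A quick sanity check along these lines, as a minimal sketch; it assumes a YOLO-style dataset.yaml with nc and names keys, and the file path is just an example:

```python
import yaml  # pyyaml

# Minimal sketch: make sure the class count matches the class-name list.
with open("data/dataset.yaml") as f:
    cfg = yaml.safe_load(f)

nc, names = cfg["nc"], cfg["names"]
assert len(names) == nc, f"nc={nc}, but {len(names)} class names are listed"
print(f"OK: nc={nc} matches the length of the names list")
```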

du-atavi commented 6 months ago

I encountered the same issue and finally fixed it by adding target_scores = torch.clamp(target_scores, min=0, max=1) after line 81 in yolov6/assigners/tal_assigner.py. It took me a while to figure out, but somehow the target_scores values were out of bounds for my training dataset.
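
For anyone applying the same workaround, here is the idea sketched in isolation; the exact insertion point inside tal_assigner.py may differ between versions:

```python
import torch

def clamp_target_scores(target_scores: torch.Tensor) -> torch.Tensor:
    """Clamp the assigner's soft targets into [0, 1] so that the downstream
    F.binary_cross_entropy call never sees out-of-range values."""
    return torch.clamp(target_scores, min=0.0, max=1.0)

# e.g. right after target_scores is computed in the assigner:
#   target_scores = clamp_target_scores(target_scores)
```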

Yahya-Younes commented 5 months ago

I encounter the same error while training a YOLOv6-L model on 10k images, with batch size 20 or 32, on GPU. At the very same epoch (24) it fails training and shows this error: Assertion target_val >= zero && target_val <= one failed.

24/499   0.001991    0.1656    0.2771    0.4584:  16%|█▌        | 81/500 [00:19<01:41,  4.11it/s

ERROR in training steps.

ERROR in training loop or eval/save model.... I thought it was due to some non-normalized bounding boxes, but I checked all of them with a script and they were all fine.
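
For reference, a minimal sketch of such a label check; it assumes YOLO-format .txt labels with one "class x_center y_center width height" row per object, all coordinates normalized to [0, 1], and the directory path is just an example:

```python
from pathlib import Path

def find_unnormalized_boxes(label_dir: str):
    """Return (file, line number, values) for every box with a coordinate outside [0, 1]."""
    bad = []
    for path in Path(label_dir).glob("*.txt"):
        for line_no, line in enumerate(path.read_text().splitlines(), start=1):
            parts = line.split()
            if len(parts) != 5:
                continue
            coords = [float(v) for v in parts[1:]]
            if any(c < 0.0 or c > 1.0 for c in coords):
                bad.append((path.name, line_no, coords))
    return bad

print(find_unnormalized_boxes("labels/train"))
```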

isHarryh commented 1 month ago

I encountered the same error, and I SOLVED this problem. The following information may be helpful for you.

Problem Reproduction

When I use CUDA to train, I got this stacktrace:

Traceback (most recent call last):
  File "tools/train.py", line 145, in <module>
    main(args)
  File "tools/train.py", line 135, in main
    trainer.train()
  File "......\yolov6\core\engine.py", line 129, in train
    self.train_after_loop()
  File "......\yolov6\core\engine.py", line 358, in train_after_loop
    torch.cuda.empty_cache()
  File "......\lib\site-packages\torch\cuda\memory.py", line 125, in empty_cache 
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: device-side assert triggered

Every time my training reached a certain epoch (say, epoch 4), this error occurred.

Then I tried training on the CPU to see the detailed error message. This way, I learned that an IndexError caused the failure: one label in my dataset has class ID 33 (IDs start from 0, so the maximum should be 32), while I only have 33 classes.
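
A small scan over the label files can catch that kind of out-of-range class ID before training starts. A minimal sketch, assuming YOLO-format label files whose first column is the class ID, with nc taken from dataset.yaml and the directory path only an example:

```python
from pathlib import Path

def find_out_of_range_ids(label_dir: str, nc: int):
    """Return (file, class_id) pairs whose class ID is not in 0 .. nc-1."""
    offenders = []
    for path in Path(label_dir).glob("*.txt"):
        for line in path.read_text().splitlines():
            if not line.strip():
                continue
            cls_id = int(float(line.split()[0]))
            if cls_id < 0 or cls_id >= nc:
                offenders.append((path.name, cls_id))
    return offenders

# With 33 classes, any label with ID 33 or higher (or below 0) is invalid.
print(find_out_of_range_ids("labels/train", nc=33))
```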

My Solution