meituan / YOLOv6

YOLOv6: a single-stage object detection framework dedicated to industrial applications.
GNU General Public License v3.0

Training failed using yolov6l on 1GPU, Assertion `target_val >= zero && target_val <= one` failed, Data is verified but still training fails #1038

Open Yahya-Younes opened 4 months ago

Yahya-Younes commented 4 months ago


Question

I am training yolov6l, cloned from this repo, on my custom dataset on a single GPU for 350 or 500 epochs with different batch sizes, but each time training fails with the error below. When I resume, it continues for a while and then stops again, each time completing fewer epochs, until around epoch 105 it cannot continue at all.

I verified my data and labels; they are normalized. The same GPU trains YOLOv5 and YOLOv8 without issues, so I don't know what the problem is here!

I have another question, please: can we do early stopping in YOLOv6, like the `patience` parameter in YOLOv5, for example?

Thank you so much for your help! Output when I launch training: image (1)

When I resume training:

```
img record infomation path is: ../dataset/images/.train_cache.json
Train: Final numbers of valid images: 10000 / labels: 10000. 0.6s for dataset initialization.
img record infomation path is: ../dataset/images/.validation_cache.json
Convert to COCO format: 100%|██████████| 1036/1036 [00:00<00:00, 5118.47it/s]
Convert to COCO format finished. Resutls saved in ../dataset/annotations/instances_validation.json
Val: Final numbers of valid images: 1036 / labels: 1036. 0.5s for dataset initialization.
Training start...
```

```
 Epoch        lr  iou_loss  dfl_loss  cls_loss
105/349  0.006246    0.1285    0.2691    0.3886:  24%|██▍ | 96/400 [00:37<01:45, 2.88it/s]
../aten/src/ATen/native/cuda/Loss.cu:95: operator(): block: [12844,0,0], thread: [32,0,0] Assertion `target_val >= zero && target_val <= one` failed.
105/349  0.006246    0.1285    0.2691    0.3886:  24%|██▍ | 96/400 [00:37<01:58, 2.57it/s]
ERROR in training steps.
ERROR in training loop or eval/save model.
Traceback (most recent call last):
  File "/partage//****//app/YOLOv6/yolov6/core/engine.py", line 121, in train
    self.train_one_epoch(self.epoch)
  File "/partage/*/***//app/YOLOv6/yolov6/core/engine.py", line 135, in train_one_epoch
    self.train_in_steps(epoch_num, self.step)
  File "/partage/**/**/****/app/YOLOv6/yolov6/core/engine.py", line 169, in train_in_steps
    total_loss, loss_items = self.compute_loss(preds, targets, epoch_num, step_num,
  File "/partage//***/**/app/YOLOv6/yolov6/models/losses/loss.py", line 163, in __call__
    loss_cls = self.varifocal_loss(pred_scores, target_scores, one_hot_label)
  File "/home/***/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home//.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "//*/***/*/app/YOLOv6/yolov6/models/losses/loss.py", line 209, in forward
    loss = (F.binary_cross_entropy(pred_score.float(), gt_score.float(), reduction='none') * weight).sum()
  File "/home/*/.local/lib/python3.10/site-packages/torch/nn/functional.py", line 3127, in binary_cross_entropy
    return torch._C._nn.binary_cross_entropy(input, target, weight, reduction_enum)
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/partage/*/**/****/app/YOLOv6/tools/train.py", line 143, in <module>
```
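For context, the assert comes from `F.binary_cross_entropy`, which requires every target value to lie in [0, 1]; in the varifocal loss that target is `gt_score`, so a single NaN or out-of-range target score in one batch is enough to kill training with exactly this device-side assert. Running with `CUDA_LAUNCH_BLOCKING=1` (or reproducing on CPU) makes the failing call easier to localize. A minimal CPU reproduction plus a defensive clamp — the tensors here are illustrative, not YOLOv6 code:

```python
import torch
import torch.nn.functional as F

pred = torch.sigmoid(torch.randn(4))              # valid probabilities in (0, 1)
bad_target = torch.tensor([0.2, 1.3, -0.1, 0.9])  # 1.3 and -0.1 are outside [0, 1]

# On CPU, recent PyTorch builds reject the out-of-range target with a
# catchable RuntimeError instead of a device-side assert.
try:
    F.binary_cross_entropy(pred, bad_target, reduction='none')
except RuntimeError as e:
    print("BCE rejected target:", e)

# Defensive workaround while debugging: clamp the soft targets first.
safe_target = bad_target.clamp(0.0, 1.0)
loss = F.binary_cross_entropy(pred, safe_target, reduction='none')
```

Clamping only masks the symptom, though; if `gt_score` goes out of range mid-training, the underlying cause is usually diverging training (exploding predictions feeding the target assigner) rather than bad annotations.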


Dingerscat commented 4 months ago

The learning rate is too large and the model has diverged; try lowering it.

Yahya-Younes commented 3 months ago

First, thank you for your answer! But I changed the lr in the configs/yolov6l.py file as follows:

```python
solver = dict(
    optim='SGD',
    lr_scheduler='Cosine',
    lr0=0.001,
    lrf=0.01,
    momentum=0.937,
    weight_decay=0.0005,
    warmup_epochs=5.0,
    warmup_momentum=0.8,
    warmup_bias_lr=0.05
)
```

and I still encounter the same error.