ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0
50.55k stars, 16.3k forks

There is a problem when using a model trained on a custom dataset as pretrained weights. #2439

Closed: leeyunhome closed this issue 3 years ago

leeyunhome commented 3 years ago

❔Question

Hello,

There is a problem when I use a model trained on a custom dataset as pretrained weights.

The `best.pt` passed to `--weights` is a model I previously trained starting from `yolov5s.pt` weights at image size 640.

```bash
python3 train.py --img 320 --batch 16 --epochs 200 --data /home/yhlee/coding/GitHub/dataset/data.yaml --cfg ./models/yolov5s.yaml --weights /home/yhlee/coding/GitHub/yolov3/runs/train/lpr_result5/weights/best.pt --name lpr_result
```

```
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 6 (delta 2), reused 6 (delta 2), pack-reused 0
Unpacking objects: 100% (6/6), done.
From https://github.com/ultralytics/yolov5
   80dbb96..980443b  multigpu_test -> origin/multigpu_test
Your branch is behind 'origin/master' by 133 commits, and can be fast-forwarded.
  (use "git pull" to update your local branch)

Using torch 1.7.1 CUDA:0 (GeForce RTX 3080, 10016.75MB)

Namespace(adam=False, batch_size=16, bucket='', cache_images=False, cfg='./models/yolov5s.yaml', data='/home/yhlee/coding/GitHub/dataset/data.yaml', device='', epochs=200, evolve=False, exist_ok=False, global_rank=-1, hyp='data/hyp.scratch.yaml', image_weights=False, img_size=[320, 320], local_rank=-1, log_artifacts=False, log_imgs=16, multi_scale=False, name='lpr_result', noautoanchor=False, nosave=False, notest=False, project='runs/train', rect=False, resume=False, save_dir='runs/train/lpr_result25', single_cls=False, sync_bn=False, total_batch_size=16, weights='/home/yhlee/coding/GitHub/yolov3/runs/train/lpr_result5/weights/best.pt', workers=8, world_size=1)
Start Tensorboard with "tensorboard --logdir runs/train", view at http://localhost:6006/
Hyperparameters {'lr0': 0.01, 'lrf': 0.2, 'momentum': 0.937, 'weight_decay': 0.0005, 'warmup_epochs': 3.0, 'warmup_momentum': 0.8, 'warmup_bias_lr': 0.1, 'box': 0.05, 'cls': 0.5, 'cls_pw': 1.0, 'obj': 1.0, 'obj_pw': 1.0, 'iou_t': 0.2, 'anchor_t': 4.0, 'fl_gamma': 0.0, 'hsv_h': 0.015, 'hsv_s': 0.7, 'hsv_v': 0.4, 'degrees': 0.0, 'translate': 0.1, 'scale': 0.5, 'shear': 0.0, 'perspective': 0.0, 'flipud': 0.0, 'fliplr': 0.5, 'mosaic': 1.0, 'mixup': 0.0}
Overriding model.yaml nc=80 with nc=102

                 from  n    params  module                                  arguments
  0                -1  1      3520  models.common.Focus                     [3, 32, 3]
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]
  2                -1  1     19904  models.common.BottleneckCSP             [64, 64, 1]
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]
  4                -1  1    161152  models.common.BottleneckCSP             [128, 128, 3]
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]
  6                -1  1    641792  models.common.BottleneckCSP             [256, 256, 3]
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]
  8                -1  1    656896  models.common.SPP                       [512, 512, [5, 9, 13]]
  9                -1  1   1248768  models.common.BottleneckCSP             [512, 512, 1, False]
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 12           [-1, 6]  1         0  models.common.Concat                    [1]
 13                -1  1    378624  models.common.BottleneckCSP             [512, 256, 1, False]
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 16           [-1, 4]  1         0  models.common.Concat                    [1]
 17                -1  1     95104  models.common.BottleneckCSP             [256, 128, 1, False]
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]
 19          [-1, 14]  1         0  models.common.Concat                    [1]
 20                -1  1    313088  models.common.BottleneckCSP             [256, 256, 1, False]
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]
 22          [-1, 10]  1         0  models.common.Concat                    [1]
 23                -1  1   1248768  models.common.BottleneckCSP             [512, 512, 1, False]
 24      [17, 20, 23]  1    288579  models.yolo.Detect                      [102, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model Summary: 283 layers, 7527491 parameters, 7527491 gradients, 17.7 GFLOPS
```
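The Detect row (layer 24) can be checked by hand. The sketch below assumes, as in YOLOv5's `models/yolo.py`, that the head is one 1×1 convolution with bias per output level, each producing `na * (nc + 5)` channels: `na = 3` anchors per level (row 24 lists three width/height pairs per level) and 5 = four box values plus objectness. The input channels `[128, 256, 512]` come from the same row; the helper name `detect_head_params` is hypothetical.

```python
def detect_head_params(nc, in_channels=(128, 256, 512), na=3):
    """Parameter count of a head built from one 1x1 Conv2d (with bias) per output level."""
    no = na * (nc + 5)  # outputs per grid cell: na anchors x (4 box + 1 objectness + nc classes)
    # A 1x1 conv from c input channels to no output channels has c * no weights plus no biases.
    return sum(c * no + no for c in in_channels)


print(detect_head_params(102))  # the nc=102 model in the log above
print(detect_head_params(80))   # the same head before "Overriding model.yaml nc=80"
```

The nc=102 result matches the 288579 reported for layer 24, and it shows that every change to `nc` changes the output shape of these three convolutions.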

```
Transferred 40/370 items from /home/yhlee/coding/GitHub/yolov3/runs/train/lpr_result5/weights/best.pt
Optimizer groups: 62 .bias, 70 conv.weight, 59 other
wandb: Currently logged in as: hodu (use wandb login --relogin to force relogin)
wandb: wandb version 0.10.22 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.10.17
wandb: Syncing run lpr_result25
wandb: ⭐️ View project at https://wandb.ai/hodu/YOLOv5
wandb: 🚀 View run at https://wandb.ai/hodu/YOLOv5/runs/3790r2p0
wandb: Run data is saved locally in /home/yhlee/coding/GitHub/yolov5/wandb/run-20210312_153548-3790r2p0
wandb: Run wandb offline to turn off syncing.
```
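The `Transferred 40/370 items` line appears to count how many entries of the new model's `state_dict` were filled from the checkpoint. A minimal sketch of that kind of transfer, assuming an entry is copied only when its key exists in both models with an identical tensor shape. The helper `intersect_state_dicts` and the two toy models are hypothetical, not YOLOv5 code; the second conv mimics a detection layer whose output width depends on `nc`.

```python
import torch.nn as nn


def intersect_state_dicts(src, dst):
    """Keep the entries of src whose key is in dst with exactly the same tensor shape."""
    return {k: v for k, v in src.items() if k in dst and v.shape == dst[k].shape}


# Same layer names, but the last conv's output channels depend on the class count.
ckpt_model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.Conv2d(16, 3 * (80 + 5), 1))
new_model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.Conv2d(16, 3 * (102 + 5), 1))

csd = intersect_state_dicts(ckpt_model.state_dict(), new_model.state_dict())
new_model.load_state_dict(csd, strict=False)  # strict=False tolerates the skipped entries
print(f"Transferred {len(csd)}/{len(new_model.state_dict())} items")
```

Only the first conv's weight and bias carry over here. A low ratio such as 40/370 therefore suggests that most layer names or shapes in the checkpoint differ from the model built from `--cfg`.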

```
Traceback (most recent call last):
  File "train.py", line 512, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "train.py", line 147, in train
    optimizer.load_state_dict(ckpt['optimizer'])
  File "/home/yhlee/anaconda3/envs/yolov3_env/lib/python3.8/site-packages/torch/optim/optimizer.py", line 124, in load_state_dict
    raise ValueError("loaded state dict contains a parameter group "
ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group
```
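PyTorch's `Optimizer.load_state_dict` raises this `ValueError` when a saved parameter group holds a different number of parameters than the matching group in the newly built optimizer. The log above shows the new optimizer built with three groups (62 `.bias`, 70 `conv.weight`, 59 other). A minimal sketch that reproduces the same message with toy models, using a single group for simplicity. `make_optimizer` and `try_load` are hypothetical helpers, and the SGD settings simply reuse `lr0` and `momentum` from the hyperparameters above.

```python
import torch
import torch.nn as nn


def make_optimizer(model):
    # One parameter group for simplicity; the YOLOv5 log above shows three.
    return torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.937)


def try_load(optimizer, state):
    """Return None if the state loads, otherwise the ValueError message."""
    try:
        optimizer.load_state_dict(state)
        return None
    except ValueError as err:
        return str(err)


# Optimizer state saved for a model with 4 parameter tensors in its group...
old_model = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 2))
saved_state = make_optimizer(old_model).state_dict()

# ...loaded into an optimizer whose group holds only 2 parameter tensors.
new_model = nn.Linear(4, 2)
print(try_load(make_optimizer(new_model), saved_state))

# An optimizer whose group sizes match loads the same state without error.
same_shape = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 2))
print(try_load(make_optimizer(same_shape), saved_state))
```

PyTorch compares only the number of parameters per group at this point, so the error indicates that the optimizer state stored in `best.pt` was built for a differently structured model than the one `train.py` just created.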

```
wandb: Waiting for W&B process to finish, PID 5192
wandb: Program failed with code 1.  Press ctrl-c to abort syncing.
wandb:
wandb: Find user logs for this run at: /home/yhlee/coding/GitHub/yolov5/wandb/run-20210312_153548-3790r2p0/logs/debug.log
wandb: Find internal logs for this run at: /home/yhlee/coding/GitHub/yolov5/wandb/run-20210312_153548-3790r2p0/logs/debug-internal.log
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb:
wandb: Synced lpr_result25: https://wandb.ai/hodu/YOLOv5/runs/3790r2p0
```

Can you tell me how to solve this problem?


glenn-jocher commented 3 years ago

@leeyunhome there are no known problems in the training workflows you describe.

In any case, your output shows that your code is out of date by 133 commits.

👋 Hello, thank you for your interest in our work! This issue seems to lack the minimum requirements for a proper response, or is insufficiently detailed for us to help you. Please note that most technical problems are due to:


- **Your custom data.** If your issue is not reproducible in one of our 3 common datasets ([COCO](https://github.com/ultralytics/yolov5/blob/master/data/coco.yaml), [COCO128](https://github.com/ultralytics/yolov5/blob/master/data/coco128.yaml), or [VOC](https://github.com/ultralytics/yolov5/blob/master/data/voc.yaml)), we cannot debug it. Visit our [Custom Training Tutorial](https://docs.ultralytics.com/yolov5/tutorials/train_custom_data) for guidelines on training your custom data. Examine `train_batch0.jpg` and `test_batch0.jpg` for a sanity check of your labels and images.

- **Your environment.** If your issue is not reproducible in one of the verified environments below, we cannot debug it. If you are running YOLOv5 locally, verify that your environment meets all of the [requirements.txt](https://github.com/ultralytics/yolov5/blob/master/requirements.txt) dependencies specified below. If in doubt, download Python 3.8.0 from https://www.python.org/, create a new [venv](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/), and install the requirements.
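In addition to viewing `train_batch0.jpg`, label files can be checked programmatically. A minimal sketch, assuming the standard YOLO label format (one `class x_center y_center width height` row per object, with coordinates normalized to the range 0 to 1) and `nc=102` as reported by `Overriding model.yaml nc=80 with nc=102` in the log above. `check_yolo_labels` is a hypothetical helper, not part of YOLOv5.

```python
def check_yolo_labels(lines, nc):
    """Return (line_number, message) pairs for malformed YOLO-format label rows.

    Assumes each non-blank row is: class x_center y_center width height,
    with an integer class id in [0, nc) and coordinates normalized to [0, 1].
    """
    problems = []
    for i, line in enumerate(lines, start=1):
        parts = line.split()
        if not parts:
            continue  # ignore blank lines
        if len(parts) != 5:
            problems.append((i, f"expected 5 values, got {len(parts)}"))
            continue
        try:
            cls = int(parts[0])
            coords = [float(p) for p in parts[1:]]
        except ValueError:
            problems.append((i, "could not parse values"))
            continue
        if not 0 <= cls < nc:
            problems.append((i, f"class {cls} outside [0, {nc})"))
        if any(not 0.0 <= c <= 1.0 for c in coords):
            problems.append((i, "coordinate outside [0, 1]"))
    return problems


# Hypothetical label rows checked against nc=102 from the log.
sample = [
    "0 0.5 0.5 0.2 0.1",
    "102 0.5 0.5 0.2 0.2",  # class id too large for nc=102
    "5 1.2 0.5 0.1 0.1",    # x_center outside [0, 1]
]
print(check_yolo_labels(sample, nc=102))
```

To check a real dataset, the same function can be applied to the lines of each `.txt` file in the labels directory.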

If none of these apply to you, we suggest you close this issue and raise a new one using the 🐛 **Bug Report template**, providing screenshots and a [minimum reproducible example](https://docs.ultralytics.com/help/minimum_reproducible_example/) of your issue. Thank you!

## Requirements

Python 3.8 or later with all [requirements.txt](https://github.com/ultralytics/yolov5/blob/master/requirements.txt) dependencies installed, including `torch>=1.7`. To install, run:
```bash
$ pip install -r requirements.txt
```

## Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

## Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are passing. These tests evaluate proper operation of basic YOLOv5 functionality, including training (train.py), testing (test.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu.

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

CarlChaaya commented 2 years ago

Did you solve it? If yes, how?
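One possible workaround, not confirmed anywhere in this thread, is to remove the saved optimizer state from the checkpoint before passing it to `--weights`. This sketch assumes the checkpoint is a dict with an `'optimizer'` entry (as `ckpt['optimizer']` in the traceback suggests) and that this version of `train.py` only calls `optimizer.load_state_dict` when that entry is not `None`. The function name `drop_optimizer_state` and the example paths are hypothetical.

```python
import torch


def drop_optimizer_state(src, dst):
    """Copy a checkpoint to dst with its saved optimizer state removed."""
    # Run from inside the repository that produced the checkpoint, because the
    # pickled model classes must be importable. On PyTorch 2.6+, loading a full
    # pickled model may also require torch.load(..., weights_only=False).
    ckpt = torch.load(src, map_location="cpu")
    # Assumption: train.py skips restoring the optimizer when this entry is None.
    ckpt["optimizer"] = None
    torch.save(ckpt, dst)
    return ckpt


# Example (hypothetical paths):
# drop_optimizer_state("runs/train/lpr_result5/weights/best.pt", "best_no_optim.pt")
```

The model weights are left untouched, so the usual name-and-shape transfer into the new model still applies; only the optimizer state, which caused the size mismatch, is discarded.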