ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

The size of tensor a (24) must match the size of tensor b (40) at non-singleton dimension 0 #1412

Closed · gofugoo closed this issue 3 years ago

gofugoo commented 3 years ago

I trained on my custom data on Google Colab.

python3.8 train.py --data ../data_bbox/data2/custom.yaml --cfg ../data_bbox/data2/yolov5s.yaml --weights ../data_bbox/data2/ys_best_2020_11_15.pt --batch-size 200 --epochs 16000 --rect --img-size 512 --cache-images --hyp runs/evolve/hyp_evolved.yaml --resume

Scanning images: 100% 1455/1455 [21:11<00:00,  1.14it/s]
Scanning labels ../data_bbox/data2/labels.cache (1455 found, 0 missing, 0 empty, 0 duplicate, for 1455 images): 1455it [00:00, 12209.94it/s]
Caching images (0.6GB): 100% 1455/1455 [01:30<00:00, 16.12it/s]
Scanning images: 100% 99/99 [01:29<00:00,  1.11it/s]
Scanning labels ../data_bbox/data2/labels.cache (99 found, 0 missing, 0 empty, 0 duplicate, for 99 images): 99it [00:00, 11718.25it/s]
Caching images (0.0GB): 100% 99/99 [00:03<00:00, 26.93it/s]
Image sizes 512 train, 512 test
Using 2 dataloader workers
Logging results to runs/exp4
Starting training for 16000 epochs...

     Epoch   gpu_mem       box       obj       cls     total   targets  img_size
9780/15999     13.9G    0.2292   0.02814   0.03342    0.2907       298       512:  50% 4/8 [00:05<00:05,  1.37s/it]
Traceback (most recent call last):
  File "train.py", line 460, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 278, in train
    scaler.step(optimizer)  # optimizer.step
  File "/usr/local/lib/python3.8/dist-packages/torch/cuda/amp/grad_scaler.py", line 321, in step
    retval = optimizer.step(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/lr_scheduler.py", line 67, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/sgd.py", line 106, in step
    buf.mul_(momentum).add_(d_p, alpha=1 - dampening)
RuntimeError: The size of tensor a (24) must match the size of tensor b (40) at non-singleton dimension 0
glenn-jocher commented 3 years ago

Hello, thank you for your interest in our work! This issue seems to lack the minimum requirements for a proper response, or is insufficiently detailed for us to help you. Please note that most technical problems are due to:

- **Your custom data.** If your issue is not reproducible in one of our 3 common datasets ([COCO](https://github.com/ultralytics/yolov5/blob/master/data/coco.yaml), [COCO128](https://github.com/ultralytics/yolov5/blob/master/data/coco128.yaml), or [VOC](https://github.com/ultralytics/yolov5/blob/master/data/voc.yaml)) we can not debug it. Visit our [Custom Training Tutorial](https://docs.ultralytics.com/yolov5/tutorials/train_custom_data) for guidelines on training your custom data. Examine `train_batch0.jpg` and `test_batch0.jpg` for a sanity check of your labels and images.

- **Your environment.** If your issue is not reproducible in one of the verified environments below we can not debug it. If you are running YOLOv5 locally, verify your environment meets all of the [requirements.txt](https://github.com/ultralytics/yolov5/blob/master/requirements.txt) dependencies specified below. If in doubt, download Python 3.8.0 from https://www.python.org/, create a new [venv](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/), and then install requirements.
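
For anyone setting this up from scratch, here is a minimal sketch of that sequence (the environment name `yolov5-env` is arbitrary, and the commands assume a Linux/macOS shell with Python 3.8 available as `python3.8`):

```bash
python3.8 -m venv yolov5-env                      # create an isolated environment
source yolov5-env/bin/activate                    # activate it (Linux/macOS)
git clone https://github.com/ultralytics/yolov5   # fresh copy of the repository
cd yolov5
pip install -r requirements.txt                   # install the pinned dependencies
```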

If none of these apply to you, we suggest you close this issue and raise a new one using the **Bug Report template**, providing screenshots and **minimum viable code to reproduce your issue**. Thank you!

## Requirements

Python 3.8 or later with all [requirements.txt](https://github.com/ultralytics/yolov5/blob/master/requirements.txt) dependencies installed, including `torch>=1.6`. To install run:
```bash
$ pip install -r requirements.txt
```

## Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

- Google Colab Notebook with free GPU
- Kaggle Notebook with free GPU
- Google Cloud Deep Learning VM
- Docker Image

## Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are passing. These tests evaluate proper operation of basic YOLOv5 functionality, including training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu.

gofugoo commented 3 years ago

@glenn-jocher Yes, I have tried downloading the latest code from the master branch. When training with the "--resume" flag, the same issue still occurs.

Using torch 1.7.0+cu101 CUDA:0 (Tesla V100-SXM2-16GB, 16130MB)

Namespace(adam=False, batch_size=200, bucket='', cache_images=True, cfg='', data='../data_bbox/data2/custem.yaml', device='', epochs=16000, evolve=False, exist_ok=False, global_rank=-1, hyp='hyps/hyp_evolved.yaml', image_weights=False, img_size=[512, 512], local_rank=-1, log_imgs=16, multi_scale=False, name='exp', noautoanchor=False, nosave=False, notest=False, project='runs/train', rect=True, resume=True, save_dir='runs/train/exp', single_cls=False, sync_bn=False, total_batch_size=200, weights='./runs/train/exp/weights/last.pt', workers=8, world_size=1)
Start Tensorboard with "tensorboard --logdir runs/train", view at http://localhost:6006/
2020-11-17 09:22:20.830952: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Hyperparameters {'lr0': 0.0121, 'lrf': 0.219, 'momentum': 0.94, 'weight_decay': 0.00043, 'warmup_epochs': 2.19, 'warmup_momentum': 0.95, 'warmup_bias_lr': 0.0836, 'box': 0.0644, 'cls': 0.52, 'cls_pw': 0.811, 'obj': 0.947, 'obj_pw': 1.48, 'iou_t': 0.2, 'anchor_t': 4.53, 'anchors': 4.68, 'fl_gamma': 0.0, 'hsv_h': 0.0124, 'hsv_s': 0.798, 'hsv_v': 0.36, 'degrees': 0.0, 'translate': 0.119, 'scale': 0.515, 'shear': 0.0, 'perspective': 0.0, 'flipud': 0.0, 'fliplr': 0.5, 'mosaic': 0.522, 'mixup': 0.0}

                 from  n    params  module                                  arguments                     
  0                -1  1      3520  models.common.Focus                     [3, 32, 3]                    
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]                
  2                -1  1     19904  models.common.BottleneckCSP             [64, 64, 1]                   
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  4                -1  1    161152  models.common.BottleneckCSP             [128, 128, 3]                 
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  6                -1  1    641792  models.common.BottleneckCSP             [256, 256, 3]                 
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]              
  8                -1  1    656896  models.common.SPP                       [512, 512, [5, 9, 13]]        
  9                -1  1   1248768  models.common.BottleneckCSP             [512, 512, 1, False]          
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  1    378624  models.common.BottleneckCSP             [512, 256, 1, False]          
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  1     95104  models.common.BottleneckCSP             [256, 128, 1, False]          
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]              
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  1    313088  models.common.BottleneckCSP             [256, 256, 1, False]          
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  1   1248768  models.common.BottleneckCSP             [512, 512, 1, False]          
 24      [17, 20, 23]  1     35960  models.yolo.Detect                      [3, [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]], [128, 256, 512]]
Model Summary: 283 layers, 7274872 parameters, 7274872 gradients

Transferred 362/370 items from ./runs/train/exp/weights/last.pt
Optimizer groups: 62 .bias, 70 conv.weight, 59 other
wandb: Currently logged in as: googled (use `wandb login --relogin` to force relogin)
wandb: Tracking run with wandb version 0.10.10
wandb: Resuming run exp
wandb: ⭐️ View project at https://wandb.ai/googled/YOLOv5
wandb: 🚀 View run at https://wandb.ai/googled/YOLOv5/runs/1va6x8kd
wandb: Run data is saved locally in /content/drive/My Drive/dpworkspace/yolov5-master/wandb/run-20201117_092228-1va6x8kd
wandb: Run `wandb off` to turn off syncing.

Scanning images: 100%|██████████| 1455/1455 [00:02<00:00, 491.05it/s]
Scanning labels ../data_bbox/data2/labels.cache (1455 found, 0 missing, 0 empty, 0 duplicate, for 1455 images): 1455it [00:00, 9963.03it/s]
Caching images (0.6GB): 100%|██████████| 1455/1455 [00:30<00:00, 48.16it/s]
Scanning images: 100%|██████████| 99/99 [00:00<00:00, 361.75it/s]
Scanning labels ../data_bbox/data2/labels.cache (99 found, 0 missing, 0 empty, 0 duplicate, for 99 images): 99it [00:00, 5609.93it/s]
Caching images (0.0GB): 100%|██████████| 99/99 [00:03<00:00, 31.84it/s]
Image sizes 512 train, 512 test
Using 2 dataloader workers
Logging results to runs/train/exp
Starting training for 16000 epochs...

     Epoch   gpu_mem       box       obj       cls     total   targets  img_size
  20/15999     13.6G    0.2277   0.02139   0.03305    0.2822       356       512:  38%|███▊      | 3/8 [00:05<00:13,  2.73s/it]
Traceback (most recent call last):
  File "train.py", line 490, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "train.py", line 296, in train
    scaler.step(optimizer)  # optimizer.step
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/amp/grad_scaler.py", line 321, in step
    retval = optimizer.step(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py", line 67, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/optim/sgd.py", line 106, in step
    buf.mul_(momentum).add_(d_p, alpha=1 - dampening)
RuntimeError: The size of tensor a (24) must match the size of tensor b (40) at non-singleton dimension 0
glenn-jocher commented 3 years ago

@gofugoo FYI --resume accepts zero additional arguments. Your only options when using it are:

python train.py --resume  # from most recent last.pt
python train.py --resume path/to/last.pt
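
Since --resume is meant to pick up the original run's settings from the checkpoint itself, it can also help to peek inside last.pt before resuming. A minimal inspection sketch (the key names 'epoch', 'best_fitness', 'optimizer' are assumptions about how checkpoints of this era were saved; run it from inside the yolov5 repo so the pickled model class can be unpickled):

```python
# Sketch: inspect a YOLOv5 checkpoint before resuming (key names are assumptions).
import torch

ckpt = torch.load('runs/train/exp/weights/last.pt', map_location='cpu')
print(list(ckpt.keys()))                           # e.g. ['epoch', 'best_fitness', 'model', 'optimizer', ...]
print('last completed epoch:', ckpt.get('epoch'))  # which run this checkpoint actually belongs to
print('optimizer state saved:', ckpt.get('optimizer') is not None)
```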
gofugoo commented 3 years ago

> @gofugoo FYI --resume accepts zero additional arguments. Your only options when using it are:
>
> python train.py --resume  # from most recent last.pt
> python train.py --resume path/to/last.pt

Thanks for the reminder. Training without "--resume" works fine, but it still fails when I use the "--resume" flag. Is there anything else I can try?
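
For context on the error itself: the traceback fails inside SGD's momentum-buffer update, so one plausible way to hit exactly this message is resuming optimizer state whose saved momentum buffers no longer match the rebuilt model's parameter shapes (for example if the anchor or class count changed between the checkpoint and the current cfg/hyp). Below is a minimal standalone sketch of that failure mode, not taken from train.py; the channel counts 24 and 40 are purely illustrative:

```python
# Standalone sketch of the failure mode suggested by the traceback:
# SGD momentum buffers saved for one parameter shape, reloaded onto another.
import torch
import torch.nn as nn

# "old" model: checkpoint written with 24 output channels
old = nn.Conv2d(3, 24, 1)
opt_old = torch.optim.SGD(old.parameters(), lr=0.01, momentum=0.9)
old(torch.randn(1, 3, 8, 8)).sum().backward()
opt_old.step()                      # first step creates the momentum buffers
saved = opt_old.state_dict()

# "new" model: rebuilt with 40 output channels (e.g. different anchors/classes)
new = nn.Conv2d(3, 40, 1)
opt_new = torch.optim.SGD(new.parameters(), lr=0.01, momentum=0.9)
opt_new.load_state_dict(saved)      # loads the old buffers without any shape check
new(torch.randn(1, 3, 8, 8)).sum().backward()
opt_new.step()                      # RuntimeError: The size of tensor a (24) must match the size of tensor b (40) ...
```

If something like this is happening here, it would explain why a fresh run trains fine while --resume fails; comparing the configuration saved alongside last.pt with the current cfg and hyp (especially the 'anchors' setting) would confirm it.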

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.