The size of tensor a (24) must match the size of tensor b (40) at non-singleton dimension 0 #1412

Closed by gofugoo 3 years ago

gofugoo commented 3 years ago

I trained on my custom data in Google Colab.

python3.8 train.py --data ../data_bbox/data2/custom.yaml --cfg ../data_bbox/data2/yolov5s.yaml --weights ../data_bbox/data2/ys_best_2020_11_15.pt --batch-size 200 --epochs 16000 --rect --img-size 512 --cache-images --hyp runs/evolve/hyp_evolved.yaml --resume

Scanning images: 100% 1455/1455 [21:11<00:00,  1.14it/s]
Scanning labels ../data_bbox/data2/labels.cache (1455 found, 0 missing, 0 empty, 0 duplicate, for 1455 images): 1455it [00:00, 12209.94it/s]
Caching images (0.6GB): 100% 1455/1455 [01:30<00:00, 16.12it/s]
Scanning images: 100% 99/99 [01:29<00:00,  1.11it/s]
Scanning labels ../data_bbox/data2/labels.cache (99 found, 0 missing, 0 empty, 0 duplicate, for 99 images): 99it [00:00, 11718.25it/s]
Caching images (0.0GB): 100% 99/99 [00:03<00:00, 26.93it/s]
Image sizes 512 train, 512 test
Using 2 dataloader workers
Logging results to runs/exp4
Starting training for 16000 epochs...

     Epoch   gpu_mem       box       obj       cls     total   targets  img_size
9780/15999     13.9G    0.2292   0.02814   0.03342    0.2907       298       512:  50% 4/8 [00:05<00:05,  1.37s/it]
Traceback (most recent call last):
  File "train.py", line 460, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 278, in train
    scaler.step(optimizer)  # optimizer.step
  File "/usr/local/lib/python3.8/dist-packages/torch/cuda/amp/grad_scaler.py", line 321, in step
    retval = optimizer.step(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/lr_scheduler.py", line 67, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/sgd.py", line 106, in step
    buf.mul_(momentum).add_(d_p, alpha=1 - dampening)
RuntimeError: The size of tensor a (24) must match the size of tensor b (40) at non-singleton dimension 0
glenn-jocher commented 3 years ago

gofugoo commented 3 years ago

@glenn-jocher Yes, I downloaded the latest code from the master branch. When I train with the "--resume" parameter, the same issue occurs.

Using torch 1.7.0+cu101 CUDA:0 (Tesla V100-SXM2-16GB, 16130MB)

Namespace(adam=False, batch_size=200, bucket='', cache_images=True, cfg='', data='../data_bbox/data2/custem.yaml', device='', epochs=16000, evolve=False, exist_ok=False, global_rank=-1, hyp='hyps/hyp_evolved.yaml', image_weights=False, img_size=[512, 512], local_rank=-1, log_imgs=16, multi_scale=False, name='exp', noautoanchor=False, nosave=False, notest=False, project='runs/train', rect=True, resume=True, save_dir='runs/train/exp', single_cls=False, sync_bn=False, total_batch_size=200, weights='./runs/train/exp/weights/last.pt', workers=8, world_size=1)
Start Tensorboard with "tensorboard --logdir runs/train", view at http://localhost:6006/
2020-11-17 09:22:20.830952: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Hyperparameters {'lr0': 0.0121, 'lrf': 0.219, 'momentum': 0.94, 'weight_decay': 0.00043, 'warmup_epochs': 2.19, 'warmup_momentum': 0.95, 'warmup_bias_lr': 0.0836, 'box': 0.0644, 'cls': 0.52, 'cls_pw': 0.811, 'obj': 0.947, 'obj_pw': 1.48, 'iou_t': 0.2, 'anchor_t': 4.53, 'anchors': 4.68, 'fl_gamma': 0.0, 'hsv_h': 0.0124, 'hsv_s': 0.798, 'hsv_v': 0.36, 'degrees': 0.0, 'translate': 0.119, 'scale': 0.515, 'shear': 0.0, 'perspective': 0.0, 'flipud': 0.0, 'fliplr': 0.5, 'mosaic': 0.522, 'mixup': 0.0}

                 from  n    params  module                                  arguments                     
  0                -1  1      3520  models.common.Focus                     [3, 32, 3]                    
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]                
  2                -1  1     19904  models.common.BottleneckCSP             [64, 64, 1]                   
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  4                -1  1    161152  models.common.BottleneckCSP             [128, 128, 3]                 
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  6                -1  1    641792  models.common.BottleneckCSP             [256, 256, 3]                 
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]              
  8                -1  1    656896  models.common.SPP                       [512, 512, [5, 9, 13]]        
  9                -1  1   1248768  models.common.BottleneckCSP             [512, 512, 1, False]          
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  1    378624  models.common.BottleneckCSP             [512, 256, 1, False]          
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  1     95104  models.common.BottleneckCSP             [256, 128, 1, False]          
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]              
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  1    313088  models.common.BottleneckCSP             [256, 256, 1, False]          
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  1   1248768  models.common.BottleneckCSP             [512, 512, 1, False]          
 24      [17, 20, 23]  1     35960  models.yolo.Detect                      [3, [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]], [128, 256, 512]]
Model Summary: 283 layers, 7274872 parameters, 7274872 gradients

Transferred 362/370 items from ./runs/train/exp/weights/last.pt
Optimizer groups: 62 .bias, 70 conv.weight, 59 other
wandb: Currently logged in as: googled (use `wandb login --relogin` to force relogin)
wandb: Tracking run with wandb version 0.10.10
wandb: Resuming run exp
wandb: ⭐️ View project at https://wandb.ai/googled/YOLOv5
wandb: 🚀 View run at https://wandb.ai/googled/YOLOv5/runs/1va6x8kd
wandb: Run data is saved locally in /content/drive/My Drive/dpworkspace/yolov5-master/wandb/run-20201117_092228-1va6x8kd
wandb: Run `wandb off` to turn off syncing.

Scanning images: 100%|██████████| 1455/1455 [00:02<00:00, 491.05it/s]
Scanning labels ../data_bbox/data2/labels.cache (1455 found, 0 missing, 0 empty, 0 duplicate, for 1455 images): 1455it [00:00, 9963.03it/s]
Caching images (0.6GB): 100%|██████████| 1455/1455 [00:30<00:00, 48.16it/s]
Scanning images: 100%|██████████| 99/99 [00:00<00:00, 361.75it/s]
Scanning labels ../data_bbox/data2/labels.cache (99 found, 0 missing, 0 empty, 0 duplicate, for 99 images): 99it [00:00, 5609.93it/s]
Caching images (0.0GB): 100%|██████████| 99/99 [00:03<00:00, 31.84it/s]
Image sizes 512 train, 512 test
Using 2 dataloader workers
Logging results to runs/train/exp
Starting training for 16000 epochs...

     Epoch   gpu_mem       box       obj       cls     total   targets  img_size
  20/15999     13.6G    0.2277   0.02139   0.03305    0.2822       356       512:  38%|███▊      | 3/8 [00:05<00:13,  2.73s/it]Traceback (most recent call last):
  File "train.py", line 490, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "train.py", line 296, in train
    scaler.step(optimizer)  # optimizer.step
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/amp/grad_scaler.py", line 321, in step
    retval = optimizer.step(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py", line 67, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/optim/sgd.py", line 106, in step
    buf.mul_(momentum).add_(d_p, alpha=1 - dampening)
RuntimeError: The size of tensor a (24) must match the size of tensor b (40) at non-singleton dimension 0
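
One possible reading of the sizes 24 and 40, offered as an assumption rather than a confirmed diagnosis: a YOLOv5 detection head emits `anchors * (classes + 5)` channels per scale. The model summary above lists `Detect` with 3 classes and ten anchor values (five width/height pairs) per scale, which gives 40, while the same head with three anchors per scale gives 24. A momentum buffer saved for one layout would then not fit gradients from the other. The helper names below are hypothetical and exist only to show the arithmetic.

```python
def detect_out_channels(num_classes: int, num_anchors: int) -> int:
    """Channels of one detection-head conv: anchors * (classes + 5).

    The 5 covers the four box coordinates plus one objectness score.
    """
    return num_anchors * (num_classes + 5)


def detect_param_count(in_channels, num_classes, num_anchors):
    """Parameters of the head: one 1x1 conv (weights + bias) per scale."""
    out = detect_out_channels(num_classes, num_anchors)
    return sum(c * out + out for c in in_channels)


nc = 3  # class count taken from the Detect arguments in the log
print(detect_out_channels(nc, 3))                  # 24
print(detect_out_channels(nc, 5))                  # 40
print(detect_param_count([128, 256, 512], nc, 5))  # 35960, matches the log
```

The last value agrees with the 35960 parameters reported for layer 24, which suggests the model being built uses five anchors per scale.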
glenn-jocher commented 3 years ago

@gofugoo FYI, --resume accepts no additional arguments. Your only options when using it are:

python train.py --resume  # from most recent last.pt
python train.py --resume path/to/last.pt
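
The two accepted forms above (a bare flag, or a flag followed by a checkpoint path) can be modelled with an optional-value argument in `argparse`. This is a minimal sketch under that assumption, not a copy of `train.py`; the function `resume_target` and the `"latest"` sentinel are hypothetical names used for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # nargs="?" lets the flag appear alone (value becomes const)
    # or with exactly one value (the checkpoint path).
    parser.add_argument("--resume", nargs="?", const=True, default=False)
    return parser


def resume_target(argv):
    """Return None (no resume), "latest" (bare flag), or the given path."""
    opt = build_parser().parse_args(argv)
    if opt.resume is False:
        return None
    if opt.resume is True:
        return "latest"  # hypothetical sentinel for "most recent last.pt"
    return opt.resume


print(resume_target([]))                               # None
print(resume_target(["--resume"]))                     # latest
print(resume_target(["--resume", "path/to/last.pt"]))  # path/to/last.pt
```

In the second log above, the printed `Namespace` shows `cfg=''` and `weights='./runs/train/exp/weights/last.pt'` even though other values were passed on the command line, which is consistent with the saved run settings replacing extra flags when resuming.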
gofugoo commented 3 years ago

> @gofugoo FYI, --resume accepts no additional arguments. Your only options when using it are:
>
> python train.py --resume  # from most recent last.pt
> python train.py --resume path/to/last.pt

Thanks for the reminder. Training without "--resume" seems to work fine, but it still fails with the "--resume" parameter. Is there anything else I should try?
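
One way to narrow down a failure like this is to compare the parameter shapes stored in the checkpoint with those of the freshly built model before training continues. The sketch below uses plain dictionaries of shape tuples as stand-ins; with PyTorch the shapes would come from the tensors in each `state_dict()`. The function name and the parameter names in the sample data are illustrative assumptions, not values read from the actual checkpoint.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def find_shape_mismatches(
    saved: Dict[str, Shape], current: Dict[str, Shape]
) -> List[Tuple[str, Shape, Shape]]:
    """List parameters present in both mappings whose shapes differ."""
    mismatches = []
    for name, current_shape in current.items():
        saved_shape = saved.get(name)
        if saved_shape is not None and saved_shape != current_shape:
            mismatches.append((name, saved_shape, current_shape))
    return mismatches


# Illustrative data only: a head bias saved with 24 entries, rebuilt with 40.
saved = {"model.24.m.0.bias": (24,), "model.0.conv.conv.weight": (32, 12, 3, 3)}
current = {"model.24.m.0.bias": (40,), "model.0.conv.conv.weight": (32, 12, 3, 3)}

for name, old, new in find_shape_mismatches(saved, current):
    print(f"{name}: checkpoint {old} vs model {new}")
```

Any name reported here points at a layer whose saved optimizer state would no longer fit the rebuilt model.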

github-actions[bot] commented 3 years ago

