johnolafenwa / deepstack-trainer

Custom Object Detection Training for DeepStack
GNU General Public License v3.0

RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation. #10

Closed: marrobHD closed this issue 3 years ago

marrobHD commented 3 years ago

I used your Google Colab notebook. Command: !python3 train.py --dataset-path "/content/test" --model "yolov5x"

Error:

Using torch 1.8.0+cu101 CUDA:0 (Tesla K80, 11441MB)

Namespace(adam=False, batch_size=16, bucket='', cache_images=False, cfg='./models/yolov5s.yaml', classes='', data={'train': '/content/test/train', 'val': '/content/test/test', 'nc': 3, 'names': ['Volkswagen', 'DeutschePost', 'DHL']}, dataset_path='/content/test', device='', epochs=300, evolve=False, exist_ok=False, global_rank=-1, hyp='data/hyp.scratch.yaml', image_weights=False, img_size=[640, 640], local_rank=-1, log_imgs=16, model='yolov5s', multi_scale=False, name='exp', noautoanchor=False, nosave=False, notest=False, project='train-runs/test', rect=False, resume=False, save_dir='train-runs/test/exp3', single_cls=False, sync_bn=False, total_batch_size=16, weights='yolov5s.pt', workers=8, world_size=1)
Start Tensorboard with "tensorboard --logdir train-runs/test", view at http://localhost:6006/
2021-03-19 15:56:03.959323: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Hyperparameters {'lr0': 0.01, 'lrf': 0.2, 'momentum': 0.937, 'weight_decay': 0.0005, 'warmup_epochs': 3.0, 'warmup_momentum': 0.8, 'warmup_bias_lr': 0.1, 'box': 0.05, 'cls': 0.5, 'cls_pw': 1.0, 'obj': 1.0, 'obj_pw': 1.0, 'iou_t': 0.2, 'anchor_t': 4.0, 'fl_gamma': 0.0, 'hsv_h': 0.015, 'hsv_s': 0.7, 'hsv_v': 0.4, 'degrees': 0.0, 'translate': 0.1, 'scale': 0.5, 'shear': 0.0, 'perspective': 0.0, 'flipud': 0.0, 'fliplr': 0.5, 'mosaic': 1.0, 'mixup': 0.0}
Downloading https://github.com/ultralytics/yolov5/releases/download/v3.1/yolov5s.pt to yolov5s.pt...
100% 14.5M/14.5M [00:00<00:00, 22.4MB/s]

Overriding model.yaml nc=80 with nc=3

                 from  n    params  module                                  arguments                     
  0                -1  1      3520  models.common.Focus                     [3, 32, 3]                    
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]                
  2                -1  1     19904  models.common.BottleneckCSP             [64, 64, 1]                   
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  4                -1  1    161152  models.common.BottleneckCSP             [128, 128, 3]                 
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  6                -1  1    641792  models.common.BottleneckCSP             [256, 256, 3]                 
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]              
  8                -1  1    656896  models.common.SPP                       [512, 512, [5, 9, 13]]        
  9                -1  1   1248768  models.common.BottleneckCSP             [512, 512, 1, False]          
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  1    378624  models.common.BottleneckCSP             [512, 256, 1, False]          
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  1     95104  models.common.BottleneckCSP             [256, 128, 1, False]          
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]              
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  1    313088  models.common.BottleneckCSP             [256, 256, 1, False]          
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  1   1248768  models.common.BottleneckCSP             [512, 512, 1, False]          
 24      [17, 20, 23]  1     21576  models.yolo.Detect                      [3, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Traceback (most recent call last):
  File "train.py", line 530, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "train.py", line 90, in train
    model = Model(opt.cfg or ckpt['model'].yaml, ch=3, nc=nc).to(device)  # create
  File "/content/deepstack-trainer/deepstack-trainer/deepstack-trainer/deepstack-trainer/models/yolo.py", line 96, in __init__
    self._initialize_biases()  # only run once
  File "/content/deepstack-trainer/deepstack-trainer/deepstack-trainer/deepstack-trainer/models/yolo.py", line 151, in _initialize_biases
    b[:, 4] += math.log(8 / (640 / s) ** 2)  # obj (8 objects per 640 image)
RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.
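
The error comes from _initialize_biases in models/yolo.py: it edits a view of the Detect layer's bias Parameter in place, and PyTorch 1.8+ rejects that while autograd is tracking. A minimal sketch of a source-level workaround, mirroring the pattern in the traceback above rather than the exact file (the module and values below are only illustrative), is to apply the edit to the underlying data:

# Sketch of the failing pattern and one workaround; conv, nc, na and s are stand-ins.
import math
import torch.nn as nn

nc, na, s = 3, 3, 8                       # classes, anchors per layer, stride (example values)
conv = nn.Conv2d(128, na * (nc + 5), 1)   # stand-in for one Detect output convolution

b = conv.bias.view(na, -1)                # view of a leaf Parameter that requires grad
# b[:, 4] += math.log(8 / (640 / s) ** 2)     # the in-place edit that raises the RuntimeError
b.data[:, 4] += math.log(8 / (640 / s) ** 2)  # editing .data (or wrapping the edit in
                                              # torch.no_grad()) bypasses the autograd check
conv.bias = nn.Parameter(b.view(-1), requires_grad=True)

Later upstream YOLOv5 releases appear to include a similar change; the copy bundled with deepstack-trainer predates it, which is why the error only shows up on newer PyTorch builds.
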
marcokloeckler commented 3 years ago

I got the same problem. I fixed it by installing an older version of PyTorch:

pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html

You may have to adjust the CUDA version of the Python packages to match your runtime, but this fixed the issue for me!
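
If the runtime reports a different CUDA version (the log above shows torch 1.8.0+cu101, i.e. a CUDA 10.1 build), the matching 1.7.1 wheels would presumably be something like:

pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html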

marrobHD commented 3 years ago

Thank you very much! The first part worked flawlessly, but after the first epoch the following errors occurred:

     Epoch   gpu_mem       box       obj       cls     total   targets  img_size
   298/299     9.25G   0.04881   0.01046  0.003113   0.06239        14       640: 100% 2/2 [00:01<00:00,  1.79it/s]
               Class      Images     Targets           P           R      mAP@.5  mAP@.5:.95: 100% 1/1 [00:00<00:00,  4.11it/s]
                 all          16           0           0           0           0           0

     Epoch   gpu_mem       box       obj       cls     total   targets  img_size
   299/299     9.25G   0.05258   0.01181   0.00305   0.06744        19       640: 100% 2/2 [00:00<00:00,  2.12it/s]
               Class      Images     Targets           P           R      mAP@.5  mAP@.5:.95: 100% 1/1 [00:00<00:00,  1.34it/s]
                 all          16           0           0           0           0           0
/usr/local/lib/python3.7/dist-packages/seaborn/matrix.py:194: RuntimeWarning: All-NaN slice encountered
  vmin = np.nanmin(calc_data)
/usr/local/lib/python3.7/dist-packages/seaborn/matrix.py:199: RuntimeWarning: All-NaN slice encountered
  vmax = np.nanmax(calc_data)
Exception in thread Thread-614:
Traceback (most recent call last):
  File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/content/deepstack-trainer/utils/plots.py", line 122, in plot_images
    colors = color_list()  # list of colors
  File "/content/deepstack-trainer/utils/plots.py", line 32, in color_list
    return [hex2rgb(h) for h in plt.rcParams['axes.prop_cycle'].by_key()['color']]
  File "/content/deepstack-trainer/utils/plots.py", line 32, in <listcomp>
    return [hex2rgb(h) for h in plt.rcParams['axes.prop_cycle'].by_key()['color']]
  File "/content/deepstack-trainer/utils/plots.py", line 30, in hex2rgb
    return tuple(int(h[1 + i:1 + i + 2], 16) for i in (0, 2, 4))
  File "/content/deepstack-trainer/utils/plots.py", line 30, in <genexpr>
    return tuple(int(h[1 + i:1 + i + 2], 16) for i in (0, 2, 4))
TypeError: int() can't convert non-string with explicit base

Optimizer stripped from train-runs/dataset/exp/weights/last.pt, 43.4MB
Optimizer stripped from train-runs/dataset/exp/weights/best.pt, 43.4MB
300 epochs completed in 0.711 hours.
marcokloeckler commented 3 years ago

Glad to hear it worked! What start parameters did you use? With the default value for epochs you will train for 300 epochs, which is what happened according to the first two lines of your snippet:


    Epoch   gpu_mem       box       obj       cls     total   targets  img_size
  298/299     9.25G   0.04881   0.01046  0.003113   0.06239        14       640: 100% 2/2 [00:01<00:00,  1.79it/s]
marrobHD commented 3 years ago

Here is the full log: https://0bin.net/paste/42IzMhmE#FhlOBA6PMJhWlSeYx53Drp6IIQlN9fLU-wTkOf8/Gxs I've cut out everything before epoch 298. DeepStack still won't detect my trained logos and objects. I don't know whether the "Exception in thread Thread-614" is at fault.

marcokloeckler commented 3 years ago

Looking at lines 47 and 48 of your attached log file, I notice that the trainer does not find any labels. You need a label file for each image, in the training set as well as in the validation set, and each of the two folders must also contain a 'classes.txt'.
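
A sketch of the layout the trainer expects, using the dataset path from the command below and hypothetical image names (assuming the standard YOLO label format: one '<class-id> <x-center> <y-center> <width> <height>' line per object, with coordinates normalized to 0-1):

/content/dataset/
    train/
        classes.txt      # one class name per line, e.g. Volkswagen, DeutschePost, DHL
        image001.jpg
        image001.txt     # labels for image001.jpg
        ...
    test/
        classes.txt
        image101.jpg
        image101.txt     # labels for image101.jpg
        ...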

To shorten the training time (until everything runs smoothly) you can call deepstack-trainer with the following start parameters (by default it runs for 300 epochs):

python3 train.py --dataset-path "/content/dataset" --epochs 30

marrobHD commented 3 years ago

Thank you, everything now works as intended. I had forgotten to include the .txt files in the test directory. I'll close this now.

git-Cade commented 2 years ago

Is there any way to resolve this error and use CUDA 11.1+? PyTorch 1.7.1 is not compatible with Ampere GPUs.

sipvoip commented 2 years ago

I have the same issue. Is there any way to use this with the latest CUDA and PyTorch?