ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Problems with the --multi-scale option with CUDA #7678

Closed DP1701 closed 2 years ago

DP1701 commented 2 years ago

Search before asking

YOLOv5 Component

Training

Bug

Training does not start when the --multi-scale option is enabled. It stops right at the beginning of the first epoch.

(YOLOv5_enviroment) userA@dgx:~/yolov5$ python train.py --multi-scale

train: weights=yolov5s.pt, cfg=, data=data/coco128.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=300, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=True, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: up to date with https://github.com/ultralytics/yolov5 ✅
YOLOv5 🚀 v6.1-171-gc4862fc torch 1.11.0+cu113 CUDA:0 (A100-SXM4-40GB, 40537MiB)

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Weights & Biases: run 'pip install wandb' to automatically track and visualize YOLOv5 🚀 runs (RECOMMENDED)
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/

                 from  n    params  module                                  arguments                     
  0                -1  1      3520  models.common.Conv                      [3, 32, 6, 2, 2]              
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]                
  2                -1  1     18816  models.common.C3                        [64, 64, 1]                   
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  4                -1  2    115712  models.common.C3                        [128, 128, 2]                 
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  6                -1  3    625152  models.common.C3                        [256, 256, 3]                 
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]              
  8                -1  1   1182720  models.common.C3                        [512, 512, 1]                 
  9                -1  1    656896  models.common.SPPF                      [512, 512, 5]                 
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]          
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]          
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]              
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]          
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]          
 24      [17, 20, 23]  1    229245  models.yolo.Detect                      [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model summary: 270 layers, 7235389 parameters, 7235389 gradients, 16.5 GFLOPs

Transferred 349/349 items from yolov5s.pt
Scaled weight_decay = 0.0005
optimizer: SGD with parameter groups 57 weight (no decay), 60 weight, 60 bias
WARNING: DP not recommended, use torch.distributed.run for best DDP Multi-GPU results.
See Multi-GPU Tutorial at https://github.com/ultralytics/yolov5/issues/475 to get started.
albumentations: Blur(always_apply=False, p=0.01, blur_limit=(3, 7)), MedianBlur(always_apply=False, p=0.01, blur_limit=(3, 7)), ToGray(always_apply=False, p=0.01), CLAHE(always_apply=False, p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
train: Scanning '/raid/USERDATA/userA/YOLO_detectors/yolov5_new/datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupt: 100%
val: Scanning '/raid/USERDATA/userA/YOLO_detectors/yolov5_new/datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupt: 100%|█
Plotting labels to runs/train/exp2/labels.jpg... 

AutoAnchor: 4.27 anchors/target, 0.994 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/train/exp2
Starting training for 300 epochs...

     Epoch   gpu_mem       box       obj       cls    labels  img_size
  0%|          | 0/8 [00:00<?, ?it/s]                                                                                                                                          
Traceback (most recent call last):
  File "train.py", line 668, in <module>
    main(opt)
  File "train.py", line 563, in main
    train(opt.hyp, opt, device, callbacks)
  File "train.py", line 349, in train
    pred = model(imgs)  # forward
  File "/raid/USERDATA/userA/YOLO_detectors/YOLOv5_enviroment/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/raid/USERDATA/userA/YOLO_detectors/YOLOv5_enviroment/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 158, in forward
    inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
  File "/raid/USERDATA/userA/YOLO_detectors/YOLOv5_enviroment/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 175, in scatter
    return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
  File "/raid/USERDATA/userA/YOLO_detectors/YOLOv5_enviroment/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 44, in scatter_kwargs
    inputs = scatter(inputs, target_gpus, dim) if inputs else []
  File "/raid/USERDATA/userA/YOLO_detectors/YOLOv5_enviroment/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 36, in scatter
    res = scatter_map(inputs)
  File "/raid/USERDATA/userA/YOLO_detectors/YOLOv5_enviroment/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 23, in scatter_map
    return list(zip(*map(scatter_map, obj)))
  File "/raid/USERDATA/userA/YOLO_detectors/YOLOv5_enviroment/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 19, in scatter_map
    return Scatter.apply(target_gpus, None, dim, obj)
  File "/raid/USERDATA/userA/YOLO_detectors/YOLOv5_enviroment/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 96, in forward
    outputs = comm.scatter(input, target_gpus, chunk_sizes, ctx.dim, streams)
  File "/raid/USERDATA/userA/YOLO_detectors/YOLOv5_enviroment/lib/python3.8/site-packages/torch/nn/parallel/comm.py", line 189, in scatter
    return tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams))
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

### Environment

YOLOv5 🚀 v6.1-170-gbff6e51 torch 1.11.0+cu113 CUDA:0 (A100-SXM4-40GB, 40537MiB) Python 3.8.10

pip list

Package Version


absl-py 1.0.0
albumentations 1.1.0
cachetools 4.2.4
certifi 2021.10.8
charset-normalizer 2.0.9
cycler 0.11.0
fonttools 4.28.3
google-auth 2.3.3
google-auth-oauthlib 0.4.6
grpcio 1.42.0
idna 3.3
imageio 2.13.3
importlib-metadata 4.8.2
joblib 1.1.0
kiwisolver 1.3.2
Markdown 3.3.6
matplotlib 3.5.1
networkx 2.6.3
numpy 1.21.4
oauthlib 3.1.1
opencv-python 4.5.4.60
opencv-python-headless 4.5.4.60
packaging 21.3
pandas 1.3.5
Pillow 8.4.0
pip 20.0.2
pkg-resources 0.0.0
protobuf 3.19.1
pyasn1 0.4.8
pyasn1-modules 0.2.8
pyparsing 3.0.6
python-dateutil 2.8.2
pytz 2021.3
PyWavelets 1.2.0
PyYAML 6.0
qudida 0.0.4
requests 2.26.0
requests-oauthlib 1.3.0
rsa 4.8
scikit-image 0.19.0
scikit-learn 1.0.1
scipy 1.7.3
seaborn 0.11.2
setuptools 44.0.0
six 1.16.0
tensorboard 2.7.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.0
thop 0.0.31.post2005241907
threadpoolctl 3.0.0
tifffile 2021.11.2
torch 1.11.0+cu113
torchaudio 0.11.0+cu113
torchvision 0.12.0+cu113
tqdm 4.62.3
typing-extensions 4.0.1
urllib3 1.26.7
Werkzeug 2.0.2
wheel 0.37.0
zipp 3.6.0


nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Thu_Feb_10_18:23:41_PST_2022
Cuda compilation tools, release 11.6, V11.6.112
Build cuda_11.6.r11.6/compiler.30978841_0

Ubuntu 20.04.3 LTS



### Minimal Reproducible Example

python train.py --multi-scale

### Additional

_No response_

### Are you willing to submit a PR?

- [ ] Yes I'd like to help by submitting a PR!

glenn-jocher commented 2 years ago

@DP1701 your error message clearly states RuntimeError: CUDA error: out of memory.

YOLOv5 🚀 can be trained on CPU, single-GPU, or multi-GPU. When training on GPU it is important to keep your batch-size small enough that you do not use all of your GPU memory, otherwise you will see a CUDA Out Of Memory (OOM) Error and your training will crash. You can observe your CUDA memory utilization using either the nvidia-smi command or by viewing your console output:

Screenshot 2021-05-28 at 12 19 51
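
As a rough illustration of checking per-GPU memory before launching a run, the sketch below parses the CSV output of nvidia-smi when it is called with --query-gpu=index,memory.used,memory.total and --format=csv,noheader,nounits. The helper names (parse_gpu_memory, query_gpu_memory) and the sample text are made up for this example and are not part of YOLOv5.

```python
import subprocess

# nvidia-smi query that prints one CSV row per GPU, values in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Turn rows such as '6, 0, 40537' into dictionaries (hypothetical helper)."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        rows.append({"index": index, "used_mib": used, "total_mib": total})
    return rows


def query_gpu_memory():
    """Run nvidia-smi and parse its output; requires an NVIDIA driver."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


# Made-up sample text in the same shape nvidia-smi prints.
sample = "0, 3, 40537\n6, 0, 40537\n"
parsed = parse_gpu_memory(sample)
idle = [gpu["index"] for gpu in parsed if gpu["used_mib"] == 0]
```

On a machine with several GPUs, a list like idle can help pick a value for --device instead of letting the run spread over every visible device.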

CUDA Out of Memory Solutions

If you encounter a CUDA OOM error, the steps you can take to reduce your memory usage are:

AutoBatch

You can use YOLOv5 AutoBatch (NEW) to find the best batch size for your training by passing --batch-size -1. AutoBatch will solve for a 90% CUDA memory-utilization batch-size given your training settings. AutoBatch is experimental, and only works for Single-GPU training. It may not work on all systems, and is not recommended for production use.

Screenshot 2021-11-06 at 12 31 10
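
To make the 90% target concrete, here is a minimal sketch of how a batch size could be estimated from a few profiled points. It assumes memory grows linearly with batch size and fits a straight line by least squares; this is an assumption for illustration, not a copy of the AutoBatch code. The numbers are the profile rows and the free-memory figure (39.47G) from the AutoBatch log posted later in this thread, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_batch_size(batch_sizes, mem_gb, free_gb, fraction=0.9):
    """Largest batch whose predicted memory stays within fraction * free_gb."""
    slope, intercept = fit_line(batch_sizes, mem_gb)
    return int((free_gb * fraction - intercept) / slope)


# Profiled at --imgsz 640 (values taken from the log in this thread).
batch_sizes = [1, 2, 4, 8, 16]
mem_gb = [0.281, 0.476, 0.883, 1.739, 3.347]

estimate = estimate_batch_size(batch_sizes, mem_gb, free_gb=39.47)
print(estimate)  # 172
```

The result agrees with the batch size of 172 reported in that log. Note that the estimate only holds for the image size it was profiled at.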

Good luck 🍀 and let us know if you have any other questions!

DP1701 commented 2 years ago

@glenn-jocher With --batch -1 --epochs 10 on A100 GPU device 6 (its memory is empty and nothing else is running on it):

Screenshot 2022-05-03 at 09 16 53

(YOLOv5_enviroment) userA@dgx:~/YOLO_detectors/yolov5_new/yolov5$ python train.py --batch -1 --epochs 10 --multi-scale --device 6
train: weights=yolov5s.pt, cfg=, data=data/coco128.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=10, batch_size=-1, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=6, multi_scale=True, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: up to date with https://github.com/ultralytics/yolov5 ✅
YOLOv5 🚀 v6.1-172-ge305aba torch 1.11.0+cu113 CUDA:6 (A100-SXM4-40GB, 40537MiB)

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Weights & Biases: run 'pip install wandb' to automatically track and visualize YOLOv5 🚀 runs (RECOMMENDED)
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/

                 from  n    params  module                                  arguments                     
  0                -1  1      3520  models.common.Conv                      [3, 32, 6, 2, 2]              
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]                
  2                -1  1     18816  models.common.C3                        [64, 64, 1]                   
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  4                -1  2    115712  models.common.C3                        [128, 128, 2]                 
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  6                -1  3    625152  models.common.C3                        [256, 256, 3]                 
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]              
  8                -1  1   1182720  models.common.C3                        [512, 512, 1]                 
  9                -1  1    656896  models.common.SPPF                      [512, 512, 5]                 
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]          
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]          
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]              
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]          
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]          
 24      [17, 20, 23]  1    229245  models.yolo.Detect                      [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model summary: 270 layers, 7235389 parameters, 7235389 gradients, 16.5 GFLOPs

Transferred 349/349 items from yolov5s.pt
AutoBatch: Computing optimal batch size for --imgsz 640
AutoBatch: CUDA:0 (A100-SXM4-40GB) 39.59G total, 0.07G reserved, 0.05G allocated, 39.47G free
      Params      GFLOPs  GPU_mem (GB)  forward (ms) backward (ms)                   input                  output
     7235389       16.53         0.281         22.84         14.51        (1, 3, 640, 640)                    list
     7235389       33.06         0.476         23.82         14.13        (2, 3, 640, 640)                    list
     7235389       66.13         0.883          23.1         14.99        (4, 3, 640, 640)                    list
     7235389       132.3         1.739         24.25         17.83        (8, 3, 640, 640)                    list
     7235389       264.5         3.347         36.36         28.69       (16, 3, 640, 640)                    list
AutoBatch: Using batch-size 172 for CUDA:0 35.63G/39.59G (90%)
Scaled weight_decay = 0.0013437500000000001
optimizer: SGD with parameter groups 57 weight (no decay), 60 weight, 60 bias
albumentations: Blur(always_apply=False, p=0.01, blur_limit=(3, 7)), MedianBlur(always_apply=False, p=0.01, blur_limit=(3, 7)), ToGray(always_apply=False, p=0.01), CLAHE(always_apply=False, p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
train: Scanning '/raid/USERDATA/userA/YOLO_detectors/yolov5_new/datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?i
val: Scanning '/raid/USERDATA/userA/YOLO_detectors/yolov5_new/datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?it/
Plotting labels to runs/train/exp10/labels.jpg... 

AutoAnchor: 4.27 anchors/target, 0.994 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/train/exp10
Starting training for 10 epochs...

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       0/9     35.2G   0.04462   0.05522   0.01507      1686       704: 100%|██████████| 1/1 [00:03<00:00,  3.85s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 1/1 [00:01<00:00,  1.45s/it]
                 all        128        929      0.669      0.661      0.712      0.475

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       1/9     39.2G   0.04532   0.04671   0.01566      1492       736: 100%|██████████| 1/1 [00:00<00:00,  2.00it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 1/1 [00:01<00:00,  1.45s/it]
                 all        128        929      0.701      0.631      0.703       0.46

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       2/9     39.2G   0.05562    0.2054   0.03119      1816       352: 100%|██████████| 1/1 [00:00<00:00,  6.71it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 1/1 [00:01<00:00,  1.38s/it]
                 all        128        929      0.708      0.632      0.701      0.457

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       3/9     39.2G   0.04507   0.07174   0.01706      1529       576: 100%|██████████| 1/1 [00:00<00:00,  3.15it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 1/1 [00:00<00:00,  2.53it/s]
                 all        128        929      0.737      0.623      0.706      0.462

     Epoch   gpu_mem       box       obj       cls    labels  img_size
  0%|          | 0/1 [00:00<?, ?it/s]                                                                                                                                                                           
Traceback (most recent call last):
  File "train.py", line 668, in <module>
    main(opt)
  File "train.py", line 563, in main
    train(opt.hyp, opt, device, callbacks)
  File "train.py", line 349, in train
    pred = model(imgs)  # forward
  File "/raid/USERDATA/userA/YOLO_detectors/YOLOv5_enviroment/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/raid/USERDATA/userA/YOLO_detectors/yolov5_new/yolov5/models/yolo.py", line 135, in forward
    return self._forward_once(x, profile, visualize)  # single-scale inference, train
  File "/raid/USERDATA/userA/YOLO_detectors/yolov5_new/yolov5/models/yolo.py", line 158, in _forward_once
    x = m(x)  # run
  File "/raid/USERDATA/userA/YOLO_detectors/YOLOv5_enviroment/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/raid/USERDATA/userA/YOLO_detectors/yolov5_new/yolov5/models/yolo.py", line 57, in forward
    x[i] = self.m[i](x[i])  # conv
  File "/raid/USERDATA/userA/YOLO_detectors/YOLOv5_enviroment/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/raid/USERDATA/userA/YOLO_detectors/YOLOv5_enviroment/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 447, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/raid/USERDATA/userA/YOLO_detectors/YOLOv5_enviroment/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 443, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
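
One possible reading of this failure is that the batch size of 172 was tuned for 640-pixel images, while --multi-scale also trains on larger ones. The sketch below rests on two assumptions that are not confirmed in this thread: that multi-scale picks a square size between 0.5x and 1.5x of --imgsz rounded down to a 32-pixel stride (consistent with the 352 to 928 sizes in the logs here), and that memory grows roughly with the number of pixels. The function names are hypothetical.

```python
def multi_scale_range(imgsz, stride=32):
    """Smallest and largest image size under the assumed 0.5x to 1.5x rule."""
    smallest = int(imgsz * 0.5) // stride * stride
    largest = int(imgsz * 1.5) // stride * stride
    return smallest, largest


def relative_memory(size, base):
    """Assumed memory ratio: proportional to pixel count (size squared)."""
    return (size / base) ** 2


def safe_batch(batch_at_base, imgsz, stride=32):
    """Shrink a batch tuned at imgsz so the largest scale needs about the same memory."""
    _, largest = multi_scale_range(imgsz, stride)
    return int(batch_at_base / relative_memory(largest, imgsz))


sizes = multi_scale_range(640)          # (320, 960)
growth = relative_memory(sizes[1], 640)  # 2.25
suggested = safe_batch(172, 640)         # 76
print(sizes, growth, suggested)
```

Under these assumptions a 960-pixel batch needs about 2.25 times the memory of a 640-pixel one, so a batch size sized for 90% utilization at 640 has no headroom left. Treat the suggested value as a starting point to test, not a guarantee.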

With --batch 8 --epochs 10

(YOLOv5_enviroment) userA@dgx:~/YOLO_detectors/yolov5_new/yolov5$ python train.py --batch 8 --epochs 10 --multi-scale --device 6
train: weights=yolov5s.pt, cfg=, data=data/coco128.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=10, batch_size=8, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=6, multi_scale=True, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: up to date with https://github.com/ultralytics/yolov5 ✅
YOLOv5 🚀 v6.1-172-ge305aba torch 1.11.0+cu113 CUDA:6 (A100-SXM4-40GB, 40537MiB)

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Weights & Biases: run 'pip install wandb' to automatically track and visualize YOLOv5 🚀 runs (RECOMMENDED)
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/

                 from  n    params  module                                  arguments                     
  0                -1  1      3520  models.common.Conv                      [3, 32, 6, 2, 2]              
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]                
  2                -1  1     18816  models.common.C3                        [64, 64, 1]                   
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  4                -1  2    115712  models.common.C3                        [128, 128, 2]                 
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  6                -1  3    625152  models.common.C3                        [256, 256, 3]                 
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]              
  8                -1  1   1182720  models.common.C3                        [512, 512, 1]                 
  9                -1  1    656896  models.common.SPPF                      [512, 512, 5]                 
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]          
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]          
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]              
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]          
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]          
 24      [17, 20, 23]  1    229245  models.yolo.Detect                      [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model summary: 270 layers, 7235389 parameters, 7235389 gradients, 16.5 GFLOPs

Transferred 349/349 items from yolov5s.pt
Scaled weight_decay = 0.0005
optimizer: SGD with parameter groups 57 weight (no decay), 60 weight, 60 bias
albumentations: Blur(always_apply=False, p=0.01, blur_limit=(3, 7)), MedianBlur(always_apply=False, p=0.01, blur_limit=(3, 7)), ToGray(always_apply=False, p=0.01), CLAHE(always_apply=False, p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
train: Scanning '/raid/USERDATA/userA/YOLO_detectors/yolov5_new/datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?i
val: Scanning '/raid/USERDATA/userA/YOLO_detectors/yolov5_new/datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?it/
Plotting labels to runs/train/exp12/labels.jpg... 

AutoAnchor: 4.27 anchors/target, 0.994 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/train/exp12
Starting training for 10 epochs...

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       0/9     3.82G   0.04653   0.07317   0.02241       119       448: 100%|██████████| 16/16 [00:03<00:00,  4.15it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 8/8 [00:00<00:00, 14.67it/s]
                 all        128        929      0.662      0.685       0.72      0.471

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       1/9     4.05G   0.04652   0.07795    0.0223       113       640: 100%|██████████| 16/16 [00:01<00:00, 14.66it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 8/8 [00:00<00:00, 15.19it/s]
                 all        128        929      0.805      0.618      0.735      0.475

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       2/9     4.29G   0.04754   0.08584   0.01922        84       928: 100%|██████████| 16/16 [00:01<00:00, 14.69it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 8/8 [00:00<00:00, 14.48it/s]
                 all        128        929      0.743      0.641      0.717      0.457

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       3/9     4.29G   0.04975   0.08548   0.01883       119       832: 100%|██████████| 16/16 [00:01<00:00, 15.61it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 8/8 [00:00<00:00, 15.16it/s]
                 all        128        929      0.739      0.653      0.714      0.427

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       4/9     4.29G    0.0512   0.06284   0.02203        69       384: 100%|██████████| 16/16 [00:01<00:00, 14.97it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 8/8 [00:00<00:00, 11.38it/s]
                 all        128        929      0.653      0.645      0.659      0.354

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       5/9     4.29G   0.05181   0.08603   0.01982        61       448: 100%|██████████| 16/16 [00:01<00:00, 15.84it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 8/8 [00:00<00:00, 15.19it/s]
                 all        128        929      0.734      0.666      0.725      0.394

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       6/9     4.29G   0.05344   0.07487   0.01788       125       800: 100%|██████████| 16/16 [00:01<00:00, 15.45it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 8/8 [00:00<00:00, 14.91it/s]
                 all        128        929      0.703      0.627      0.694      0.393

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       7/9     4.29G   0.05734   0.07648   0.01843        76       544: 100%|██████████| 16/16 [00:01<00:00, 15.83it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 8/8 [00:00<00:00, 14.70it/s]
                 all        128        929      0.692      0.597      0.672      0.375

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       8/9     4.29G   0.05089   0.09513    0.0185        83       416: 100%|██████████| 16/16 [00:01<00:00, 15.81it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 8/8 [00:00<00:00, 14.99it/s]
                 all        128        929      0.777       0.65      0.733      0.455

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       9/9     4.29G   0.05116   0.09732   0.01881       153       800: 100%|██████████| 16/16 [00:01<00:00, 15.54it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 8/8 [00:00<00:00, 14.55it/s]
                 all        128        929      0.714      0.703      0.755      0.478

10 epochs completed in 0.006 hours.
Optimizer stripped from runs/train/exp12/weights/last.pt, 14.8MB
Optimizer stripped from runs/train/exp12/weights/best.pt, 14.8MB
...
glenn-jocher commented 2 years ago

@DP1701 👋 hi, thanks for letting us know about this possible problem with YOLOv5 🚀. If batch -1 causes issues then I'd suggest you don't use it.

Symbadian commented 1 year ago

Hi @DP1701, I am running on a GPU. Can you guide me on how you initialized the GPU device, please? Was there something in the code that you amended, and if so, which part of the code was it?

I am trying to run my GPU via a Linux server and it's proving extremely challenging!

Thanks in advance for acknowledging my digital presence.

DP1701 commented 1 year ago

Hi @Symbadian,

You don't have to change anything in the code. The following command in the terminal is sufficient:

(For multi-GPU training)

python3 -m torch.distributed.launch --nproc_per_node NUMBER_OF_GPUs train.py --data path_to_your_data --img image_size --weights weights --batch batch_size --epochs number_of_epochs --device number_of_devices

The placeholders (NUMBER_OF_GPUs, path_to_your_data, image_size, weights, batch_size, number_of_epochs, number_of_devices) are values you have to provide. Important: The batch size must be greater than 0 when using multi-GPU training.

It is important that you have installed PyTorch with CUDA support. Otherwise it will not work.
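
One way to check this before launching training is to ask PyTorch directly. The snippet below is a minimal sketch and not part of YOLOv5: the helper name `cuda_report` is made up for illustration, and it relies only on `torch.version.cuda`, `torch.cuda.is_available()` and `torch.cuda.device_count()`.

```python
def cuda_report():
    """Return a small dict describing the local PyTorch/CUDA setup.

    `cuda_report` is a hypothetical helper name, not a YOLOv5 or PyTorch API.
    """
    report = {
        "torch_installed": False,
        "torch_version": None,
        "built_with_cuda": None,
        "cuda_available": False,
        "gpu_count": 0,
    }
    try:
        import torch
    except ImportError:
        # torch is missing entirely, so there is nothing more to inspect.
        return report

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    # torch.version.cuda is None for CPU-only builds.
    report["built_with_cuda"] = torch.version.cuda
    report["cuda_available"] = bool(torch.cuda.is_available())
    report["gpu_count"] = int(torch.cuda.device_count())
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `built_with_cuda` is `None`, the installed build is CPU-only. If it is set but `cuda_available` is `False`, the build supports CUDA but cannot see a usable GPU or driver.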

Symbadian commented 1 year ago

@DP1701 hi pal, I wish that were the case! This is my error with the torch library, and I am not sure how to solve it:

```
ERROR: Could not find a version that satisfies the requirement torchvision>=0.8.1 (from versions: 0.1.6, 0.1.7, 0.1.8, 0.1.9, 0.2.0, 0.2.1, 0.2.2, 0.2.2.post2, 0.2.2.post3)
ERROR: No matching distribution found for torchvision>=0.8.1
requirements: Command 'pip install "torchvision>=0.8.1" ' returned non-zero exit status 1.
YOLOv5 🚀 v6.2-61-gffbce385 Python-3.10.8 torch-1.12.1 CPU
```
Symbadian commented 1 year ago

@DP1701 I ran this code and it came back with launch.py: error: unrecognized arguments: --nproc_per_node 4 train.py --data coco128.yaml --img. Where can I find the repo download that you are using? Can you guide me to it, please?

DP1701 commented 1 year ago

What packages do you have installed?

pip3 list

Try this:

python3 train.py 

Does it work?

Symbadian commented 1 year ago

@DP1701 hey pal, I managed to install the necessary PyTorch:

```
Package                 Version
----------------------- --------------------
absl-py                 1.4.0
aiohttp                 3.8.3
aiosignal               1.3.1
asttokens               2.2.1
async-timeout           4.0.2
attrs                   22.2.0
backcall                0.2.0
blinker                 1.5
Bottleneck              1.3.5
brotlipy                0.7.0
cachetools              5.2.1
certifi                 2022.12.7
cffi                    1.15.1
charset-normalizer      3.0.1
click                   8.0.4
colorama                0.4.6
contourpy               1.0.5
cryptography            38.0.4
cycler                  0.11.0
decorator               5.1.1
executing               1.2.0
flit_core               3.6.0
fonttools               4.25.0
frozenlist              1.3.3
future                  0.18.2
google-auth             2.15.0
google-auth-oauthlib    0.4.6
grpcio                  1.42.0
idna                    3.4
importlib-metadata      6.0.0
ipython                 8.8.0
jedi                    0.18.2
kiwisolver              1.4.4
Markdown                3.4.1
MarkupSafe              2.1.1
matplotlib              3.6.2
matplotlib-inline       0.1.6
multidict               6.0.2
munkres                 1.1.4
numexpr                 2.8.4
numpy                   1.23.5
oauthlib                3.2.2
opencv-python           4.7.0.68
packaging               23.0
pandas                  1.5.2
parso                   0.8.3
patsy                   0.5.3
pexpect                 4.8.0
pickleshare             0.7.5
Pillow                  9.3.0
pip                     22.3.1
prompt-toolkit          3.0.36
protobuf                3.20.1
psutil                  5.9.4
ptyprocess              0.7.0
pure-eval               0.2.2
pyasn1                  0.4.8
pyasn1-modules          0.2.7
pycparser               2.21
Pygments                2.14.0
PyJWT                   2.6.0
pyOpenSSL               23.0.0
pyparsing               3.0.9
PySocks                 1.7.1
python-dateutil         2.8.2
pytz                    2022.7
pyu2f                   0.1.5
PyYAML                  6.0
requests                2.28.2
requests-oauthlib       1.3.1
rsa                     4.9
scipy                   1.9.3
seaborn                 0.12.2
setuptools              65.6.3
six                     1.16.0
stack-data              0.6.2
statsmodels             0.13.2
tensorboard             2.11.2
tensorboard-data-server 0.6.1
tensorboard-plugin-wit  1.8.1
thop                    0.1.1.post2209072238
torch                   1.12.1
torchvision             0.1.8
tqdm                    4.64.1
traitlets               5.8.1
typing_extensions       4.4.0
urllib3                 1.26.14
wcwidth                 0.2.6
Werkzeug                2.2.2
wheel                   0.37.1
yarl                    1.8.1
zipp                    3.11.0
```
Symbadian commented 1 year ago

And when I ran the next bit (python3 train.py), I got a new error:

```
Traceback (most recent call last):
  File "/home/MattCCTV/YOLO9Classes/yolov5/train.py", line 42, in <module>
    import val as validate  # for end-of-epoch mAP
  File "/home/MattCCTV/YOLO9Classes/yolov5/val.py", line 37, in <module>
    from models.common import DetectMultiBackend
  File "/home/MattCCTV/YOLO9Classes/yolov5/models/common.py", line 23, in <module>
    from utils.dataloaders import exif_transpose, letterbox
  File "/home/MattCCTV/YOLO9Classes/yolov5/utils/dataloaders.py", line 31, in <module>
    from utils.augmentations import (Albumentations, augment_hsv, classify_albumentations, classify_transforms, copy_paste,
  File "/home/MattCCTV/YOLO9Classes/yolov5/utils/augmentations.py", line 12, in <module>
    import torchvision.transforms.functional as TF
ModuleNotFoundError: No module named 'torchvision.transforms.functional'; 'torchvision.transforms' is not a package
```

I am now trying to find out what this is and how to solve it. I am not sure whether this is problematic!
I followed all the instructions in the README file line by line,

and still these errors persist.
DP1701 commented 1 year ago

Uninstall torch and torchvision.

Then type in:

pip3 install torch torchvision
Symbadian commented 1 year ago

Still the same, pal:

$ python3 train.py
Traceback (most recent call last):
  File "/home/MattCCTV/YOLO9Classes/yolov5/train.py", line 42, in <module>
    import val as validate  # for end-of-epoch mAP
  File "/home/MattCCTV/YOLO9Classes/yolov5/val.py", line 37, in <module>
    from models.common import DetectMultiBackend
  File "/home/MattCCTV/YOLO9Classes/yolov5/models/common.py", line 23, in <module>
    from utils.dataloaders import exif_transpose, letterbox
  File "/home/MattCCTV/YOLO9Classes/yolov5/utils/dataloaders.py", line 31, in <module>
    from utils.augmentations import (Albumentations, augment_hsv, classify_albumentations, classify_transforms, copy_paste,
  File "/home/MattCCTV/YOLO9Classes/yolov5/utils/augmentations.py", line 13, in <module>
    import torchvision.transforms.functional as TF
ModuleNotFoundError: No module named 'torchvision.transforms.functional'; 'torchvision.transforms' is not a package

DP1701 commented 1 year ago

Do you still have torchvision 0.1.8 installed?

Symbadian commented 1 year ago

Yes, I do! But it's still giving me an error.

Cheers to positive perspectives!

Matt.



DP1701 commented 1 year ago

Install this:

pip3 install torch==1.12.0+cu116 torchvision==0.13.0+cu116 --extra-index-url https://download.pytorch.org/whl/cu116

But uninstall torch and torchvision beforehand. Torchvision 0.1.8 is out of date.

Symbadian commented 1 year ago

Ahh, I see. OK, will try that.

Cheers to positive perspectives!

Matt.



Symbadian commented 1 year ago

pip install torch==1.12.0+cu116 torchvision==0.13.0+cu116 --extra-index-url https://download.pytorch.org/whl/cu116

Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cu116

ERROR: Could not find a version that satisfies the requirement torch==1.12.0+cu116 (from versions: none)

ERROR: No matching distribution found for torch==1.12.0+cu116


Symbadian commented 1 year ago

@DP1701 got this error:

```
Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cu116
ERROR: Could not find a version that satisfies the requirement torch==1.12.0+cu116 (from versions: none)
ERROR: No matching distribution found for torch==1.12.0+cu116
```
DP1701 commented 1 year ago

Here are all the versions:

Link

You have to check which system, which Python version, and which CUDA version you have installed.

You could also try just:

pip3 install torch==1.12.0
pip3 install torchvision==0.13.0

Python >=3.7, <=3.10 is required
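
To collect the system and Python facts in one go, a small sketch like the following may help. The function names `python_supported` and `describe_system` are illustrative only, and the 3.7 to 3.10 range is simply the one quoted above. The CUDA version is not covered here and has to be checked separately (for example via the driver tools on the machine).

```python
import platform
import sys


def python_supported(version_info, low=(3, 7), high=(3, 10)):
    """True if (major, minor) lies inside the inclusive range quoted above."""
    return low <= tuple(version_info[:2]) <= high


def describe_system():
    """Collect the facts that decide which prebuilt packages can match."""
    return {
        "os": platform.system(),        # e.g. Linux
        "machine": platform.machine(),  # CPU architecture, e.g. x86_64
        "python": platform.python_version(),
        "python_supported": python_supported(sys.version_info),
    }


if __name__ == "__main__":
    for key, value in describe_system().items():
        print(f"{key}: {value}")
```

Prebuilt pip packages are published per operating system, CPU architecture and Python version, so an uncommon `machine` value is one possible reason for pip reporting that no matching version exists.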

Symbadian commented 1 year ago

@DP1701 Yep, went through those already. None of them seems to be working for me. Hence, I tried torchvision 0.1.8, and that seems to be the only one that's working.

Symbadian commented 1 year ago

Can you guide me to the repo, so that I can get the latest files for processing? This cannot be the right way to do the installs. I am getting too many errors at this stage.

DP1701 commented 1 year ago

You need at least torchvision>=0.8.1 for YOLOv5.
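
To see why 0.1.8 does not satisfy this, the versions have to be compared number by number rather than as text. The following is a rough sketch under that assumption. `version_tuple`, `meets_minimum` and `installed_version` are made-up helper names, and the parsing is deliberately simplistic (it ignores local tags such as `+cu116`).

```python
import re
from importlib import metadata


def version_tuple(text):
    """Turn '0.13.0+cu116' into (0, 13, 0); the local '+cu116' tag is dropped."""
    public = text.split("+", 1)[0]
    return tuple(int(part) for part in re.findall(r"\d+", public))


def meets_minimum(installed, minimum):
    """Compare two version strings numerically, component by component."""
    return version_tuple(installed) >= version_tuple(minimum)


def installed_version(package):
    """Installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(meets_minimum("0.1.8", "0.8.1"))         # False: 1 < 8 in the second field
    print(meets_minimum("0.13.0+cu116", "0.8.1"))  # True: 13 > 8 in the second field
    print(installed_version("torchvision"))
```

Comparing the raw strings would get the second case wrong, since "0.13.0" sorts before "0.8.1" alphabetically.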

Symbadian commented 1 year ago

@DP1701 I would have to agree with you here! But the packages are not installing no matter what I do! I have been trying to get these installs working for two weeks now, and every day I am faced with the same challenge!

Maybe I should just update the repo files and be done with it???!!!

DP1701 commented 1 year ago

If by repo files you mean the files from YOLO, then nothing will change. Alternatively, you could try miniforge (Conda): Link.

After the installation, create a new environment with:

conda create --name YOLOv5 python=3.9.13
conda activate YOLOv5

Then install PyTorch using the Conda install instructions.

DP1701 commented 1 year ago

And then install the rest that YOLOv5 needs with pip

Symbadian commented 1 year ago

@DP1701 Ok will try that now..

Symbadian commented 1 year ago

conda create -name YOLOv5 python=3.9.13

Hi @DP1701 I got this:

```
Collecting package metadata (current_repodata.json): done
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed

PackagesNotFoundError: The following packages are not available from current channels:

  - yolo9c

Current channels:

  - https://repo.anaconda.com/pkgs/main/linux-ppc64le
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/linux-ppc64le
  - https://repo.anaconda.com/pkgs/r/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.
```
DP1701 commented 1 year ago

Insert a second - character before name:

conda create --name YOLOv5 python=3.9.13
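
Why one dash makes such a difference can be illustrated with Python's standard `argparse` module. This is only a toy parser with a hypothetical `create-env` command, not conda's actual code, but it shows how a single-dash `-name` can be read as the short flag `-n` with the value `ame`, so that the intended environment name ends up in the package list. That would be consistent with the PackagesNotFoundError above, which complains about the environment name as if it were a package.

```python
import argparse

# Hypothetical stand-in for a "create" command; NOT conda's real parser.
parser = argparse.ArgumentParser(prog="create-env")
parser.add_argument("-n", "--name", help="environment name")
parser.add_argument("packages", nargs="*", help="package specs to install")

# One dash: "-name" is parsed as "-n" with the attached value "ame",
# so "YOLOv5" falls through to the list of packages.
wrong = parser.parse_args(["-name", "YOLOv5", "python=3.9.13"])
print(wrong.name, wrong.packages)  # ame ['YOLOv5', 'python=3.9.13']

# Two dashes: "--name" takes "YOLOv5" as the environment name.
right = parser.parse_args(["--name", "YOLOv5", "python=3.9.13"])
print(right.name, right.packages)  # YOLOv5 ['python=3.9.13']
```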
Symbadian commented 1 year ago

OK, will try that now. It took quite a while to delete the files from the GPU server, my apologies.

```
  File "/home/MattCCTV/YOLO9Classes/yolov5/train.py", line 29, in <module>
    import torch
ModuleNotFoundError: No module named 'torch'
```

May I request some guidance on the torch command that I should apply here, please?

My confidence is a tad low. I thought I had an idea, but it seems to be trickier than expected!

Thus far, I have been unsuccessful with my selection of torch.

DP1701 commented 1 year ago

Once you have created and activated the environment correctly, you must first install torch and torchvision with Conda. Take a look at the link to PyTorch that I sent you today.

Symbadian commented 1 year ago

@DP1701 I see, but when I check out the compatibility: https://github.com/pytorch/pytorch/issues/47776

Torch doesn't seem to work well here with conda create --name YOLOv5 python=3.9.13?!

So I tried conda install -c pytorch-lts torchvision

AND I GOT THE ERROR BELOW:

```
to be incompatible with the existing python installation in your environment:

Specifications:

  - torchvision -> python[version='>=3.8,<3.9.0a0']

Your python: python==3.9.12

If python is on the left-most side of the chain, that's the version you've asked for.
When python appears to the right, that indicates that the thing on the left is somehow
not available for the python version you are constrained to. Note that conda will not
change your python version to a different minor version unless you explicitly specify
that.

The following specifications were found to be incompatible with your CUDA driver:

  - feature:/linux-ppc64le::__cuda==10.2=0
  - feature:|@/linux-ppc64le::__cuda==10.2=0

Your installed CUDA driver is: 10.2
```

I'm in need of some assistance to understand this, please.