ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Training fails to start due to data augmentation errors (opencv) #8894

Closed DLumi closed 2 years ago

DLumi commented 2 years ago

Search before asking

YOLOv5 Component

Training

Bug

When I start training on my custom dataset of '.jpg' images, it fails with an OpenCV error, even though the preliminary check reports that my images are OK. I have no idea what could be going wrong. Here's my traceback:

Image sizes 640 train, 640 val
Using 2 dataloader workers
Logging results to runs/train/exp5
Starting training for 300 epochs...

     Epoch   gpu_mem       box       obj       cls    labels  img_size
  0% 0/27 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 642, in <module>
    main(opt)
  File "train.py", line 537, in main
    train(opt.hyp, opt, device, callbacks)
  File "train.py", line 301, in train
    for i, (imgs, targets, paths, _) in pbar:  # batch -------------------------------------------------------------
  File "/usr/local/lib/python3.7/dist-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/content/gdrive/MyDrive/repos/utils/dataloaders.py", line 167, in __iter__
    yield next(self.iterator)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 652, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1347, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1373, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.7/dist-packages/torch/_utils.py", line 461, in reraise
    raise exception
cv2.error: Caught error in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/content/gdrive/MyDrive/repos/utils/dataloaders.py", line 640, in __getitem__
    augment_hsv(img, hgain=hyp['hsv_h'], sgain=hyp['hsv_s'], vgain=hyp['hsv_v'])
  File "/content/gdrive/MyDrive/repos/utils/augmentations.py", line 61, in augment_hsv
    cv2.cvtColor(im_hsv, cv2.COLOR_HSV2BGR, dst=im)  # no return needed
cv2.error: OpenCV(4.6.0) :-1: error: (-5:Bad argument) in function 'cvtColor'
> Overload resolution failed:
>  - Layout of the output array dst is incompatible with cv::Mat
>  - Expected Ptr<cv::UMat> for argument 'dst'

Environment

I run the notebook in Colab; here's the environment I got today:

YOLOv5 🚀 v6.1-322-gd5116bb Python-3.7.13 torch-1.12.0+cu113 CUDA:0 (Tesla T4, 15110MiB) Setup complete ✅ (2 CPUs, 12.7 GB RAM, 37.4/78.2 GB disk)

Minimal Reproducible Example

!python train.py --img 640 --batch 32 --epochs 300 --data gallery.yaml --weights runs/train/exp3/weights/best.pt --freeze 10

Additional

No response

Are you willing to submit a PR?

github-actions[bot] commented 2 years ago

👋 Hello @DLumi, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://ultralytics.com or email support@ultralytics.com.

Requirements

Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu every 24 hours and on every commit.

DLumi commented 2 years ago

I've double-checked the data by simply opening the images and converting them to RGB with cv2, and it looks totally fine on my end. Since I'm training the model on a somewhat toy dataset, I'll share it to help investigate the problem.

/the link has been removed by the author/

glenn-jocher commented 2 years ago

@DLumi 👋 Hello! Thanks for asking about YOLOv5 🚀 dataset formatting. If your dataset is failing but all of the common datasets (COCO128, COCO, VOC, etc.) are training correctly then the issue lies with your dataset.

To train correctly your data must be in YOLOv5 format. Please see our Train Custom Data tutorial for full documentation on dataset setup and all steps required to start training your first model. A few excerpts from the tutorial:

1.1 Create dataset.yaml

COCO128 is an example small tutorial dataset composed of the first 128 images in COCO train2017. These same 128 images are used for both training and validation to verify our training pipeline is capable of overfitting. data/coco128.yaml, shown below, is the dataset config file that defines 1) the dataset root directory path and relative paths to train / val / test image directories (or *.txt files with image paths), 2) the number of classes nc and 3) a list of class names:

# Train/val/test sets as 1) dir: path/to/imgs, 2) file: path/to/imgs.txt, or 3) list: [path/to/imgs1, path/to/imgs2, ..]
path: ../datasets/coco128  # dataset root dir
train: images/train2017  # train images (relative to 'path') 128 images
val: images/train2017  # val images (relative to 'path') 128 images
test:  # test images (optional)

# Classes
nc: 80  # number of classes
names: [ 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light',
         'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
         'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee',
         'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard',
         'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple',
         'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch',
         'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
         'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear',
         'hair drier', 'toothbrush' ]  # class names
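A quick consistency check on such a config can be sketched as follows. This is a minimal sketch, not YOLOv5 code: `check_dataset_config` is a hypothetical helper, it assumes the YAML file has already been parsed into a plain dict, and it assumes `train`, `val`, `nc` and `names` are all treated as required.

```python
# Hypothetical helper (not part of YOLOv5): checks a dataset config that has
# already been parsed into a dict, e.g. by a YAML loader.
def check_dataset_config(cfg):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    # Assumption: these four keys are treated as required.
    for key in ("train", "val", "nc", "names"):
        if cfg.get(key) in (None, ""):
            problems.append(f"missing required key: {key}")
    nc, names = cfg.get("nc"), cfg.get("names")
    # The class count must agree with the length of the class-name list.
    if isinstance(nc, int) and isinstance(names, (list, tuple)) and nc != len(names):
        problems.append(f"nc is {nc} but names has {len(names)} entries")
    return problems


# Made-up two-class configs for illustration only.
good = {"path": "../datasets/coco128", "train": "images/train2017",
        "val": "images/train2017", "nc": 2, "names": ["person", "bicycle"]}
bad = dict(good, nc=3)

print(check_dataset_config(good))  # []
print(check_dataset_config(bad))   # ['nc is 3 but names has 2 entries']
```

A mismatch between `nc` and `names` is cheap to catch before a training run is launched, which is the only reason for the check here.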

1.2 Create Labels

After using a tool like Roboflow Annotate to label your images, export your labels to YOLO format, with one *.txt file per image (if no objects in image, no *.txt file is required). The *.txt file specifications are:

Image Labels

The label file corresponding to the above image contains 2 persons (class 0) and a tie (class 27):

1.3 Organize Directories

Organize your train and val images and labels according to the example below. YOLOv5 assumes /coco128 is inside a /datasets directory next to the /yolov5 directory. YOLOv5 locates labels automatically for each image by replacing the last instance of /images/ in each image path with /labels/. For example:

../datasets/coco128/images/im0.jpg  # image
../datasets/coco128/labels/im0.txt  # label
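The path substitution described above can be sketched as follows. This is a minimal sketch that assumes forward-slash separators; `image_to_label_path` is a hypothetical name and not the function YOLOv5 itself uses.

```python
import os


def image_to_label_path(image_path, sep="/"):
    """Swap the last '/images/' component for '/labels/' and the extension for '.txt'."""
    images_dir, labels_dir = f"{sep}images{sep}", f"{sep}labels{sep}"
    # rpartition splits on the LAST occurrence, so an earlier 'images' directory is left alone.
    head, found, tail = image_path.rpartition(images_dir)
    if not found:
        raise ValueError(f"no {images_dir!r} component in {image_path!r}")
    stem = os.path.splitext(tail)[0]
    return f"{head}{labels_dir}{stem}.txt"


print(image_to_label_path("../datasets/coco128/images/im0.jpg"))
# ../datasets/coco128/labels/im0.txt
```

Because only the last `/images/` is replaced, a dataset stored under a parent directory that is itself named `images` should still resolve to the expected label file.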

Good luck 🍀 and let us know if you have any other questions!

DLumi commented 2 years ago

Thanks for spending less than 10 seconds on reading the issue, and thanks for the tutorial, but I've followed it already. I'm pretty sure the data is fine, as I exported my labels from Label Studio specifically in YOLO format. My images are also in the exact folders you mentioned, and the script managed to find them during the preliminary check, so the file structure is not the cause of the issue. I double-checked the labels, though. I'm almost certain that I hit some weird and rare augmentation bug, since that seems to be the place where something goes wrong.

Here you can see that I successfully plotted the de-normalized label on my training image. [image]

The Y-coordinate is 0, as seen here: [268 0] [1266 720]

Code for reproduction in case you need it:

from pathlib import Path

import cv2
import numpy as np
import matplotlib.pyplot as plt

image_folder = Path(r'images\train')
label_folder = Path(r'labels\train')

# iterdir() order is arbitrary; sort both listings so images[idx] and labels[idx] match
images = sorted(image_folder.iterdir())
labels = sorted(label_folder.iterdir())

idx = 0

image = cv2.imread(images[idx].as_posix(), cv2.IMREAD_COLOR)
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

with open(labels[idx]) as f:
    label = np.asarray(f.readline().split()[1:]).astype('float32')

h, w = image.shape[:2]

c_x = label[0] * w
c_y = label[1] * h
l_w = label[2] * w
l_h = label[3] * h

half_w = l_w // 2
half_h = l_h // 2

print(np.intp([c_x - half_w, c_y - half_h]), np.intp([c_x + half_w, c_y + half_h]))

cv2.rectangle(image, np.intp([c_x - half_w, c_y - half_h]), np.intp([c_x + half_w, c_y + half_h]), color=(0, 255, 0), thickness=3)

plt.figure(figsize=(8, 8))
plt.imshow(image)
plt.axis('off')
plt.show()
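The same de-normalization can also be checked without OpenCV or matplotlib. This is a minimal sketch that assumes each label line has the form `class x_center y_center width height` with all four values normalized to [0, 1]; `yolo_to_corners`, the sample line and the 1280x720 image size are made up for illustration.

```python
def yolo_to_corners(line, img_w, img_h):
    """Convert one 'class x_center y_center width height' label line to pixel corners."""
    cls, xc, yc, w, h = line.split()
    xc, yc, w, h = float(xc), float(yc), float(w), float(h)
    # Reject values outside the normalized range before scaling them.
    for name, value in (("x_center", xc), ("y_center", yc), ("width", w), ("height", h)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name}={value} is outside the normalized range [0, 1]")
    x1, y1 = (xc - w / 2) * img_w, (yc - h / 2) * img_h
    x2, y2 = (xc + w / 2) * img_w, (yc + h / 2) * img_h
    return int(cls), (round(x1), round(y1)), (round(x2), round(y2))


# Made-up label: a class-0 box spanning the full image height.
print(yolo_to_corners("0 0.5 0.5 0.5 1.0", 1280, 720))
# (0, (320, 0), (960, 720))
```

A box that touches the image border, like the one with Y-coordinate 0 above, is still a valid label under this check.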

glenn-jocher commented 2 years ago

@DLumi thanks for the reproduction code, but it seems this is custom python code and not official YOLOv5 code. If cv2 or matplotlib are producing errors please raise them directly on the relevant repositories.

DLumi commented 2 years ago

> @DLumi thanks for the reproduction code, but it seems this is custom python code and not official YOLOv5 code. If cv2 or matplotlib are producing errors please raise them directly on the relevant repositories.

It's the reproduction code of me denormalizing my labels and plotting them on the training image (so you could actually see they are ok). As I mentioned earlier, I only interact with the repo via CLI, so here's the actual training code I used: !python train.py --img 640 --batch 32 --epochs 300 --data gallery.yaml --weights runs/train/exp3/weights/best.pt --freeze 10

I could dive into this wonderful rabbit hole to find out what exactly went wrong, but I'd rather not. Besides, I'm pretty sure you'll be able to get to the bottom of this much faster than I would, especially with the source data on hand.

glenn-jocher commented 2 years ago

@DLumi I'm not able to reproduce any issues using your command with a common dataset.

[Screenshot: Screen Shot 2022-08-08 at 7 17 29 PM]

We've created a few short guidelines below to help users provide what we need in order to start investigating a possible problem.

How to create a Minimal, Reproducible Example

When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:

For Ultralytics to provide assistance your code should also be:

If you believe your problem meets all the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template with a minimum reproducible example to help us better understand and diagnose your problem.

Thank you! 😃

DLumi commented 2 years ago

I decided to take another look at the issue and tried to pull the recent changes to the repo. Git told me that I had local changes (!) in the dataloader and some other files. That's pretty weird, as I never modified the source code (I guess config files don't count?). Anyway, once I reset and updated the repo, everything started working again. Thanks for your help.
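A check for this kind of unexpected local change can be sketched as follows. This is a minimal sketch that assumes `git` is on the PATH and uses the `git status --porcelain` output format (two status characters, a space, then the path); the helper names and the sample output are made up, and renamed or quoted paths are not handled.

```python
import subprocess


def modified_paths(porcelain_output):
    """Extract paths from `git status --porcelain` output (two status chars, a space, the path)."""
    return [line[3:] for line in porcelain_output.splitlines() if line.strip()]


def local_changes(repo_dir="."):
    """Run git in repo_dir and list modified, staged, or untracked files."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return modified_paths(result.stdout)


# Made-up sample output; the real list of changed files was not posted in the thread.
sample = " M utils/dataloaders.py\n M utils/augmentations.py\n?? gallery.yaml\n"
print(modified_paths(sample))
# ['utils/dataloaders.py', 'utils/augmentations.py', 'gallery.yaml']
```

Calling `local_changes()` from the repository root before training would show whether any source file differs from the committed version, which is the condition that turned out to matter here.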

glenn-jocher commented 2 years ago

@DLumi great, glad to hear it! Let us know if you run into any other issues or think of any useful new features :)