ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com

Amount of data #1532

Closed flixmk closed 3 years ago

flixmk commented 3 years ago

❔Question

Hi, I am trying to train the medium network on a single-class dataset. Since the intra-class variation seems to be pretty high, I thought a lot of data might be a good idea. I was using 25k examples and the results were okay-ish, but definitely better than with less data, and the training time was fine in my opinion. Now I wanted to step things up a bit by using 50k examples to see where that leads, but my assumption that training would simply take twice as long was wrong: it runs for almost 2 hours without starting the first epoch. Is there something to be aware of?

Additional context

Command: python train.py --img 416 --batch 32 --epochs 50 --data cell.yaml --cfg yolov5m.yaml --weights ./weights/yolov5m.pt --cache --single-cls

My hardware: GPU: Nvidia P100 16GB, CPU: Intel Xeon E5-2690 v4, RAM: 112GB

After a while the CPU usage drops to 2-3% and the RAM usage is stable, though the level differs from run to run. The logs:

Scanning images: 100%|##########| 53088/53088 [03:10<00:00, 278.52it/s]
Scanning labels ..\yolov5-master\cellData\train.cache (52896 found, 0 missing, 192 empty, 0 duplicate, for 53088 images): 53088it [00:04, 11023.47it/s]
Caching images (27.6GB): 100%|##########| 53088/53088 [01:45<00:00, 501.54it/s]

The number of objects per image varies: some images have 0 and some have 10-15 (the objects are relatively small).

I also use wandb.ai to monitor the run via its graphs. GPU memory allocated is very low and not constant, GPU temperature is also pretty low, and GPU utilization is 0%.

I don't know what is wrong, since training ran without problems with half the data. Is this normal with that amount of data?

glenn-jocher commented 3 years ago

@Kraufel I noticed on a large dataset today that training was waiting a bit before each epoch, sometimes up to several minutes, with low CPU utilization. I found that the InfiniteDataLoader() class used in datasets.py may be the source of the slowdown: when I replaced it with the default torch DataLoader, the waiting time before each epoch disappeared. On smaller datasets I observed the opposite, though, with InfiniteDataLoader() producing faster training times.

You can try to swap the default torch.utils.data.DataLoader() back in to see if that helps here: https://github.com/ultralytics/yolov5/blob/c9798ae0e1023d34ecc3055097831fe1d51ca84d/utils/datasets.py#L75-L82
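
For illustration, here is a minimal, self-contained sketch (not necessarily the repo's exact code) of the difference between the two loaders: the "infinite" loader wraps the batch sampler so it repeats forever and reuses one iterator (and therefore its workers) across epochs, while the plain torch.utils.data.DataLoader tears workers down and re-creates them every epoch. ToyDataset is just a stand-in so the snippet runs on its own.

import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    # Stand-in for the real dataset class, only here to make the sketch runnable.
    def __len__(self):
        return 8

    def __getitem__(self, i):
        return torch.tensor(i)


class _RepeatSampler:
    # Wraps a batch sampler so it repeats forever, keeping workers busy across epochs.
    def __init__(self, sampler):
        self.sampler = sampler

    def __iter__(self):
        while True:
            yield from iter(self.sampler)


class InfiniteDataLoader(DataLoader):
    # DataLoader that creates its iterator once and reuses it every epoch.
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        object.__setattr__(self, 'batch_sampler', _RepeatSampler(self.batch_sampler))
        self.iterator = super().__iter__()

    def __len__(self):
        return len(self.batch_sampler.sampler)

    def __iter__(self):
        for _ in range(len(self)):
            yield next(self.iterator)


if __name__ == '__main__':
    loader_cls = DataLoader  # swap to InfiniteDataLoader to compare the two behaviours
    loader = loader_cls(ToyDataset(), batch_size=4, num_workers=0)
    for epoch in range(2):
        for batch in loader:
            print(epoch, batch.tolist())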

flixmk commented 3 years ago

@glenn-jocher I changed the dataloader in datasets.py and it also changed the output I get:

Scanning labels ..\yolov5-master\cellData\train.cache (52896 found, 0 missing, 192 empty, 0 duplicate, for 53088 images): 53088it [00:04, 11478.70it/s]
Caching images (27.6GB): 100%|##########| 53088/53088 [01:18<00:00, 678.73it/s]
Scanning labels ..\yolov5-master\cellData\val.cache (1968 found, 0 missing, 0 empty, 0 duplicate, for 1968 images): 1968it [00:00, 12595.43it/s]
Caching images (1.0GB): 100%|##########| 1968/1968 [00:02<00:00, 703.66it/s]
Analyzing anchors... anchors/target = 5.04, Best Possible Recall (BPR) = 0.9985
Image sizes 416 train, 416 test
Using 6 dataloader workers
Logging results to runs\train\exp17
Starting training for 50 epochs...

Instead of just scanning the training labels and stopping there, it now outputs everything up to the start of training. wandb.ai still shows a GPU utilization of 0%, and after a decent amount of time I now get an error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\*****\anaconda3\envs\tf-gpu\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Users\*****\anaconda3\envs\tf-gpu\lib\multiprocessing\spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input

CPU and RAM usage now fluctuate over time. What might be the reason for this behaviour?

Also, another question: can I expect each epoch to take twice as long with twice the data, or is the relationship different?

glenn-jocher commented 3 years ago

@Kraufel yes, your training time scales directly with your dataset size. There may be something going on with your dataset. If you can train COCO normally (which is a larger dataset), then everything should work fine for smaller datasets. You can try COCO with this command (dataset auto-downloads on demand): python train.py --data coco.yaml
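
As a rough back-of-the-envelope check (assuming the batch size of 32 from the command above, and ignoring the roughly constant per-image dataloading overhead), iterations per epoch scale with dataset size:

import math

batch_size = 32
for n_images in (25_000, 53_088):
    # iterations per epoch = number of batches needed to cover the dataset once
    print(n_images, 'images ->', math.ceil(n_images / batch_size), 'iterations per epoch')

So going from ~25k to ~53k images should roughly double the per-epoch time, as long as per-iteration speed stays constant.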

glenn-jocher commented 3 years ago

@Kraufel the fault may also lie in your environment. You may want to try this in a Docker container to guarantee a verified environment.

flixmk commented 3 years ago

@glenn-jocher First of all, thank you for your fast replies. Really helps out :D

I have tried the COCO dataset, and there both dataloaders give the same result as torch.utils.data.DataLoader() does for my dataset. How can my 25k-sample dataset run without problems while the 50k one doesn't, since both are augmented from the same raw data? I am using a Microsoft Azure VM, and as far as I know Docker doesn't seem to work on them. Is there another way to get the same environment on the VM? Also, with COCO the CPU is at 100% but it still can't get to the first epoch.

glenn-jocher commented 3 years ago

Docker should work in all environments, that's its entire reason for being. This repo is verified working correctly in all of the following environments. You may want to start there while you debug your own environment.

It appears you may have environment problems. Please ensure you meet all dependency requirements if you are attempting to run YOLOv5 locally. If in doubt, create a new virtual Python 3.8 environment, clone the latest repo (code changes daily), and pip install -r requirements.txt again. We also highly recommend using one of our verified environments below.

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.6. To install run:

$ pip install -r requirements.txt
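
If GPU utilization stays at 0%, a quick sanity check with standard PyTorch calls can confirm whether PyTorch sees the GPU at all (illustrative snippet, not part of the repo):

import torch

print('torch', torch.__version__)
print('CUDA available:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('device:', torch.cuda.get_device_name(0))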

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are passing. These tests evaluate proper operation of basic YOLOv5 functionality, including training (train.py), testing (test.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu.

flixmk commented 3 years ago

@glenn-jocher So I created a new environment and installed all the requirements. I ran the COCO dataset with python train.py --data coco.yaml, and labels.jpg and the train_batch jpgs are created. But then it stalls at:

     Epoch   gpu_mem       box       obj       cls     total   targets  img_size
     0/299     5.28G   0.04127   0.06162   0.01619    0.1191       160       640:   0%|                                                            | 7/7393 [00:25<7:56:41,  3.87s/it]

for quite a while, and when it eventually starts back up it would need approx. 1 hour per epoch (~1.9 it/s). I pulled the latest version of the repo, and the vanilla dataloader is now in datasets.py by default, so the stall should not be the fault of the dataloader, right? Also, what training speeds can be expected from a P100 16GB, ~100GB RAM and an Intel Xeon E5-2690 v4?

I tried with my own dataset too. It doesn't even start training (RAM is always at >90%). But since COCO works (very slowly), I don't know where the problem is. The large dataset that does not work is built from exactly the same raw data as the smaller working one, just rotated and inverted in a few other ways, so there shouldn't be any differences between my own two datasets imo.

Thanks in advance for your time (again).

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

sounakdey commented 3 years ago

I am facing the same problem of very slow epoch times, and I think it arises from --single-cls usage, because without the --single-cls parameter it's pretty fine. @Kraufel how did you manage to solve it?