ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

maximum number of workers for dataloader #715

Closed Ownmarc closed 3 years ago

Ownmarc commented 4 years ago

🚀 Feature

A way to specify the maximum number of workers for the dataloader, currently

nw = min([os.cpu_count() // world_size, batch_size if batch_size > 1 else 0, 8])

would become something like

nw = min([os.cpu_count() // world_size, batch_size if batch_size > 1 else 0, max_worker])

where max_worker is an argument we can give when starting a training

Motivation

I started 2 different trainings on 2 GPUs (on Windows) and got errors because both trainings created dataloaders with 8 workers each (I have 8 cores), and then one of the trainings failed with multithreading errors! To work around it, I went into the create_dataloader() function and hardcoded the number of workers I wanted, but I think it would be great if this could be an argument, just like selecting the device, etc.

Pitch

--workers n (n = maximum number of workers for dataloader)

I can code it and open a PR if you agree to adding this.
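A rough sketch of what this could look like; the --workers flag name and the way it is threaded into create_dataloader() are just my proposal, not existing code:

# train.py (sketch): expose a --workers flag and pass it down to the dataloader
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--workers', type=int, default=8, help='maximum number of dataloader workers')
opt = parser.parse_args()

def create_dataloader(batch_size, world_size=1, max_workers=8):
    # cap workers at the per-process CPU share, the batch size, and the user-supplied maximum
    nw = min([os.cpu_count() // world_size, batch_size if batch_size > 1 else 0, max_workers])
    return nw  # the real function would build the Dataset/DataLoader here and pass num_workers=nw

nw = create_dataloader(batch_size=16, world_size=1, max_workers=opt.workers)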

github-actions[bot] commented 4 years ago

Hello @Ownmarc, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Jupyter Notebook (Open in Colab), Docker Image, and Google Cloud Quickstart Guide for example environments.

If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.

If this is a custom model or data training question, please note Ultralytics does not provide free personal support. As a leader in vision ML and AI, we do offer professional consulting, from simple expert advice up to delivery of fully customized, end-to-end production solutions for our clients.

For more information please visit https://www.ultralytics.com.

glenn-jocher commented 4 years ago

@Ownmarc yes, this might be a good idea to add to train.py. So the problem went away after you reduced the worker count to half or less of your total CPUs while running two trainings?

Hmm, it seems like workers are assigned well to idle CPUs then. I wonder if there is a way to determine the idle CPU count programmatically to automate this? But yes, your idea sounds like a good addition.
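One way the idle CPU count could be estimated programmatically is with psutil, as in the sketch below; this is only an illustration of the idea (psutil is not a yolov5 dependency), not anything in the repo:

# sketch: count logical CPUs that are currently mostly idle, using psutil
import os
import psutil

def idle_cpu_count(idle_threshold=25.0, sample_seconds=1.0):
    """Count logical CPUs whose utilization over the sample window is below idle_threshold percent."""
    per_cpu = psutil.cpu_percent(interval=sample_seconds, percpu=True)
    return sum(1 for load in per_cpu if load < idle_threshold)

# cap dataloader workers at the idle CPUs rather than all CPUs (8 is the current upper bound)
nw = min(idle_cpu_count(), os.cpu_count(), 8)
print(f'using {nw} dataloader workers')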

NanoCode012 commented 4 years ago

Hi! Since we're on this topic, there was an interesting thread on the PyTorch forums about choosing the number of workers for dataloaders.

https://discuss.pytorch.org/t/guidelines-for-assigning-num-workers-to-dataloader/813/3

For the code above, we used os.cpu_count() // world_size as a precaution to keep a 1-worker-to-1-CPU ratio, but we do not know if that's the best option. Some report n_gpu * 4. Some use 0. What do you think?
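Just to make the comparison concrete, here is a small sketch of the three heuristics side by side (the numbers are made up, only meant to show how far apart the options can land):

# sketch: comparing the worker-count heuristics mentioned above
import os

cpus = os.cpu_count()       # e.g. 8 on the machine from this issue
world_size = 1              # number of DDP processes
n_gpu = 1
batch_size = 16

current = min([cpus // world_size, batch_size if batch_size > 1 else 0, 8])  # current yolov5 formula
forum_suggestion = n_gpu * 4   # "n_gpu * 4" from the PyTorch forum thread
main_process_only = 0          # num_workers=0: data is loaded in the main process

print(current, forum_suggestion, main_process_only)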

glenn-jocher commented 4 years ago

@NanoCode012 in my experiments on GCP I found that 8 workers out of 12 CPUs was faster than 8/8 or 12/12, so it appears that leaving some CPUs free for other tasks (rather than using 100% of them for dataloading) helps speed up training. I also saw little benefit to using more than 10 workers.

I saw the same results on a 2080 Ti box, where leaving CPUs free instead of using them all resulted in faster training. I also observed that you should never assign more workers than CPUs; that results in very slow training.
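If anyone wants to reproduce this kind of comparison, a quick way is to time pure dataloading at a few worker counts; the sketch below uses a synthetic dataset instead of the yolov5 one, so treat it only as a skeleton:

# sketch: time dataloading throughput for a few num_workers settings
import time
import torch
from torch.utils.data import DataLoader, Dataset

class FakeImages(Dataset):
    """Synthetic dataset whose __getitem__ does a little CPU work, standing in for image decode/augment."""
    def __len__(self):
        return 512

    def __getitem__(self, idx):
        img = torch.randn(3, 320, 320)
        return (img * 2 + 1).clamp_(0, 1), 0  # some per-sample CPU work, plus a dummy label

if __name__ == '__main__':  # guard is required on Windows when num_workers > 0
    for num_workers in (0, 2, 4, 8):
        loader = DataLoader(FakeImages(), batch_size=16, num_workers=num_workers)
        t0 = time.time()
        for _ in loader:
            pass  # iterate only, no GPU work, to isolate loading cost
        print(f'num_workers={num_workers}: {time.time() - t0:.1f}s')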

Ownmarc commented 4 years ago

Nothing I have read about how to select the number of workers was really convincing. We should probably run our own tests to determine a good general formula, but it could be different for everybody. Here I needed to lower it to prevent my trainings from crashing. I do not think 0 is good, since that leaves only the main process doing the loading.

Since each worker is responsible for its own batch, I think n_gpu * 2 would be enough if the time it takes to load 1 batch is <= the time it takes the GPU to process 1 batch; this way, there is always 1 batch queued, waiting for the GPU to be ready to take the next batch.

If I had to code a formula, it would be something like n_workers = (ceil(time_to_load_1_batch / time_to_process_1_batch) + 1) * n_gpu

If we can determine time_to_load_1_batch, it would factor in most of the difficult things to take into account, like the speed of the user's CPU. That's how I would go about automating it; maybe other people have other insights!
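In code it would be something like the sketch below; time_to_load_1_batch and time_to_process_1_batch would have to be measured, e.g. by timing a few warm-up batches, so the values here are placeholders:

# sketch: worker count from measured batch-load time vs batch-process time
import math

def suggested_workers(time_to_load_1_batch, time_to_process_1_batch, n_gpu):
    """(ceil(load / process) + 1) workers per GPU keeps one batch queued ahead of each GPU."""
    return (math.ceil(time_to_load_1_batch / time_to_process_1_batch) + 1) * n_gpu

# e.g. loading a batch takes 0.30 s, the GPU processes one in 0.12 s, 2 GPUs:
print(suggested_workers(0.30, 0.12, n_gpu=2))  # (ceil(2.5) + 1) * 2 = 8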

glenn-jocher commented 4 years ago

@Ownmarc yeah, that sounds about right. The current implementation is based on work I was doing on an 80-CPU box with 8 GPUs. It allows up to 8 workers per GPU, because I just didn't see any gain beyond that, and it also leaves a few CPUs free per GPU in addition to the 8 workers, because I observed that sped up training. So if I had to come up with a formula, it would be: 80% of CPUs assigned as workers, the remaining 20% free, with a maximum of 8 workers per GPU.
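Written out, that heuristic would look roughly like this (a sketch, not the actual train.py code):

# sketch: ~80% of CPUs shared across GPUs as workers, capped at 8 workers per GPU
import os
import torch

def workers_per_gpu(max_per_gpu=8, cpu_fraction=0.8):
    n_gpu = max(torch.cuda.device_count(), 1)
    cpus_for_loading = int(os.cpu_count() * cpu_fraction)  # leave ~20% of CPUs free for other tasks
    return max(1, min(cpus_for_loading // n_gpu, max_per_gpu))

print(workers_per_gpu())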

github-actions[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.