ultralytics / yolov5

YOLOv5 πŸš€ in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Dataloader shuffling bug with multiple workers #4960

Closed · kimbayashanologatiniburgettoilalalourvr closed this issue 2 years ago

kimbayashanologatiniburgettoilalalourvr commented 3 years ago

Hi there,

I wanted to report, and also solicit some thoughts on, a dataloader bug that occurs when spinning up more than one worker. Because the parent Python process is forked, all of the PRNG state is copied into every worker, so different workers can end up drawing the same "random" numbers. Having multiple workers is almost a requirement for training at a decent pace, since the data-loading work gets spread across multiple CPU cores.

Additionally, if the dataloader workers are restarted at every epoch, the issue is made worse: the generators are all re-seeded from the same state again, potentially the same seed/state as in the previous epoch, if the parent process makes no random calls before forking again.

This issue is detailed further in https://tanelp.github.io/posts/a-bug-that-plagues-thousands-of-open-source-ml-projects/. A fix would be to pass a worker_init_fn at the code line linked below. PyTorch Lightning has a good worker init function that you may reference if interested, licensing questions aside. There is also some work in the original thread that led to the PL implementation, but I've found the PL implementation easier to understand and use, since it codes around some of the in-practice 'gotchas' encountered along the way.

https://github.com/ultralytics/yolov5/blob/2993c3fa7af7a76dd82349e3cf85e35e4254576b/utils/datasets.py#L120
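
For concreteness, a minimal worker_init_fn along these lines (a generic sketch of the usual per-worker re-seeding recipe, not the Lightning implementation) could look like this:

```python
import random

import numpy as np
import torch


def seed_worker(worker_id):
    # torch.initial_seed() is already unique per DataLoader worker
    # (base_seed + worker_id), so derive the NumPy/random seeds from it.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


# Hypothetical usage:
# loader = torch.utils.data.DataLoader(dataset, num_workers=8, worker_init_fn=seed_worker)
```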

This issue gets worse the more workers someone has open: the augmentations become unintentionally correlated, with the same "random" values repeated across workers. It can also be particularly difficult to debug, since runs of consecutive batches are not something that gets visualized in TensorBoard. It is not as bad as having shuffling off by default, but it can still cause issues.
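
As a toy illustration of the failure mode (hypothetical dataset, not YOLOv5 code), the following sketch typically prints each np.random value repeated once per worker when workers are forked without a worker_init_fn:

```python
import numpy as np
from torch.utils.data import DataLoader, Dataset


class RandItem(Dataset):
    """One np.random draw per item, standing in for an augmentation parameter."""

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return np.random.random()


if __name__ == "__main__":
    loader = DataLoader(RandItem(), batch_size=1, num_workers=4)
    print([round(x.item(), 4) for x in loader])
    # With forked workers sharing copied NumPy state, the first draw from each
    # of the 4 workers is identical, so values show up in groups of repeats.
```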

glenn-jocher commented 3 years ago

@kimbayashanologatiniburgettoilalalourvr hi, thank you for your feature suggestion on how to improve YOLOv5 πŸš€!

The fastest and easiest way to incorporate your ideas into the official codebase is to submit a Pull Request (PR) implementing your idea, and if applicable providing before and after profiling/inference/training results to help us understand the improvement your feature provides. This allows us to directly see the changes in the code and to understand how they affect workflows and performance.

Please see our βœ… Contributing Guide to get started.

To investigate, I plotted training batch 0 for epochs 0, 1 and 2. I don't see anything that would indicate a problem.

[Images: train_batch0_epoch0, train_batch0_epoch1, train_batch0_epoch2 (batch 0 plotted at epochs 0, 1 and 2)]

kimbayashanologatiniburgettoilalalourvr commented 3 years ago

Hi Glenn,

Many thanks for taking a look into this.

Would you share the config you used to produce these images? If I have time, I'll try to reproduce on my end and check whether the reported bug shows up.

Best,

kimbaya&etc

kimbayashanologatiniburgettoilalalourvr commented 3 years ago

Alrighty,

Taking a closer look through the code, I think there may be two things going on here. One is that random.random is used far more than np.random, which I think is a good thing here, since my understanding is that the state of the standard-library generator is not affected in the same way; only np.random is causing issues.

The other is that the prefetch factor defaults to 2 batches per worker, so some batches may get prefetched out of order, or race with one another.

Since the HSV augmentation not being entirely random is a relatively minor effect compared to everything else going on, it may not be worth a deep investigation on its own. On our end, though, given the potential for future bugs should someone add an np.random-style call to our code, we'll probably keep a good worker init around just to shield against that.
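
As a rough sketch of what keeping that safeguard around might look like (generic PyTorch usage with illustrative names, not the YOLOv5 dataloader), combining a per-worker re-seed with a seeded generator so the shuffle order itself stays reproducible:

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset


def seed_worker(worker_id):
    # Re-seed NumPy/random in each worker from its unique torch seed.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


g = torch.Generator()
g.manual_seed(0)  # fixes the shuffle order across runs

dataset = TensorDataset(torch.arange(64, dtype=torch.float32))  # stand-in dataset
loader = DataLoader(
    dataset,
    batch_size=16,
    shuffle=True,
    num_workers=4,
    worker_init_fn=seed_worker,  # shields against future np.random calls in workers
    generator=g,
)
```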

glenn-jocher commented 3 years ago

@kimbayashanologatiniburgettoilalalourvr hi there! Pinging you to see if you had come up with a solution to this or if it was still an issue.

I think we'll want to implement shuffle in master under some conditions. In particular, we are looking at adding an option to update the dataloader/augmentation during late training in #5046, and this will run into the same unshuffled-data issue you noticed, which will be more pronounced if mosaic is reduced or eliminated.

github-actions[bot] commented 2 years ago

πŸ‘‹ Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 πŸš€ and Vision AI ⭐!

glenn-jocher commented 2 years ago

@kimbayashanologatiniburgettoilalalourvr good news πŸ˜ƒ! Your original issue may now be fixed βœ… in PR #5623 by @werner-duvaud. This PR turns on shuffling in the YOLOv5 training DataLoader by default, which was missing until now. This works for all training formats: CPU, Single-GPU, Multi-GPU DDP.

train_loader, dataset = create_dataloader(train_path, imgsz, batch_size // WORLD_SIZE, gs, single_cls,
                                          hyp=hyp, augment=True, cache=opt.cache, rect=opt.rect, rank=LOCAL_RANK,
                                          workers=workers, image_weights=opt.image_weights, quad=opt.quad,
                                          prefix=colorstr('train: '), shuffle=True)  # <--- NEW

I evaluated this PR against master on VOC finetuning for 50 epochs, and the results show a slight improvement in most metrics and losses, particularly in objectness loss and mAP@0.5, perhaps indicating that the shuffle addition may help delay overtraining.

https://wandb.ai/glenn-jocher/VOC

[Screenshot 2021-11-13: VOC finetuning results comparison]

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 πŸš€!