ultralytics / yolov5

YOLOv5 πŸš€ in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com

Multiple GPU Hyperparameter evolution #13254

Open glitchyordis opened 1 month ago

glitchyordis commented 1 month ago

Question

I've run standard multi-GPU training with the m6 model on 3 GPUs and could fit a batch_size of 57. However, when I tried the multi-GPU code for hyperparameter evolution, I could only specify a batch size of about 18.

How does multi-GPU hyperparameter evolution differ from a single-GPU run? Does it speed up the process?

glenn-jocher commented 1 month ago

@glitchyordis hi there,

Thank you for your question and for providing detailed information about your setup. When performing hyperparameter evolution with multiple GPUs, there are a few key differences and considerations compared to standard multi-GPU training:

  1. Resource Allocation: Hyperparameter evolution involves running multiple training experiments in parallel, each with different hyperparameters. This can lead to higher overall resource usage, which might explain why you can only specify a smaller batch size compared to standard training.

  2. Batch Size Distribution: In multi-GPU training, the total batch size is divided evenly across the GPUs. For hyperparameter evolution, each experiment might be treated as a separate training run, which could limit the batch size you can allocate per experiment.

  3. Speed and Efficiency: While hyperparameter evolution can speed up the process of finding optimal hyperparameters by running multiple experiments simultaneously, it can also be more resource-intensive. The speed-up comes from parallelizing the search process, but it doesn't necessarily mean each individual training run will be faster.

To optimize hyperparameter evolution on multiple GPUs, the usual approach is to launch a separate evolution process on each GPU rather than sharing one run across devices (see the sketch below). For more detailed guidance, you can refer to the Multi-GPU Training documentation and the Hyperparameter Evolution tutorial.
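
As a minimal sketch of that parallel pattern (the epochs, dataset, and weights below are illustrative, and it assumes the processes cooperate by appending to a shared evolve.csv in the default runs/evolve/exp directory, as described in the Hyperparameter Evolution tutorial):

# One evolution process per GPU (devices 0-2); start times are staggered so the
# processes do not collide when they first read/write the shared evolve.csv.
for i in 0 1 2; do
  sleep $(expr 30 \* $i) &&  # stagger start times
  nohup python train.py --epochs 10 --data coco128.yaml --weights yolov5m6.pt \
    --cache --device $i --evolve > evolve_gpu_$i.log &
done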

If you encounter any issues, please ensure you are using the latest versions of the packages and feel free to share more details so we can assist you further.

Best of luck with your hyperparameter evolution! πŸš€

glitchyordis commented 1 month ago

@glenn-jocher , thanks for the response. If I have trained a model using standard training, what is the difference between running hyperparameter evolution with the base .pt weights file from the YOLOv5 repository and the newly trained model?

glenn-jocher commented 1 month ago

Hi @glitchyordis,

Great question! The difference between running hyperparameter evolution using the base .pt weights from the YOLOv5 repository versus your newly trained model lies primarily in the starting point of the optimization process.

Base .pt Weights from YOLOv5 Repository

Starting from the stock pretrained weights (e.g. yolov5m6.pt) means the evolution searches for hyperparameters from a general-purpose, COCO-pretrained starting point.

Newly Trained Model

Starting from your own checkpoint means each evolution generation begins from weights already adapted to your dataset, so the search is tuned around that specific starting point.

Practical Considerations

Either starting point is valid; whichever you choose, validate the evolved hyperparameters with a full training run on your data afterwards.

Example Command

If you decide to use your newly trained model for hyperparameter evolution, you can specify the path to your custom weights file like this:

python train.py --epochs 10 --data coco128.yaml --weights path/to/your_custom_model.pt --cache --evolve

This approach ensures that the evolution process builds upon the specific strengths of your custom-trained model.
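
As a rough follow-up sketch (assuming the evolved hyperparameters are written to hyp_evolve.yaml in the evolve run directory, and with illustrative paths and epoch count), you would then train a final model with the evolved values via --hyp:

# After evolution finishes, train with the evolved hyperparameters.
# The hyp_evolve.yaml path shown is the default location and may differ in your setup.
python train.py --epochs 300 --data coco128.yaml --weights path/to/your_custom_model.pt \
  --hyp runs/evolve/exp/hyp_evolve.yaml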

Feel free to experiment with both approaches and compare the results to see which one yields the best performance for your specific use case. If you encounter any issues or have further questions, don't hesitate to reach out. Happy evolving! πŸš€

glitchyordis commented 3 weeks ago

Hi @glenn-jocher , I'm running a hyperparameter evolution (see the command below). I've enabled Wandb tracking and expected the run to stop once it reaches 300 generations. However, the hyperparameter evolution is still running.

[Screenshot: Wandb runs tracking]

[Screenshot: evolve directory]

exp_number=0
project_dir="runs/evolve/label_evo"
python train.py --project "${project_dir}" --name "exp_${exp_number}"  --epochs 20 --data 2024-05-17-0719.yaml --img 960 --batch-size 30 --weights yolov5m6.pt --cache ram --device 0 --evolve 300

glenn-jocher commented 3 weeks ago

Hi @glitchyordis,

Thank you for your question. With --evolve 300, train.py runs 300 generations, and each generation is a full training run of --epochs 20, so the overall process can take a long time; each completed generation appends one row to evolve.csv. Please ensure you are using the latest version of YOLOv5. If the run has genuinely gone past 300 generations, check the evolve.csv file for anomalies or interruptions in the logging process, since its row count (excluding the header) should match the number of completed generations. If you need further assistance, feel free to provide additional details.
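
As a quick check, here is a minimal sketch (assuming evolve.csv is written under the --project/--name directory from your command and starts with a single header row) for counting how many generations have actually completed:

# Count completed generations: one row is appended to evolve.csv per finished generation.
# The path assumes the --project/--name values from the command above.
tail -n +2 runs/evolve/label_evo/exp_0/evolve.csv | wc -l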