glitchyordis opened this issue 1 month ago
@glitchyordis hi there,
Thank you for your question and for providing detailed information about your setup. When performing hyperparameter evolution with multiple GPUs, there are a few key differences and considerations compared to standard multi-GPU training:
Resource Allocation: Hyperparameter evolution involves running multiple training experiments in parallel, each with different hyperparameters. This can lead to higher overall resource usage, which might explain why you can only specify a smaller batch size compared to standard training.
Batch Size Distribution: In multi-GPU training, the total batch size is divided evenly across the GPUs. For hyperparameter evolution, each experiment might be treated as a separate training run, which could limit the batch size you can allocate per experiment.
Speed and Efficiency: While hyperparameter evolution can speed up the process of finding optimal hyperparameters by running multiple experiments simultaneously, it can also be more resource-intensive. The speed-up comes from parallelizing the search process, but it doesn't necessarily mean each individual training run will be faster.
To optimize your hyperparameter evolution on multiple GPUs, you might consider the following:
Adjust Batch Size: Ensure that the batch size you specify is a multiple of the number of GPUs and is manageable within the memory constraints of your setup.
Experiment with Fewer Parameters: Start with a smaller set of hyperparameters to evolve and gradually increase as you monitor resource usage.
Use DistributedDataParallel (DDP): Ensure you are using the recommended DistributedDataParallel mode for better efficiency. Here's an example command for running hyperparameter evolution with DDP:
python -m torch.distributed.run --nproc_per_node 3 train.py --batch 57 --data coco.yaml --weights yolov5m6.pt --device 0,1,2 --evolve
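To illustrate the even split mentioned above, here is a minimal sketch (not YOLOv5's internal code) of how DDP divides the total `--batch` across processes; with `--batch 57` on 3 GPUs, each process handles 19 images per step:

```python
def per_gpu_batch(total_batch: int, n_gpus: int) -> int:
    """Return the per-process batch size when a total batch is split evenly by DDP."""
    # YOLOv5 requires --batch to be a multiple of the number of GPUs in DDP mode.
    assert total_batch % n_gpus == 0, "--batch must be a multiple of the GPU count"
    return total_batch // n_gpus

print(per_gpu_batch(57, 3))  # each of the 3 GPUs processes 19 images per step
```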
For more detailed guidance, you can refer to the Multi-GPU Training documentation.
If you encounter any issues, please ensure you are using the latest versions of the packages and feel free to share more details so we can assist you further.
Best of luck with your hyperparameter evolution!
@glenn-jocher , thanks for the response. If I have trained a model using standard training, what is the difference between running hyperparameter evolution with the base .pt weights file from the YOLOv5 repository and the newly trained model?
Hi @glitchyordis,
Great question! The difference between running hyperparameter evolution using the base .pt weights from the YOLOv5 repository versus your newly trained model lies primarily in the starting point of the optimization process. The base weights begin evolution from a generic pretrained checkpoint, while your newly trained model begins from weights already adapted to your dataset. If you decide to use your newly trained model for hyperparameter evolution, you can specify the path to your custom weights file like this:
python train.py --epochs 10 --data coco128.yaml --weights path/to/your_custom_model.pt --cache --evolve
This approach ensures that the evolution process builds upon the specific strengths of your custom-trained model.
Feel free to experiment with both approaches and compare the results to see which one yields the best performance for your specific use case. If you encounter any issues or have further questions, don't hesitate to reach out. Happy evolving!
Hi @glenn-jocher, I'm running a hyperparameter evolution (see code below). I've enabled Wandb tracking and I expected the runs to end once they reach 300 generations. However, I can observe that the hyperparameter evolution is still running.
Wandb runs tracking:
evolve directory screenshot:
exp_number=0
project_dir="runs/evolve/label_evo"
python train.py --project "${project_dir}" --name "exp_${exp_number}" --epochs 20 --data 2024-05-17-0719.yaml --img 960 --batch-size 30 --weights yolov5m6.pt --cache ram --device 0 --evolve 300
Hi @glitchyordis,
Thank you for your question. It appears that the hyperparameter evolution is still running beyond the expected 300 generations. Please ensure you are using the latest version of YOLOv5. If the issue persists, you might want to check the evolve.csv file for any anomalies or interruptions in the logging process. If you need further assistance, feel free to provide additional details.
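As a quick sanity check, you can count how many generations have actually completed: each finished generation appends one data row to evolve.csv. The sketch below is not YOLOv5 code, and the path is inferred from the `--project`/`--name` flags in your command, so adjust it for your setup:

```python
import csv
from pathlib import Path

def count_generations(evolve_csv: str) -> int:
    """Count completed evolution generations: one data row per generation."""
    path = Path(evolve_csv)
    if not path.exists():
        return 0
    with path.open() as f:
        rows = list(csv.reader(f))
    # The first row is the header of hyperparameter/fitness columns.
    return max(len(rows) - 1, 0)

# Path assumed from the --project/--name flags used above.
print(count_generations("runs/evolve/label_evo/exp_0/evolve.csv"))
```

If the count is already at or above 300, the extra Wandb runs may just be the final training runs being logged rather than new generations.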
Question
I've performed normal training with multiple GPUs using the m6 model on 3 GPUs, and I could fit a batch size of 57. However, when I tried the multi-GPU code for hyperparameter evolution, I could only specify a batch size of about 18.
How does multi-GPU hyperparameter evolution differ from a single-GPU run? Does it speed up the process?