open-mmlab / mmengine

OpenMMLab Foundational Library for Training Deep Learning Models
https://mmengine.readthedocs.io/
Apache License 2.0

[Feature] Auto Batch size #1220

Open joihn opened 1 year ago

joihn commented 1 year ago

What is the feature?

When deploying training every day on a different machine with different GPUs, it's tedious to re-tune the batch size manually (trying to maximize it without the training crashing due to an out-of-memory error).

It would be cool to have an "auto batch size" feature like the one in yolov5: https://github.com/ultralytics/yolov5
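For context, yolov5's autobatch roughly works by profiling GPU memory at a few small batch sizes and extrapolating to the largest batch that fits a target fraction of memory. Below is a minimal sketch of that idea in plain PyTorch, not yolov5's actual code: `estimate_batch_size`, the trial sizes, and the assumption that reserved memory grows roughly linearly with batch size are all illustrative choices.

```python
import numpy as np
import torch

def estimate_batch_size(model, input_shape=(3, 224, 224),
                        trial_sizes=(1, 2, 4, 8), fraction=0.8):
    """Profile memory at a few batch sizes, then extrapolate linearly."""
    device = next(model.parameters()).device
    _, total = torch.cuda.mem_get_info(device)
    reserved = []
    for bs in trial_sizes:
        torch.cuda.empty_cache()
        x = torch.randn(bs, *input_shape, device=device)
        model(x).sum().backward()          # assumes the model returns a single tensor
        model.zero_grad(set_to_none=True)
        reserved.append(torch.cuda.memory_reserved(device))
    # Fit reserved ~= slope * batch_size + intercept, then invert for the budget.
    slope, intercept = np.polyfit(trial_sizes, reserved, deg=1)
    return max(1, int((total * fraction - intercept) / slope))
```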

I could implement it myself if someone gives me architectural advice on where it would best be implemented.

Any other context?

No response

zhouzaida commented 1 year ago

Hi @joihn , thanks for your suggestion. We can use this issue to discuss the expected usage of auto batch size and how to implement it.

Could you first share your ideas about them?


LALBJ commented 1 year ago

Hi @zhouzaida , I'm participating in the OpenMMLab Code Camp task. Currently I have a rough implementation idea, following toma's approach (a minimal sketch follows the list):

  1. Wrap the training function with an auto batch size decorator.
  2. Catch out-of-memory (OOM) errors inside the decorator.
  3. When an error is caught, reduce the batch size in the dataloader and retry.
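Here is a hand-rolled sketch of that flow (not toma's actual API): a decorator that retries the training function with a halved batch size whenever CUDA runs out of memory. The `batch_size` keyword contract is an assumption; in mmengine the retry would instead need to rebuild the dataloader.

```python
import functools
import torch

def auto_batch_size(initial_batch_size):
    """Retry the wrapped training function with a halved batch size on OOM."""
    def decorator(train_fn):
        @functools.wraps(train_fn)
        def wrapper(*args, **kwargs):
            batch_size = initial_batch_size
            while batch_size >= 1:
                try:
                    return train_fn(*args, batch_size=batch_size, **kwargs)
                except RuntimeError as e:
                    if 'out of memory' not in str(e):
                        raise                  # re-raise anything that is not an OOM
                    torch.cuda.empty_cache()   # release cached blocks before retrying
                    batch_size //= 2
            raise RuntimeError('Even batch_size=1 does not fit in GPU memory.')
        return wrapper
    return decorator
```

A function decorated with `@auto_batch_size(initial_batch_size=256)` would then be tried at 256, 128, 64, ... until one pass succeeds, which matches step 3 above.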

However, I have two questions regarding how to implement this functionality in mmengine:

  1. How should we indicate the use of auto batch size? Should we wrap the training function with a decorator, or declare it by adding a configuration option (see the hypothetical config sketch after this list)?
  2. If a decorator wraps the training function, where would be the appropriate place to apply it?
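To make question 1 concrete, a configuration option might look like the snippet below. The `auto_batch_size` key and its fields do not exist in mmengine today; they are shown only as a strawman, loosely modeled on how options such as `auto_scale_lr` are declared in downstream configs.

```python
# Hypothetical: neither `auto_batch_size` nor its fields exist in mmengine yet.
train_dataloader = dict(
    batch_size=32,                    # starting point; shrunk on OOM
    num_workers=4,
    dataset=dict(type='MyDataset'),   # placeholder dataset config
)
auto_batch_size = dict(
    enable=True,
    max_trials=5,                     # hypothetical: how many halvings to attempt
)
```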

These are my current questions. If there are any misunderstandings in my description of the task, please feel free to point them out.