ml-energy / zeus

Deep Learning Energy Measurement and Optimization
https://ml.energy/zeus
Apache License 2.0

Hugging face trainer callback integration #33

Closed parthraut closed 4 months ago

parthraut commented 4 months ago

Pull request for issue #24 : Draft of the HuggingFace GlobalPowerLimitOptimizer integration. It imports correctly locally, but still needs to be tested on a machine with GPUs.

parthraut commented 4 months ago

I simplified the class to only the methods necessary for HFGlobalPowerLimitOptimizer.

Testing of HFGlobalPowerLimitOptimizer is in test.py (I can remove it before merging). It tests 4 things:

  1. The constructor signatures of GPLO and HFGPLO are exactly the same.
  2. HFGPLO inherits from TrainerCallback.
  3. HFGPLO can be used as a HuggingFace TrainerCallback (single-GPU test).
  4. HFGPLO can be used as a HuggingFace TrainerCallback (multi-GPU test).
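The first check can be sketched with Python's `inspect.signature`. The stand-in classes below are purely illustrative (the real Zeus classes take more parameters, such as a monitor); the point is only the signature-equality pattern:

```python
import inspect

# Hypothetical stand-ins for the real Zeus classes; parameter names
# here are illustrative, not the actual API.
class GlobalPowerLimitOptimizer:
    def __init__(self, monitor, wait_steps=1, warmup_steps=10):
        self.monitor = monitor

class HFGlobalPowerLimitOptimizer(GlobalPowerLimitOptimizer):
    # Deliberately re-declares the same __init__ signature so the two
    # classes are drop-in interchangeable at construction sites.
    def __init__(self, monitor, wait_steps=1, warmup_steps=10):
        super().__init__(monitor, wait_steps, warmup_steps)

# Test 1: the constructor signatures must match exactly.
sig_gplo = inspect.signature(GlobalPowerLimitOptimizer.__init__)
sig_hfgplo = inspect.signature(HFGlobalPowerLimitOptimizer.__init__)
assert sig_gplo == sig_hfgplo
```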

The output of these tests suggests that the underlying GlobalPowerLimitOptimizer is being called correctly. A snapshot is provided below:

(screenshot of test output)

Please let me know if there is anything else I should test or add to the code. Thanks!

parthraut commented 4 months ago

Made all the requested changes. I refactored the example, borrowing it from a HuggingFace example script. I also updated the docs, but I wasn't sure whether to add it to docs/index.md as well.
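For context, the wrapper pattern this PR implements can be sketched as an adapter: a `TrainerCallback` subclass that forwards Trainer events to the wrapped optimizer. `TrainerCallback` is stubbed out below so the snippet runs without `transformers` installed, and the hook and class internals are illustrative assumptions, not Zeus's actual implementation:

```python
# Minimal stub of transformers.TrainerCallback so this sketch is
# self-contained; with HuggingFace installed, import the real class.
class TrainerCallback:
    def on_epoch_end(self, args, state, control, **kwargs): ...

class GlobalPowerLimitOptimizer:
    """Simplified stand-in: records which hooks were driven."""
    def __init__(self):
        self.calls = []
    def on_epoch_end(self):
        self.calls.append("on_epoch_end")

class HFGlobalPowerLimitOptimizer(TrainerCallback):
    """Adapter: translates HuggingFace Trainer events into calls
    on the wrapped GlobalPowerLimitOptimizer."""
    def __init__(self):
        self.optimizer = GlobalPowerLimitOptimizer()
    def on_epoch_end(self, args=None, state=None, control=None, **kwargs):
        self.optimizer.on_epoch_end()

cb = HFGlobalPowerLimitOptimizer()
cb.on_epoch_end()  # as the Trainer would invoke it each epoch
print(cb.optimizer.calls)  # → ['on_epoch_end']
```

In actual use, the callback would simply be passed to `Trainer(callbacks=[...])`, which is what makes the integration a drop-in addition to existing HuggingFace training scripts.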

parthraut commented 4 months ago

Updated the code with all suggestions. For the README.md in examples/huggingface, I wasn't sure what to do about the links to HFGlobalPowerLimitOptimizer, so I removed them.