ml-energy / zeus

Deep Learning Energy Measurement and Optimization
https://ml.energy/zeus
Apache License 2.0

`OperationProfiler` and `PerseusOptimizer` server and client #21

Open jaywonchung opened 9 months ago

jaywonchung commented 9 months ago

Perseus is an energy scheduler for large model training (although we're looking into applying this for large model inference, too).

To schedule energy with lowtime, Perseus requires time and energy consumption profiling results for each forward and backward computation in each pipeline stage. That's what `OperationProfiler` will do.
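To make the profiling requirement above concrete, here is a rough sketch of what per-stage, per-computation measurement could look like. Everything in it is hypothetical: `OperationProfilerSketch`, `Measurement`, and the injected `read_energy_j` callable are illustrative names, not the actual Zeus API, and a real implementation would need to synchronize the GPU before each reading.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Measurement:
    time_s: float
    energy_j: float


class OperationProfilerSketch:
    """Hypothetical stand-in for illustration; not the real OperationProfiler."""

    def __init__(
        self,
        read_energy_j: Callable[[], float],  # assumed: cumulative energy counter in joules
        clock: Callable[[], float] = time.perf_counter,
    ) -> None:
        self._read_energy_j = read_energy_j
        self._clock = clock
        self.records = defaultdict(list)

    @contextmanager
    def measure(self, stage: int, op: str):
        # On a real GPU, synchronize the device here so that pending kernels
        # are not attributed to the wrong computation.
        t0, e0 = self._clock(), self._read_energy_j()
        try:
            yield
        finally:
            t1, e1 = self._clock(), self._read_energy_j()
            self.records[(stage, op)].append(Measurement(t1 - t0, e1 - e0))

    def summary(self) -> Dict[Tuple[int, str], Measurement]:
        # Average time and energy per (pipeline stage, computation type).
        out = {}
        for key, ms in self.records.items():
            n = len(ms)
            out[key] = Measurement(
                sum(m.time_s for m in ms) / n,
                sum(m.energy_j for m in ms) / n,
            )
        return out
```

In a training loop, each forward or backward call would be wrapped as `with profiler.measure(stage, "forward"): ...`, and `summary()` would yield the kind of table lowtime presumably consumes.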

For now, the `PerseusOptimizer` server will receive a Python file that lists GPU frequencies (produced by lowtime) and instruct the `PerseusOptimizer` client (integrated into the user's training framework) to change GPU frequencies. The server-client split keeps Perseus agnostic to the training framework. Without it, energy scheduling (the "policy", which requires a holistic view of all computations that happen across all ranks) ends up coupled with the method of realizing the energy schedule in a distributed fashion (the "mechanism").
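The policy/mechanism split described above could be sketched roughly as follows. This is an in-process illustration only, with no networking; `ScheduleServerSketch`, `ScheduleClientSketch`, the per-rank list-of-MHz schedule format, and the `set_frequency_mhz` callback are all assumptions made for the example, not the planned design.

```python
from typing import Callable, Dict, List


class ScheduleServerSketch:
    """Policy side (hypothetical): holds the full schedule for all ranks."""

    def __init__(self, schedule: Dict[int, List[int]]) -> None:
        # Assumed format: rank -> GPU frequency (MHz) for each computation
        # in one training iteration, in execution order.
        self._schedule = schedule

    def get_frequencies(self, rank: int) -> List[int]:
        return list(self._schedule[rank])


class ScheduleClientSketch:
    """Mechanism side (hypothetical): lives inside the training framework."""

    def __init__(
        self,
        rank: int,
        server: ScheduleServerSketch,
        set_frequency_mhz: Callable[[int], None],  # assumed hook that locks the GPU clock
    ) -> None:
        self._freqs = server.get_frequencies(rank)
        self._set_frequency_mhz = set_frequency_mhz
        self._step = 0

    def on_computation_start(self) -> int:
        # Apply the frequency planned for the next computation; the schedule
        # repeats every iteration, so wrap around at the end.
        freq = self._freqs[self._step % len(self._freqs)]
        self._set_frequency_mhz(freq)
        self._step += 1
        return freq
```

The training framework only needs to call `on_computation_start()` before each forward or backward computation; it never sees how the schedule was computed, which is the decoupling the split is meant to provide.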