ml-energy / zeus

Deep Learning Energy Measurement and Optimization
https://ml.energy/zeus
Apache License 2.0

`GlobalPowerLimitOptimizer` for distributed data parallel training #43

Open jaywonchung opened 3 months ago

jaywonchung commented 3 months ago

`GlobalPowerLimitOptimizer` works well for single-node data parallel training, but in the distributed data parallel case, GPUs on different nodes must end up with the same final GPU power limit choice. Assuming homogeneous GPUs, this is still very likely to happen anyway, but we should make it more robust just in case.
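
For illustration, here is a minimal sketch of one way ranks could be forced to agree on the final power limit: a hypothetical `sync_power_limit` helper, called after each rank has picked its local candidate (in milliwatts). None of these names are part of Zeus's current API, and broadcasting rank 0's choice is only one possible policy; an all-reduce with `MIN` would be a more conservative alternative.

```python
# Sketch only: synchronize a per-rank power limit choice across all ranks
# using torch.distributed. The helper name and the "broadcast rank 0's
# choice" policy are assumptions for illustration, not Zeus's API.
import torch.distributed as dist


def sync_power_limit(local_choice_mw: int) -> int:
    """Return a power limit (in mW) that every rank agrees on."""
    if not (dist.is_available() and dist.is_initialized()):
        # Non-distributed (single-process) training: nothing to reconcile.
        return local_choice_mw

    # Every rank adopts rank 0's candidate power limit.
    obj = [local_choice_mw]
    dist.broadcast_object_list(obj, src=0)
    return obj[0]
```

Since the optimal power limit is derived from profiling measurements that can differ slightly across nodes, an explicit synchronization step like this would remove any dependence on all ranks happening to converge to the same choice on their own.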