
[RFC] Tracking and optimizing GPU energy consumption #25782

Closed jaywonchung closed 8 months ago

jaywonchung commented 10 months ago

This RFC suggests tracking, and eventually optimizing, the GPU energy consumption of model training in Transformers.

To do this, I believe that integrating Zeus (homepage, repository) with Hugging Face is a good idea.

Disclaimer: I am the maintainer of the Zeus library.

Motivation

Energy measurement and optimization

Deep Learning consumes a lot of energy and thus emits a lot of greenhouse gas. Optimizing the energy consumption/carbon emission of deep learning promotes sustainability and, depending on the user, yields financial benefits by reducing electricity bills and/or carbon offsetting costs.

The goal of tracking energy consumption or carbon emissions is twofold: to raise awareness, and to facilitate optimization. Both purposes require accurate and objective measurements. This is especially true for optimization: people need to see what happens to their objective when they tweak parameters, which is very difficult if the objective cannot be measured concretely.

Current state

Hugging Face supports reporting carbon equivalent emissions for the trained model on the Hub with an optional co2_eq_emissions entry in model cards. Today, about 0.6% of the models on Hugging Face Hub have the Carbon Emissions label, which I assume are the model cards that have CO2eq emissions reported. This was also pointed out by a recent study in an academic context -- "... a stalled proportion of carbon emissions-reporting models, ...". So this isn't working ideally at the moment.

Hugging Face tracks carbon emissions via codecarbon, but I believe this integration has a couple of issues.

Proposal

First, I would like to make clear that I'm not arguing that we should remove or replace codecarbon. Rather, I am suggesting that we also report GPU energy consumption, which yields objective and consistent measurements (regardless of the user's geographical location or the time of day) and offers better potential for optimization (because it is a direct measurement, not an estimation), via a software framework designed for it (Zeus).
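
For concreteness, measuring a training run's GPU energy with Zeus would look something like the sketch below. The begin_window/end_window names follow the Zeus documentation; treat the exact attribute names as my assumption rather than a settled interface.

from zeus.monitor import ZeusMonitor

# Measure GPUs 0-3; a named window brackets the code region to measure.
monitor = ZeusMonitor(gpu_indices=[0, 1, 2, 3])

monitor.begin_window("training")
# ... run training here ...
measurement = monitor.end_window("training")

print(f"Energy: {measurement.total_energy} J over {measurement.time} s")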

Reducing energy consumption always reduces operational carbon emissions. Also, with a concrete energy measurement in model cards, people can always reconstruct the carbon emission by multiplying it by the average carbon intensity of the geographical location and time period in which training took place. In the future, as free and more accurate real-time carbon intensity data becomes available, carbon estimates can even be retroactively improved based on the recorded energy consumption.
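
As a minimal worked example of that reconstruction (all numbers below are made up for illustration):

# Energy as it might be reported in a model card, in joules.
energy_joules = 4.32e8
energy_kwh = energy_joules / 3.6e6  # 1 kWh = 3.6e6 J, so 120.0 kWh

# Illustrative average carbon intensity for the location and time period.
carbon_intensity = 412.0  # gCO2eq/kWh

co2_eq_grams = energy_kwh * carbon_intensity  # 49,440 g, i.e., ~49.4 kgCO2eq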

Integration considerations

Tracking energy consumption is a cross-cutting concern. Below is a non-exhaustive list of considerations, along with my opinions.

Implementation considerations

Policy considerations

Optimizing energy consumption

While this may not be an immediate next milestone, energy optimization is the core goal of integrating Zeus with Hugging Face Transformers.

Zeus currently offers two optimization methods that find the optimal GPU power limit $p$ during training:

$$\min_{p \in \mathcal{P}} \quad \eta \cdot \mathrm{Energy} + (1 - \eta) \cdot \mathrm{TDP} \cdot \mathrm{Time}$$

and

$$
\begin{align}
\min_{p \in \mathcal{P}} & \quad \mathrm{Energy} \\
\text{s.t.} & \quad \mathrm{Slowdown} \le s
\end{align}
$$

where the user chooses $0 \le \eta \le 1$ (the relative importance of time versus energy) or $s \ge 1$ (the maximum tolerable slowdown ratio), and $\mathrm{TDP}$ is the maximum power consumption of the GPU. For instance, given $s = 1.1$, the second method finds the power limit that consumes the least energy while keeping training iteration time within 110% of the original iteration time.
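
To make the first objective concrete, here is a toy sketch of the minimization it describes. The candidate power limits and the per-limit energy/time numbers are made up; in reality, Zeus profiles them online during training.

# Hypothetical profiling results: power limit (W) -> (energy in J, time in s).
measurements = {
    300: (30_000.0, 100.0),
    250: (27_000.0, 105.0),
    200: (25_000.0, 120.0),
}

TDP = 300.0  # watts; the GPU's maximum power consumption
eta = 0.5    # relative importance between energy and time

def cost(energy: float, time: float) -> float:
    return eta * energy + (1 - eta) * TDP * time

# argmin over candidate power limits; picks 250 W with these numbers.
best_power_limit = min(measurements, key=lambda p: cost(*measurements[p]))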

The power limit optimizer is implemented so that it is compatible with Hugging Face Trainer callbacks:

from zeus.monitor import ZeusMonitor
from zeus.optimizer import GlobalPowerLimitOptimizer

# Data parallel training with four GPUs.
monitor = ZeusMonitor(gpu_indices=[0, 1, 2, 3])
plo = GlobalPowerLimitOptimizer(monitor)

# `train_dataloader` is the user's existing PyTorch DataLoader.
plo.on_epoch_begin()

for x, y in train_dataloader:
    plo.on_step_begin()
    # Forward pass, backward pass, and optimizer step on (x, y).
    plo.on_step_end()

plo.on_epoch_end()

Our publication has additional details.
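
To illustrate the Trainer compatibility mentioned above: since the hook names mirror Trainer events, a thin adapter could forward them. The ZeusCallback class below is a hypothetical sketch of mine, not an existing part of Zeus or Transformers.

from transformers import Trainer, TrainerCallback

class ZeusCallback(TrainerCallback):
    """Hypothetical adapter forwarding Trainer events to Zeus's optimizer."""

    def __init__(self, plo):
        self.plo = plo

    def on_epoch_begin(self, args, state, control, **kwargs):
        self.plo.on_epoch_begin()

    def on_step_begin(self, args, state, control, **kwargs):
        self.plo.on_step_begin()

    def on_step_end(self, args, state, control, **kwargs):
        self.plo.on_step_end()

    def on_epoch_end(self, args, state, control, **kwargs):
        self.plo.on_epoch_end()

# Usage: trainer = Trainer(..., callbacks=[ZeusCallback(plo)])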

Your contribution

I would love to get helping hands, but I also acknowledge that if there were already plenty of people willing to implement all of this, we wouldn't be talking about raising awareness in the first place. ;) So by default, I expect to do everything I mentioned here myself. As the maintainer of Zeus, I can also make changes to Zeus whenever specific needs arise during and after the integration.

I can dive right into integration with a PR, or I can post a more detailed implementation plan RFC -- whichever works for existing contributors. I am willing to smooth out rough edges, fix bugs, and add more features in the future. Zeus is a central part of my ongoing PhD work and I have at least three more years to go, so I have good motivation and incentive.

LysandreJik commented 9 months ago

Maybe of interest to @sashavor :)

sashavor commented 9 months ago

But we already integrated CodeCarbon, right? Not clear what this adds on top of that

jaywonchung commented 9 months ago

Thanks for the comment. I would like to point you to the Current State section in the RFC body. In short, the CodeCarbon integration with Hugging Face is not being actively maintained (known issues remain unresolved), and it provides an estimate of carbon emissions, which is difficult to optimize against. The end goal of this RFC is not reporting, but introducing tooling for optimizing energy consumption.

sashavor commented 9 months ago

I think it makes more sense to maintain codecarbon rather than add another package. We were just talking about this with @julien-c the other day, we hope to pursue this in the very near future :hugs:

jaywonchung commented 9 months ago

The codecarbon integration being maintained is great news for the community, thank you! But I would like to again make clear the gist of this RFC: I believe reporting should not be the end goal; reporting is a means toward optimization, and I don't think codecarbon is good in that respect. Optimization of course does not have to happen through Zeus, but with Transformers being an open source framework, an active maintainer can help things actually move.

sashavor commented 9 months ago

I agree that an active maintainer is useful, which is why we were talking about it with @julien-c :)

jaywonchung commented 9 months ago

I'm happy to hear that there could potentially be an active maintainer for energy/carbon issues at Hugging Face. And I understand that integrating with an external package is by no means a light decision; it's up to the repository maintainers to make the call. When Hugging Face starts thinking about energy and carbon optimization, it would be great if we could chat and see how I can be of assistance :)

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.