Maybe of interest to @sashavor :)
But we already integrated CodeCarbon, right? Not clear what this adds on top of that
Thanks for the comment. I would like to point you to the Current State section in the RFC body. In short, the integration of CodeCarbon with Hugging Face is not being maintained at all, with known issues left unresolved, and it provides an estimate of carbon emissions, which is difficult to optimize. The end goal of this RFC is not reporting, but introducing the tooling for optimizing energy consumption.
I think it makes more sense to maintain `codecarbon` rather than add another package. We were just talking about this with @julien-c the other day, and we hope to pursue this in the very near future :hugs:
The `codecarbon` integration being maintained is great news for the community, thank you! But I would like to again make clear the gist of this RFC: I believe reporting should not be the end goal; reporting is a means for optimization, and I don't think `codecarbon` is good in that respect. Optimization of course does not have to happen through Zeus, but with Transformers being an open-source framework, an active maintainer can help things actually move.
I agree that an active maintainer is useful, which is why we were talking about it with @julien-c :)
I'm happy to hear that there could potentially be an active maintainer for energy/carbon issues in Hugging Face. And I understand that integrating with an external package is by no means a light decision, and it's up to the repository maintainers to make the call. When Hugging Face starts thinking about energy and carbon optimization, it would be great if we could chat and see how we can be of assistance :)
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
This RFC suggests measuring the GPU energy consumption of training, reporting it in model cards, and eventually optimizing it. To do this, I believe that integrating Zeus (homepage, repository) with Hugging Face is a good idea.
Disclaimer: I am the maintainer of the Zeus library.
Motivation
Energy measurement and optimization
Deep Learning consumes a lot of energy and thus emits a lot of greenhouse gas. Optimizing the energy consumption/carbon emission of deep learning promotes sustainability and, depending on the user, yields financial benefits by reducing electricity bills and/or carbon offsetting costs.
The goal of tracking energy consumption or carbon emission would be to first raise awareness, and at the same time, facilitate optimization. For both purposes, having accurate and objective measurements is critical. Especially for optimization, people should be able to understand what happens to their optimization objective when they tweak parameters, which is very difficult if the objective is not concretely measurable.
Current state
Hugging Face supports reporting carbon-equivalent emissions for the trained model on the Hub with an optional `co2_eq_emissions` entry in model cards. Today, about 0.6% of the models on the Hugging Face Hub have the Carbon Emissions label, which I assume are the model cards that have CO2eq emissions reported. This was also pointed out by a recent study in an academic context -- "... a stalled proportion of carbon emissions-reporting models, ...". So this isn't working ideally at the moment.
Hugging Face tracks carbon emissions via `codecarbon`, but I believe this has a couple of issues.
- #13231: It was acknowledged that `codecarbon` has some quirks and its integration with Transformers is not ideal, but the issue was closed due to lack of activity.
- Probably the largest problem is the lack of maintainers more than anything. The only code commit related to `codecarbon` is the one that introduced it (037e466b105), and the author of the commit is no longer with Hugging Face. This prevents turning carbon accounting on by default.
- `codecarbon` primarily focuses on reporting.
Proposal
First, I would like to make clear that I'm not arguing that we should remove or replace `codecarbon`. Rather, I am suggesting that we should also have GPU energy consumption, which yields objective and consistent measurements (regardless of the user's geographical location or time of day) and better potential for optimization (because it's not an estimation), via a software framework that is designed for it (Zeus).
Reducing energy consumption always leads to less operational carbon emission. Also, with a concrete energy measurement in model cards, people can always reconstruct carbon emissions by multiplying the energy with the average carbon intensity of the geographical location and time period in which the training took place. In the future, when people get free and more accurate real-time carbon intensity data, carbon estimates can be retroactively improved based on energy consumption, too.
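As a concrete illustration of that reconstruction, here is a toy computation with made-up numbers (both the energy value and the carbon intensity are assumptions for illustration):

```python
# Toy example: reconstructing CO2eq emissions from a measured GPU energy value.
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 million joules

energy_joules = 7.2e8      # hypothetical GPU energy reported in a model card
carbon_intensity = 400.0   # assumed average gCO2eq/kWh for the location and period

energy_kwh = energy_joules / JOULES_PER_KWH     # 200.0 kWh
co2eq_kg = energy_kwh * carbon_intensity / 1e3  # 80.0 kg CO2eq
print(f"{energy_kwh:.1f} kWh -> {co2eq_kg:.1f} kg CO2eq")
```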
Integration considerations
Tracking energy consumption is a cross-cutting concern. This is a non-exhaustive list of considerations and my opinions.
Implementation considerations
- NVML (`nvidia-smi` is a wrapper of NVML) is required for NVIDIA GPUs (ROCm SMI for AMD GPUs). Fortunately, NVML (`libnvidia-ml.so`) has been part of the CUDA toolkit since version 8.0, and even the `nvidia/cuda` base and `pytorch/pytorch` official Docker images ship with NVML baked in. Zeus is pure Python.
- Energy is measured through NVML directly, without spawning `nvidia-smi` during training.
- For saving the measured energy alongside other training metrics, `Trainer.save_metrics` is the right place.
- Measurement does not require users to run `nvidia-smi` during training and will stay transparent to users. Our code should check whether NVML/ROCm SMI is installed in the user's environment and just disable itself if not, instead of raising an error (see the sketch after this list).
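A minimal sketch of that availability check, assuming the `pynvml` NVML bindings (the function name and wiring here are hypothetical, not Zeus's actual code):

```python
# Sketch: detect NVML availability and silently disable energy tracking if absent.
def energy_measurement_available() -> bool:
    try:
        import pynvml  # Python bindings for NVML (libnvidia-ml.so)
        pynvml.nvmlInit()      # raises if the NVIDIA driver/NVML is not present
        pynvml.nvmlShutdown()
        return True
    except Exception:
        return False

# Disable ourselves gracefully instead of raising an error.
ENERGY_TRACKING_ENABLED = energy_measurement_available()
```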
Policy considerations
- Accurately measuring the time and energy of a training iteration requires synchronizing the CPU with the GPU (e.g., `torch.cuda.synchronize` or `jax.block_until_ready`), which slows down training. So this should be an opt-in feature that is only done a bounded number of times for the purpose of, for instance, profiling for energy optimization (see the sketch after this list).
- The schema of the reported energy metric should be decided (e.g., `energy_consumption_joules.gpu: list[float]` -- one float per GPU).
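For the bounded, opt-in measurement, a sketch using Zeus's measurement-window API (names follow my reading of the Zeus docs and may differ in detail):

```python
# Sketch: measure exactly one bounded window of training with Zeus.
from zeus.monitor import ZeusMonitor

monitor = ZeusMonitor(gpu_indices=[0])

monitor.begin_window("profile-10-steps")
# ... run exactly 10 training steps here ...
measurement = monitor.end_window("profile-10-steps")  # synchronizes the GPU once

print(measurement.time)          # elapsed seconds for the window
print(measurement.total_energy)  # joules consumed by the monitored GPUs
```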
Optimizing energy consumption
While this may not be an immediate next milestone, integrating Zeus with Hugging Face Transformers has energy optimization as its core goal.
Zeus currently offers two optimization methods that find the optimal GPU power limit $p$ during training:

$$\min_{p} \;\; \eta \cdot \textrm{Energy}(p) + (1 - \eta) \cdot \textrm{TDP} \cdot \textrm{Time}(p)$$

and

$$\min_{p} \;\; \textrm{Energy}(p) \quad \textrm{subject to} \quad \textrm{Time}(p) \le s \cdot \textrm{Time}(\textrm{TDP}),$$
where the user chooses $0 \le \eta \le 1$ (relative importance between time and energy) or $s \ge 1$ (maximum tolerable slowdown ratio). $\textrm{TDP}$ is the maximum power consumption of the GPU. For instance, the second optimization method given $s = 1.1$ will find the power limit that consumes the least energy while bounding training iteration time below 110% of the original training iteration time.
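To make the second method concrete, here is a toy selection over made-up profiling numbers (the power limits, times, and energies are invented for illustration):

```python
# Toy example of max-slowdown power limit selection (all numbers made up).
# Maps power limit (W) -> (iteration time in s, iteration energy in J).
profile = {
    300: (1.00, 240.0),  # 300 W is the TDP, so this row is the baseline
    250: (1.05, 215.0),
    200: (1.18, 200.0),
}

s = 1.1  # maximum tolerable slowdown ratio
baseline_time = profile[300][0]

# Keep power limits within the slowdown bound, then pick the lowest-energy one.
feasible = {p: energy for p, (time, energy) in profile.items()
            if time <= s * baseline_time}
best_power_limit = min(feasible, key=feasible.get)
print(best_power_limit)  # 250 -- least energy while staying under 110% of baseline
```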
The power limit optimizer is implemented so that it's compatible with Hugging Face Trainer callbacks.
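For instance, the wiring could look roughly like the following; `HFGlobalPowerLimitOptimizer` and its constructor reflect my understanding of Zeus's Trainer-compatible wrapper and may differ in detail:

```python
# Sketch: attaching a Zeus power limit optimizer to the HF Trainer as a callback.
from transformers import Trainer, TrainingArguments
from zeus.monitor import ZeusMonitor
from zeus.optimizer.power_limit import HFGlobalPowerLimitOptimizer

monitor = ZeusMonitor()  # monitors all GPUs visible to the process by default
power_limit_callback = HFGlobalPowerLimitOptimizer(monitor)

trainer = Trainer(
    model=model,                       # assumed: a PreTrainedModel defined elsewhere
    args=TrainingArguments(output_dir="out"),
    train_dataset=train_dataset,       # assumed: defined elsewhere
    callbacks=[power_limit_callback],  # profiles power limits, then settles on the best
)
trainer.train()
```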
Our publication has additional details.
Your contribution
I would love to get helping hands, but I also acknowledge that we wouldn't be talking about raising awareness if there were plenty of people willing to implement these. ;) So by default, I'll expect to do everything I mentioned here myself. As the maintainer of Zeus, I can make changes to Zeus whenever specific needs arise during and after integration.
I can dive right into integration with a PR, or I can post a more detailed implementation plan RFC -- whichever works for existing contributors. I am willing to smooth out rough edges, fix bugs, and add more features in the future. Zeus is a central part of my ongoing PhD work and I have at least three more years to go, so I have good motivation and incentive.