Maybe of interest to @sashavor :)
But we already integrated CodeCarbon, right? Not clear what this adds on top of that
Thanks for the comment. I would like to point you to the Current State section in the RFC body. In short, the integration of CodeCarbon with Hugging Face is not being maintained at all, with known issues left unresolved, and it provides an estimate of carbon emissions, which is difficult to optimize. The end goal of this RFC is not reporting, but introducing the tooling for optimizing energy consumption.
I think it makes more sense to maintain `codecarbon` rather than add another package. We were just talking about this with @julien-c the other day, and we hope to pursue this in the very near future :hugs:
The `codecarbon` integration being maintained is great news for the community, thank you! But I would like to again make clear the gist of this RFC: I believe reporting should not be the end goal; reporting is a means for optimization, and I don't think `codecarbon` is good in that respect. Optimization of course does not have to happen through Zeus, but with Transformers being an open-source framework, an active maintainer can help things actually move.
I agree that an active maintainer is useful, which is why we were talking about it with @julien-c :)
I'm happy to hear that there could potentially be an active maintainer for energy/carbon issues in Hugging Face. And I understand that integrating with an external package is by no means a light decision, and it's up to the repository maintainers to make the call. When Hugging Face starts thinking about energy and carbon optimization, it would be great if we could chat and see how we can be of assistance :)
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
This RFC suggests measuring the GPU energy consumption of training, reporting it in model cards, and eventually optimizing it. To do this, I believe that integrating Zeus (homepage, repository) with Hugging Face is a good idea.
Disclaimer: I am the maintainer of the Zeus library.
Motivation
Energy measurement and optimization
Deep Learning consumes a lot of energy and thus emits a lot of greenhouse gas. Optimizing the energy consumption/carbon emission of deep learning promotes sustainability and, depending on the user, yields financial benefits by reducing electricity bills and/or carbon offsetting costs.
The goal of tracking energy consumption or carbon emission would be to first raise awareness, and at the same time, facilitate optimization. For both purposes, having accurate and objective measurements is critical. Especially for optimization, people should be able to understand what happens to their optimization objective when they tweak parameters, which is very difficult if the objective is not concretely measurable.
Current state
Hugging Face supports reporting carbon-equivalent emissions for the trained model on the Hub with an optional `co2_eq_emissions` entry in model cards. Today, about 0.6% of the models on the Hugging Face Hub have the Carbon Emissions label, which I assume are the model cards that have CO2eq emissions reported. This was also pointed out by a recent study in an academic context -- "... a stalled proportion of carbon emissions-reporting models, ...". So this isn't working ideally at the moment.
Hugging Face tracks carbon emissions via `codecarbon`, but I believe this has a couple of issues.
- #13231: It was acknowledged that `codecarbon` has some quirks and its integration with Transformers is not ideal, but the issue was closed due to lack of activity.
- Probably the largest problem is the lack of maintainers more than anything. The only code commit related to `codecarbon` is the one that introduced it (037e466b105), and the author of the commit is no longer with Hugging Face. This prevents turning carbon accounting on by default.
- `codecarbon` primarily focuses on reporting.
Proposal
First, I would like to make clear that I'm not arguing that we should remove or replace `codecarbon`. Rather, I am suggesting that we should also have GPU energy consumption, which yields objective and consistent measurements (regardless of the user's geographical location or time of day) and better potential for optimization (because it's not an estimation), via a software framework that is designed for it (Zeus).
Reducing energy consumption always leads to less operational carbon emission. Also, with a concrete energy measurement in model cards, people can always reconstruct carbon emissions by multiplying the energy with the average carbon intensity of the geographical location and time period in which the training took place. In the future, when people get free and more accurate real-time carbon intensity data, carbon estimates can be retroactively improved based on energy consumption, too.
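As a concrete illustration of that reconstruction, here is a toy computation with made-up numbers (both the energy value and the carbon intensity are assumptions for illustration):

```python
# Toy example: reconstructing CO2eq emissions from a measured GPU energy value.
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 million joules

energy_joules = 7.2e8      # hypothetical GPU energy reported in a model card
carbon_intensity = 400.0   # assumed average gCO2eq/kWh for the location and period

energy_kwh = energy_joules / JOULES_PER_KWH     # 200.0 kWh
co2eq_kg = energy_kwh * carbon_intensity / 1e3  # 80.0 kg CO2eq
print(f"{energy_kwh:.1f} kWh -> {co2eq_kg:.1f} kg CO2eq")
```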
Integration considerations
Tracking energy consumption is a cross-cutting concern. This is a non-exhaustive list of considerations and my opinions.
Implementation considerations
- NVML (`nvidia-smi` is a wrapper of NVML) is required for NVIDIA GPUs (ROCm SMI for AMD GPUs). Fortunately, NVML (`libnvidia-ml.so`) has been part of the CUDA toolkit since version 8.0, and even the `nvidia/cuda` base and `pytorch/pytorch` official Docker images ship with NVML baked in. Zeus is pure Python.
- Energy is measured through NVML directly, without spawning `nvidia-smi` during training.
- For saving the measured energy alongside other training metrics, `Trainer.save_metrics` is the right place.
- Measurement does not require users to run `nvidia-smi` during training and will stay transparent to users. Our code should check whether NVML/ROCm SMI is installed in the user's environment and just disable itself if not, instead of raising an error (see the sketch after this list).
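A minimal sketch of that availability check, assuming the `pynvml` NVML bindings (the function name and wiring here are hypothetical, not Zeus's actual code):

```python
# Sketch: detect NVML availability and silently disable energy tracking if absent.
def energy_measurement_available() -> bool:
    try:
        import pynvml  # Python bindings for NVML (libnvidia-ml.so)
        pynvml.nvmlInit()      # raises if the NVIDIA driver/NVML is not present
        pynvml.nvmlShutdown()
        return True
    except Exception:
        return False

# Disable ourselves gracefully instead of raising an error.
ENERGY_TRACKING_ENABLED = energy_measurement_available()
```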
Policy considerations
- Accurately measuring the time and energy of a training iteration requires synchronizing the CPU with the GPU (e.g., `torch.cuda.synchronize` or `jax.block_until_ready`), which slows down training. So this should be an opt-in feature that is only done a bounded number of times for the purpose of, for instance, profiling for energy optimization (see the sketch after this list).
- The schema of the reported energy metric should be decided (e.g., `energy_consumption_joules.gpu: list[float]` -- one float per GPU).
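For the bounded, opt-in measurement, a sketch using Zeus's measurement-window API (names follow my reading of the Zeus docs and may differ in detail):

```python
# Sketch: measure exactly one bounded window of training with Zeus.
from zeus.monitor import ZeusMonitor

monitor = ZeusMonitor(gpu_indices=[0])

monitor.begin_window("profile-10-steps")
# ... run exactly 10 training steps here ...
measurement = monitor.end_window("profile-10-steps")  # synchronizes the GPU once

print(measurement.time)          # elapsed seconds for the window
print(measurement.total_energy)  # joules consumed by the monitored GPUs
```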
Optimizing energy consumption
While this may not be an immediate next milestone, integrating Zeus with Hugging Face Transformers has energy optimization as its core goal.
Zeus currently offers two optimization methods that find the optimal GPU power limit $p$ during training:

$$\min_{p} \;\; \eta \cdot \textrm{Energy}(p) + (1 - \eta) \cdot \textrm{TDP} \cdot \textrm{Time}(p)$$

and

$$\min_{p} \;\; \textrm{Energy}(p) \quad \textrm{subject to} \quad \textrm{Time}(p) \le s \cdot \textrm{Time}(\textrm{TDP}),$$
where the user chooses $0 \le \eta \le 1$ (relative importance between time and energy) or $s \ge 1$ (maximum tolerable slowdown ratio). $\textrm{TDP}$ is the maximum power consumption of the GPU. For instance, the second optimization method given $s = 1.1$ will find the power limit that consumes the least energy while bounding training iteration time below 110% of the original training iteration time.
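To make the second method concrete, here is a toy selection over made-up profiling numbers (the power limits, times, and energies are invented for illustration):

```python
# Toy example of max-slowdown power limit selection (all numbers made up).
# Maps power limit (W) -> (iteration time in s, iteration energy in J).
profile = {
    300: (1.00, 240.0),  # 300 W is the TDP, so this row is the baseline
    250: (1.05, 215.0),
    200: (1.18, 200.0),
}

s = 1.1  # maximum tolerable slowdown ratio
baseline_time = profile[300][0]

# Keep power limits within the slowdown bound, then pick the lowest-energy one.
feasible = {p: energy for p, (time, energy) in profile.items()
            if time <= s * baseline_time}
best_power_limit = min(feasible, key=feasible.get)
print(best_power_limit)  # 250 -- least energy while staying under 110% of baseline
```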
The power limit optimizer is implemented so that it's compatible with Hugging Face Trainer callbacks.
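For instance, the wiring could look roughly like the following; `HFGlobalPowerLimitOptimizer` and its constructor reflect my understanding of Zeus's Trainer-compatible wrapper and may differ in detail:

```python
# Sketch: attaching a Zeus power limit optimizer to the HF Trainer as a callback.
from transformers import Trainer, TrainingArguments
from zeus.monitor import ZeusMonitor
from zeus.optimizer.power_limit import HFGlobalPowerLimitOptimizer

monitor = ZeusMonitor()  # monitors all GPUs visible to the process by default
power_limit_callback = HFGlobalPowerLimitOptimizer(monitor)

trainer = Trainer(
    model=model,                       # assumed: a PreTrainedModel defined elsewhere
    args=TrainingArguments(output_dir="out"),
    train_dataset=train_dataset,       # assumed: defined elsewhere
    callbacks=[power_limit_callback],  # profiles power limits, then settles on the best
)
trainer.train()
```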
Our publication has additional details.
Your contribution
I would love to get helping hands, but I also acknowledge that we wouldn't be talking about raising awareness if there were plenty of people willing to implement these. ;) So by default, I'll expect to do everything I mentioned here myself. As the maintainer of Zeus, I can make changes to Zeus whenever specific needs arise during and after integration.
I can dive right into integration with a PR, or I can post a more detailed implementation plan RFC -- whichever works for existing contributors. I am willing to smooth out rough edges, fix bugs, and add more features in the future. Zeus is a central part of my ongoing PhD work and I have at least three more years to go, so I have good motivation and incentive.