ml-energy / zeus

Deep Learning Energy Measurement and Optimization
https://ml.energy/zeus
Apache License 2.0

[RFC] Carbon-Aware-Zeus #8

Closed zyang37 closed 1 year ago

zyang37 commented 1 year ago

Make Zeus Carbon-Aware! (DONE)

Overview

Nowadays, training deep neural networks (DNNs) consumes an increasing amount of energy. Zeus explores the trade-off between energy consumption and performance, finding the optimal GPU-level configuration and batch size for DNN training. However, not all electricity is produced in the same way: burning fossil fuels is the most common way to produce electricity, but there are also cleaner sources like wind and solar. The carbon intensity of electricity (grams of carbon per kilowatt-hour) measures how clean the electricity is. Depending on when and where you run your training job, the carbon intensity of electricity can range from 0 to #kg per kilowatt-hour! To take Zeus to the next level, we participated in Carbon Hack 22, leveraging the Carbon Aware SDK/API to tie Zeus to carbon footprint. project page

Proposed Design

The original Zeus objective (cost) function contains two main components: Energy-to-Accuracy (ETA) and Time-to-Accuracy (TTA). We plan to replace Energy-to-Accuracy (ETA) with Carbon-to-Accuracy (CTA). CTA can be obtained by multiplying ETA by the average carbon intensity over the period in question. Because carbon emissions and energy consumption have a linear relationship, the two will be optimized jointly. The only special case is when carbon intensity is 0, but then the electricity at that time is 100% clean. Ideally, to obtain an accurate CTA, the new cost function should use the forecasted carbon intensity for the next epoch. Zeus will then adjust the GPU power limit to minimize the cost.
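To make the proposal concrete, here is a minimal Python sketch of one way the carbon-aware cost could be computed and used to pick a power limit. It is not the code in the Zeus repository. The function names, the `PROFILES` table, its numbers, and the choice to scale the time term by a maximum carbon intensity are all assumptions made for illustration. Energy is assumed to be in joules and carbon intensity in grams per kilowatt-hour.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 MJ


def carbon_to_accuracy(eta_joules: float, avg_intensity: float) -> float:
    """Grams of carbon emitted: energy (J) times average intensity (g/kWh)."""
    return eta_joules / JOULES_PER_KWH * avg_intensity


def carbon_aware_cost(eta_joules, tta_seconds, avg_intensity,
                      eta_knob, max_power, max_intensity):
    """Hypothetical cost: weighted sum of a carbon term and a time term.

    The time term is converted to grams by assuming the job draws
    `max_power` watts at `max_intensity` g/kWh, so both terms share a unit.
    This scaling is an assumption, not something the RFC specifies.
    """
    cta = carbon_to_accuracy(eta_joules, avg_intensity)
    time_term = carbon_to_accuracy(max_power * tta_seconds, max_intensity)
    return eta_knob * cta + (1 - eta_knob) * time_term


def pick_power_limit(profiles, forecast_intensity, eta_knob=0.5,
                     max_power=300.0, max_intensity=1000.0):
    """Return the power limit (W) with the lowest cost for the next epoch.

    `profiles` maps power limit -> (energy per epoch in J, epoch time in s).
    """
    def cost(power_limit):
        energy, seconds = profiles[power_limit]
        return carbon_aware_cost(energy, seconds, forecast_intensity,
                                 eta_knob, max_power, max_intensity)
    return min(profiles, key=cost)


# Made-up per-epoch measurements: a lower limit saves energy but is slower.
PROFILES = {300: (28000.0, 100.0), 200: (22000.0, 110.0)}

if __name__ == "__main__":
    for intensity in (100.0, 800.0):
        print(intensity, "g/kWh ->", pick_power_limit(PROFILES, intensity), "W")
```

With these made-up numbers the lower power limit wins only once the forecasted intensity is high enough, which matches the intended behavior of lowering the power limit when electricity is dirty.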

Goal

Detailed Plan

Carbon Hack 22

### Week 1 (Oct. 17 - Oct. 21):
- [x] Understand ZeusDataLoader + analyze.py
- [x] Notes from Luoxi
- [x] Python tool kit for accessing the [Carbon-Aware WebApi](https://carbon-aware-api.azurewebsites.net/swagger/index.html)
- [x] Get carbon intensity given a timeframe from the past.
- [x] Ideally, future carbon intensity for a given time frame (Estimated epoch time)
- [x] Let it run for at least a day and plot the data (intensity vs time) for a sanity check.
- [x] Try to get forecast carbon data (discord)

### Week 2 (Oct. 24 - Oct. 28):
- [x] Integrate API into ZeusDataLoader (push updates to a new branch)
- [x] Estimate epoch time (Use `ZeusDataLoader.train_epoch_time` and `ZeusDataLoader.eval_epoch_time`)
- [x] Change the optimization function (when carbon intensity is high, lower GPU power limit)
- [x] Comparison experiment (Carbon-unaware Zeus v.s. Carbon-aware Zeus)

### Week 3 (Oct. 31 - Nov. 4):
- [x] Gather results
- [x] Plot: Time vs power limit (use a long training job)
- [x] Total carbon emission
- [x] Work on the slides
- [x] Writing script for the video pitch
- [x] Text pitch + Buffer
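The plan items on fetching carbon intensity for a time frame and estimating epoch time combine into one step: averaging intensity over the window the next epoch is expected to occupy. Below is a rough sketch of that averaging, assuming the intensity data has already been fetched as a sorted list of `(timestamp, g/kWh)` samples and that each value holds until the next sample. The function name and sample format are hypothetical, and no call to the Carbon-Aware WebApi is shown.

```python
from datetime import datetime, timedelta


def average_intensity(samples, start, end):
    """Time-weighted mean carbon intensity over [start, end].

    `samples` is a list of (datetime, g/kWh) sorted by time. Each value is
    assumed to hold until the next sample, and the last one until `end`.
    Parts of the window before the first sample are ignored.
    """
    if end <= start:
        raise ValueError("end must be after start")
    weighted_sum = 0.0
    covered_seconds = 0.0
    for i, (timestamp, value) in enumerate(samples):
        next_timestamp = samples[i + 1][0] if i + 1 < len(samples) else end
        segment_start = max(timestamp, start)
        segment_end = min(next_timestamp, end)
        if segment_end > segment_start:
            seconds = (segment_end - segment_start).total_seconds()
            weighted_sum += value * seconds
            covered_seconds += seconds
    if covered_seconds == 0.0:
        raise ValueError("no samples cover the requested window")
    return weighted_sum / covered_seconds


if __name__ == "__main__":
    t0 = datetime(2022, 10, 24, 0, 0)
    forecast = [(t0, 100.0), (t0 + timedelta(hours=1), 300.0)]
    # Assumed epoch estimates in seconds, e.g. taken from
    # ZeusDataLoader.train_epoch_time and ZeusDataLoader.eval_epoch_time.
    train_epoch_time, eval_epoch_time = 3000.0, 600.0
    epoch_start = t0 + timedelta(minutes=30)
    epoch_end = epoch_start + timedelta(seconds=train_epoch_time + eval_epoch_time)
    print(average_intensity(forecast, epoch_start, epoch_end))
```

The resulting average is the value that would multiply the epoch's energy to give its carbon emission.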

Deployment

Comments welcome!

jaywonchung commented 1 year ago

Thanks for the write up! This is awesome. Some comments and questions:

jaywonchung commented 1 year ago

Also, we will probably need some energy and carbon reporting, too: once after each epoch, plus total energy and carbon when training finishes. It can be super crude and ugly for the Hackathon.
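A crude reporter along these lines might look like the sketch below. The `CarbonReporter` class and its methods are hypothetical, not part of Zeus. It assumes the caller already has the energy measured for each epoch (in joules) and the average carbon intensity for that epoch (in g/kWh).

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 MJ


class CarbonReporter:
    """Accumulates per-epoch energy and carbon and prints simple reports."""

    def __init__(self):
        self.epochs = 0
        self.total_energy_j = 0.0
        self.total_carbon_g = 0.0

    def log_epoch(self, energy_j: float, intensity_g_per_kwh: float) -> float:
        """Record one epoch, print a one-line report, return its carbon (g)."""
        carbon_g = energy_j / JOULES_PER_KWH * intensity_g_per_kwh
        self.epochs += 1
        self.total_energy_j += energy_j
        self.total_carbon_g += carbon_g
        print(f"[epoch {self.epochs}] energy={energy_j:.1f} J, "
              f"carbon={carbon_g:.3f} g")
        return carbon_g

    def summary(self) -> str:
        """Return the totals as a single line, to print when training ends."""
        return (f"[total] epochs={self.epochs}, "
                f"energy={self.total_energy_j:.1f} J, "
                f"carbon={self.total_carbon_g:.3f} g")


if __name__ == "__main__":
    reporter = CarbonReporter()
    reporter.log_epoch(3.6e6, 200.0)
    reporter.log_epoch(1.8e6, 400.0)
    print(reporter.summary())
```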

jaywonchung commented 1 year ago

@zyang37 Great job making this thing happen!! Thanks again for your work 👍

This is a great first step towards carbon-awareness in Zeus. We should upstream this feature. It would be great if you can update this issue with a plan to "featurize" everything that's hardcoded at the moment. Also, now code quality should match its surroundings.

Some random bullets:

zyang37 commented 1 year ago

Thanks for your comment and support! I will look into this.

As you mentioned in the 2nd bullet point, the current setup for MAX_CARBON_INTENSITY and eta_knob is not 100% concrete, and I think it would be good to also spend some time refining the carbon cost function.

jaywonchung commented 1 year ago

While I agree that the cost metric itself has a lot of room for improvement, I would have to say that it might be a challenging research problem that should be treated separately. Since the current version of the cost metric works, how about we first upstream this as is for the time being?