ray-project / llm-numbers

Numbers every LLM developer should know

Typical energy consumption / emitted CO2 of an LLM for serving #27

Open lcrmorin opened 10 months ago

lcrmorin commented 10 months ago

The numbers provided are in terms of memory usage. It would be nice to also provide numbers in terms of energy consumption. My current numbers show that a single LLM inference can cost twice the energy of a phone charge.

See: https://www.kaggle.com/code/lucasmorin/mistral-7-b-instruct-electricity-co2-consumption

I get:

Hardware | Energy consumed (Wh)
-- | --
GPU T4x2 | 6.2
GPU P100 | not supported
TPU VM v3-8 | 8.8

for a Mistral-7B inference, while a full charge of the latest iPhone is generally quoted at about 3 Wh.
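For reference, the measurement can be reproduced with something like the following (a minimal sketch using the codecarbon library; the notebook's actual setup and the exact attribute names may differ depending on the codecarbon version):

```python
# Sketch: measure energy / CO2 of one Mistral-7B inference with codecarbon.
# Illustrative only -- not the exact code from the linked notebook.
from codecarbon import EmissionsTracker
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

tracker = EmissionsTracker(measure_power_secs=1)  # sample power every second
tracker.start()

inputs = tokenizer("Explain the rules of chess.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)

emissions_kg = tracker.stop()  # kg CO2-eq for the tracked span
# energy_consumed is reported in kWh in codecarbon's EmissionsData
energy_wh = tracker.final_emissions_data.energy_consumed * 1000
print(f"energy: {energy_wh:.2f} Wh, emissions: {emissions_kg * 1000:.2f} g CO2-eq")
```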

fenn commented 7 months ago

these numbers seem way off. it's time to get out the napkin!

an iphone 15 battery holds 13 Wh per full charge, and it will take ~15 Wh of wall energy to charge it fully. vLLM can generate 1900 token/s on a single A100 GPU which uses less than 400W, or 0.0585 mWh/token. even for a 512 token response (this is a very long response) that's only 0.03 Wh per request. the vLLM authors report real-world throughput of 112 requests/min on an A10G which uses less than 150W, which works out to 0.02 Wh per request, or > 600 requests per phone charge equivalent. (this is $0.00001 / request or 100,000 requests/dollar in electricity costs, even at bloated hawaii electricity prices!)
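here's the same arithmetic as a few lines of python, if you want to plug in your own power/throughput numbers (the figures below are the ones quoted above, not new measurements):

```python
# napkin math, parameterized
PHONE_CHARGE_WH = 15.0  # ~wall energy for one iphone 15 charge

def wh_per_request(gpu_watts, tokens_per_s, tokens_per_request):
    """Energy per request from sustained GPU power and generation throughput."""
    joules = gpu_watts / tokens_per_s * tokens_per_request
    return joules / 3600.0  # J -> Wh

# A100 at <=400 W, 1900 tok/s, long 512-token response
a100 = wh_per_request(400, 1900, 512)   # ~0.03 Wh
# A10G at <=150 W, 112 requests/min -> energy per request directly
a10g = 150 * 60 / 112 / 3600            # ~0.02 Wh

print(f"A100: {a100:.3f} Wh/request -> {PHONE_CHARGE_WH / a100:.0f} requests per phone charge")
print(f"A10G: {a10g:.3f} Wh/request -> {PHONE_CHARGE_WH / a10g:.0f} requests per phone charge")
# electricity cost at ~$0.40/kWh (hawaii-ish)
print(f"cost: ${a10g / 1000 * 0.40:.6f} per request")
```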

the energy usage will be proportionally higher for larger model size, and also for smaller batch sizes, but not 1000x larger. at least, not until you get into really big models like GPT-4, with really long contexts, but those models are often located in large datacenters that use hydroelectric or wind power, so it gets more complicated to calculate the CO2 equivalent. the most impactful thing you can do for reducing energy consumption is to use larger batch sizes on newer chips, and after that try to use smaller models and context lengths.
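to put the batching point in numbers: board power stays roughly flat under load while aggregate throughput grows with batch size until the GPU saturates, so energy per token drops almost linearly with batch size. the throughput figures in this sketch are made up for illustration, not measurements:

```python
# illustrative only: assumed throughput scaling, not measured numbers
GPU_WATTS = 400  # A100-class board power, roughly constant under load

# hypothetical aggregate generation throughput (tokens/s) at different batch sizes
throughput = {1: 90, 4: 340, 16: 1100, 64: 1900}

for batch, tok_s in throughput.items():
    mwh_per_token = GPU_WATTS / tok_s / 3600 * 1000  # W / (tok/s) -> mWh/token
    print(f"batch {batch:>3}: {mwh_per_token:.3f} mWh/token")
```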