ray-project / llmperf

LLMPerf is a library for validating and benchmarking LLMs
Apache License 2.0

Add memory bandwidth utilization metric #31

Open mmcclean-aws opened 10 months ago

mmcclean-aws commented 10 months ago

One of the key metrics for determining whether an LLM inference server is performant is memory bandwidth utilization (MBU). It is a function of the achieved token throughput and the total GPU/accelerator HBM bandwidth. The calculation is taken from the PyTorch blog post here: https://pytorch.org/blog/accelerating-generative-ai-2/#step-2-alleviating-memory-bandwidth-bottleneck-through-int8-weight-only-quantization-1574-toks
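
For illustration, here is a minimal sketch of how such a metric could be computed from the quantities mentioned above. The function and parameter names are hypothetical and not part of llmperf's API; the formula follows the linked PyTorch blog post (achieved bandwidth ≈ model size in bytes × tokens/sec, divided by peak HBM bandwidth):

```python
def memory_bandwidth_utilization(
    num_params: float,               # model parameter count, e.g. 7e9 for a 7B model
    bytes_per_param: float,          # 2 for fp16/bf16 weights, 1 for int8, etc.
    tokens_per_second: float,        # measured decode throughput per accelerator
    peak_hbm_bandwidth_gbps: float,  # e.g. ~2039 GB/s for an A100-80GB SXM
) -> float:
    """Fraction of peak HBM bandwidth consumed by streaming the weights.

    Each generated token requires reading roughly all model weights once,
    so achieved bandwidth ≈ model size in bytes * tokens/sec.
    """
    model_size_gb = num_params * bytes_per_param / 1e9
    achieved_bandwidth_gbps = model_size_gb * tokens_per_second
    return achieved_bandwidth_gbps / peak_hbm_bandwidth_gbps


# Example: a 7B model in fp16 decoding at 100 tokens/s on an A100-80GB
print(f"MBU: {memory_bandwidth_utilization(7e9, 2, 100, 2039):.1%}")
```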