
Use LRU cache for CUDA Graphs #2143

Open WoosukKwon opened 9 months ago

WoosukKwon commented 9 months ago

Another way to save memory is to use an LRU cache for this map and capture graphs on demand.

_Originally posted by @scv119 in https://github.com/vllm-project/vllm/pull/1926#discussion_r1427594126_
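For reference, a minimal sketch of the idea, assuming graphs are keyed by batch size and an `OrderedDict` is used for the LRU bookkeeping. The `LRUGraphCache` class and its methods are illustrative only, not vLLM's actual graph-runner API; the point is just that capture happens lazily on first use and the total number of resident graphs stays bounded.

```python
from collections import OrderedDict

import torch


class LRUGraphCache:
    """Hypothetical map from batch size -> captured CUDA graph, bounded by an LRU policy."""

    def __init__(self, max_graphs: int):
        self.max_graphs = max_graphs
        # Each entry holds (graph, static_input, static_output).
        self._entries = OrderedDict()

    def run(self, model: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
        bs = x.shape[0]
        if bs in self._entries:
            self._entries.move_to_end(bs)  # mark as most recently used
        else:
            if len(self._entries) >= self.max_graphs:
                # Evict the least recently used graph; its memory is freed once GC'd.
                self._entries.popitem(last=False)
            self._entries[bs] = self._capture(model, x)  # capture on demand
        graph, static_in, static_out = self._entries[bs]
        static_in.copy_(x)  # replay runs on fixed buffers, so copy the new input in
        graph.replay()
        return static_out

    @staticmethod
    def _capture(model: torch.nn.Module, x: torch.Tensor):
        static_in = x.clone()
        # Warm-up run on a side stream before capture, per the PyTorch CUDA graph docs.
        s = torch.cuda.Stream()
        s.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(s):
            model(static_in)
        torch.cuda.current_stream().wait_stream(s)
        graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(graph):
            static_out = model(static_in)
        return graph, static_in, static_out
```

Compared with capturing a graph for every candidate batch size up front, this trades a one-time capture latency on a cache miss for a cap on how much GPU memory the captured graphs can occupy.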

hmellor commented 5 months ago

@WoosukKwon has this work been done?

SnzFor16Min commented 3 weeks ago

Any update on caching the CUDA graphs?