
Use LRU cache for CUDA Graphs #2143

Open WoosukKwon opened 9 months ago

WoosukKwon commented 9 months ago

Another way to save memory is to use an LRU cache for this map and capture graphs on demand.

_Originally posted by @scv119 in https://github.com/vllm-project/vllm/pull/1926#discussion_r1427594126_
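For reference, a minimal sketch of the idea, assuming graphs are keyed by batch size and an `OrderedDict` is used for the LRU bookkeeping. The `LRUGraphCache` class and its methods are illustrative only, not vLLM's actual graph-runner API; the point is just that capture happens lazily on first use and the total number of resident graphs stays bounded.

```python
from collections import OrderedDict

import torch


class LRUGraphCache:
    """Hypothetical map from batch size -> captured CUDA graph, bounded by an LRU policy."""

    def __init__(self, max_graphs: int):
        self.max_graphs = max_graphs
        # Each entry holds (graph, static_input, static_output).
        self._entries = OrderedDict()

    def run(self, model: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
        bs = x.shape[0]
        if bs in self._entries:
            self._entries.move_to_end(bs)  # mark as most recently used
        else:
            if len(self._entries) >= self.max_graphs:
                # Evict the least recently used graph; its memory is freed once GC'd.
                self._entries.popitem(last=False)
            self._entries[bs] = self._capture(model, x)  # capture on demand
        graph, static_in, static_out = self._entries[bs]
        static_in.copy_(x)  # replay runs on fixed buffers, so copy the new input in
        graph.replay()
        return static_out

    @staticmethod
    def _capture(model: torch.nn.Module, x: torch.Tensor):
        static_in = x.clone()
        # Warm-up run on a side stream before capture, per the PyTorch CUDA graph docs.
        s = torch.cuda.Stream()
        s.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(s):
            model(static_in)
        torch.cuda.current_stream().wait_stream(s)
        graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(graph):
            static_out = model(static_in)
        return graph, static_in, static_out
```

Compared with capturing a graph for every candidate batch size up front, this trades a one-time capture latency on a cache miss for a cap on how much GPU memory the captured graphs can occupy.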

hmellor commented 5 months ago

@WoosukKwon has this work been done?

SnzFor16Min commented 3 weeks ago

Any update on caching the CUDA graphs?