Currently in the serving engine, if you don't provide `max_total_sequence_length` in the `KVCacheConfig`, the engine estimates a value that uses all of the available GPU memory. If you want to lower GPU memory usage, set the `max_total_sequence_length` argument in `KVCacheConfig` to an appropriately smaller value.
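For example, a minimal sketch of capping the cache explicitly (the import path, `Engine` constructor arguments, and model id below are assumptions based on the serve API of that era, not taken from this issue, and may differ across releases):

```python
# Hypothetical sketch: module path and argument names are assumptions.
from mlc_chat.serve import Engine, KVCacheConfig

# Cap the KV cache explicitly instead of letting the engine size it to
# fill all remaining GPU memory.
kv_cache_config = KVCacheConfig(max_total_sequence_length=8192)

engine = Engine(
    model="Llama-2-7b-chat-hf-q4f16_1",  # placeholder model id
    kv_cache_config=kv_cache_config,
)
```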
🐛 Bug
When attempting to test speculative decoding using the predefined speculative-decoding test, memory usage is very high and results in an OOM on my device.
To Reproduce
Steps to reproduce the behavior:
[2024-02-23 03:10:22] INFO engine.py:205: Estimated KVCacheConfig "max_total_sequence_length": 35952.
[2024-02-23 03:10:22] INFO engine.py:210: Estimated total single GPU memory usage: 61050.11 MB (Parameters: 9462.36 MB. KVCache: 50669.85 MB. Temporary buffer: 444.78 MB)
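The ~50 GB KVCache figure is consistent with the auto-estimated sequence length. A rough back-of-envelope check, assuming standard Llama-2 shapes for the two models (7B: 32 layers x 32 KV heads; 13B: 40 layers x 40 KV heads; head dim 128; f16), lands in the same range:

```python
# Rough estimate only; the layer/head counts are assumed Llama-2 defaults.
BYTES_F16 = 2

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int = 128) -> int:
    # One K and one V vector per layer per KV head.
    return 2 * layers * kv_heads * head_dim * BYTES_F16

seq_len = 35952  # the engine's estimated max_total_sequence_length
total = seq_len * (kv_bytes_per_token(32, 32) + kv_bytes_per_token(40, 40))
print(f"{total / 2**20:.0f} MB")  # ~46064 MB, same order as the reported 50669.85 MB
```

The remaining few GB plausibly come from paged-cache bookkeeping and alignment, which this estimate ignores.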
Expected behavior
I expect it to use much less memory, since the models are only 7B and 13B.
Environment