Closed: SinanAkkoyun closed this issue 2 months ago
What are your cache_size and max_seq_len args for the loaded model?

It appears you aren't sending any arguments with the generation request other than stream = False. By default, the max_tokens argument for generation is max_seq_len minus the length of your prompt, which will reserve a full max_seq_len worth of cache on each request. If your cache_size is not greater than max_seq_len in this setting, your prompts will all run sequentially with a max batch size of 1.

By default, cache_size = max_seq_len if not specified, in order to minimize the chance of OOM. For optimal batching, you should ideally specify the maximum cache_size that fits in your VRAM without OOM.
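To make the sizing logic concrete, here is a small back-of-the-envelope sketch. The helper function and the numbers are hypothetical, not part of TabbyAPI: with the defaults, a single request reserves the entire cache, so nothing can batch; a larger cache_size plus an explicit max_tokens leaves room for several requests at once.

```python
# Hypothetical back-of-the-envelope helper (not TabbyAPI code): estimate how many
# requests can share the cache at once, given the settings discussed above.
def concurrent_slots(cache_size: int, prompt_len: int, max_tokens: int) -> int:
    per_request = prompt_len + max_tokens   # tokens of cache each request reserves
    return max(1, cache_size // per_request)

# Default config: cache_size == max_seq_len, and max_tokens defaults to
# max_seq_len - prompt_len, so each request reserves the whole cache -> batch of 1.
print(concurrent_slots(cache_size=4096, prompt_len=512, max_tokens=4096 - 512))  # 1

# A larger cache_size and an explicit max_tokens leave room for real batching.
print(concurrent_slots(cache_size=32768, prompt_len=512, max_tokens=512))        # 32
```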
Thank you! It now works :)
OS
Linux
GPU Library
CUDA 12.x
Python version
3.11
Describe the bug
Hey, thank you for the awesome work, I greatly appreciate it! When running the API with default config and then running 10 concurrent API chat requests, no batching is happening at all. All requests run sequentially, although the exllama dynamic generator should be able to process incoming continuous batches
Reproduction steps
Run the latest TabbyAPI server (with default config)
Then, run this script (an illustrative stand-in is sketched after these steps):
You will notice that all requests run sequentially (the script is fine, it runs concurrently with vLLM)
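The original reproduction script is not included above; the following is a hypothetical reconstruction of a client that fires 10 chat requests concurrently at a local TabbyAPI instance via its OpenAI-compatible endpoint. The base URL, port, API key, model name, and prompt are assumptions; adjust them to your setup.

```python
# Illustrative reconstruction only -- not the original script from this issue.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="http://127.0.0.1:5000/v1",   # assumed TabbyAPI address/port
    api_key="YOUR_TABBY_API_KEY",           # assumed; use your server's key
)

async def one_request(i: int) -> str:
    resp = await client.chat.completions.create(
        model="loaded-model",               # TabbyAPI serves whichever model is loaded
        messages=[{"role": "user", "content": f"Write a short haiku #{i}."}],
        stream=False,
        max_tokens=256,                     # cap generation so each slot reserves little cache
    )
    return resp.choices[0].message.content

async def main() -> None:
    # 10 concurrent requests; with a large enough cache_size these should batch.
    results = await asyncio.gather(*(one_request(i) for i in range(10)))
    for r in results:
        print(r)

asyncio.run(main())
```

With the default cache_size this still runs one request at a time; once cache_size is raised above max_seq_len (or max_tokens is capped as shown), the same requests can be batched by the dynamic generator.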
Expected behavior
The API should process all API calls simultaneously.
Logs
No response
Additional context
No response
Acknowledgements