tenstorrent / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Cache KV blocks for faster initialization, modify model and cache args to allow for higher seq lens #22

Closed skhorasganiTT closed 1 month ago

skhorasganiTT commented 1 month ago