
[Feature]: Allow head_size smaller than 128 on TPU with Pallas backend #10343

Open · manninglucas opened this issue 1 week ago

manninglucas commented 1 week ago

🚀 The feature, motivation and pitch

I would like to serve smaller models (e.g. facebook/opt-125m) with vLLM on TPU. I currently can't, because the Pallas backend rejects them with NotImplementedError: Head size must be a multiple of 128. I can't find a reason why this limitation is in place, and it would be great to remove it, either behind a flag or entirely. If my understanding is incorrect and there is a reason for the limitation, please let me know! Thanks for your work on vLLM.
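As a possible interim workaround, here is a minimal sketch (not part of vLLM; the helper name pad_head_dim and the tensor shapes are assumptions for illustration) of zero-padding the head dimension of Q, K, and V up to the next multiple of 128 before calling the kernel and slicing the output back afterwards:

```python
import jax.numpy as jnp


def pad_head_dim(x, multiple=128):
    """Zero-pad the last (head_size) dimension of x up to the next multiple.

    Hypothetical workaround sketch, not vLLM code: a kernel that requires
    head_size % 128 == 0 could be fed padded tensors, with the extra output
    columns sliced off afterwards.
    """
    head_size = x.shape[-1]
    pad = -head_size % multiple  # amount of zero padding needed
    if pad == 0:
        return x
    pad_width = [(0, 0)] * (x.ndim - 1) + [(0, pad)]
    return jnp.pad(x, pad_width)


# Example: facebook/opt-125m uses head_size 64, which would pad to 128.
q = jnp.zeros((1, 12, 16, 64))  # (batch, num_heads, seq_len, head_size)
q_padded = pad_head_dim(q)
assert q_padded.shape[-1] == 128
```

Zero-padding Q and K leaves the QK^T scores unchanged (the padded dimensions contribute zero to the dot products), and zero-padding V only adds zero columns to the attention output, so slicing the result back to the original head_size recovers the unpadded attention exactly, at the cost of extra compute and memory.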

Alternatives

No response

Additional context

No response


manninglucas commented 1 week ago

Here is a code pointer to the limitation, FWIW:

https://github.com/vllm-project/vllm/blob/main/vllm/attention/backends/pallas.py#L112
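For reference, the guard at that location is roughly of the following shape (a paraphrase based on the error message quoted above, not a verbatim copy of the vLLM source):

```python
# Paraphrased sketch of the check in vllm/attention/backends/pallas.py.
if head_size % 128 != 0:
    raise NotImplementedError("Head size must be a multiple of 128.")
```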