vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: why can't the `max_model_len` be greater than `max_position_embeddings` for llama2? #4346

Open sleepwalker2017 opened 6 months ago

sleepwalker2017 commented 6 months ago

Your current environment

When vLLM runs the prefill stage, it batches multiple requests together.

Is the total size of that prefill batch limited by `max_position_embeddings`?

I don't think it is, because each sequence has its own start index; we only need to ensure that each individual sequence is shorter than `max_position_embeddings`. Is that correct?
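For illustration, here is a minimal sketch (not vLLM's actual implementation) of how position ids are assigned during a batched prefill: positions restart at 0 for every sequence, so the batch can hold many more tokens than `max_position_embeddings` as long as no single sequence exceeds it. The `max_model_len` cap, by contrast, is a per-sequence limit, since RoPE positions beyond `max_position_embeddings` fall outside what the model was trained on.

```python
# Hedged sketch, not vLLM internals: shows that batched prefill position ids
# depend only on each sequence's own length, never on the batch total.
import torch

def build_prefill_positions(seq_lens: list[int]) -> torch.Tensor:
    """Concatenate per-sequence position ids for a batched prefill."""
    return torch.cat([torch.arange(n) for n in seq_lens])

# Three prompts of lengths 5, 3, and 4 prefilled together: 12 tokens total,
# but the largest position id is only 4, far below max_position_embeddings.
print(build_prefill_positions([5, 3, 4]))
# tensor([0, 1, 2, 3, 4, 0, 1, 2, 0, 1, 2, 3])
```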

How would you like to use vllm

I want to run inference of a [specific model](put link here). I don't know how to integrate it with vllm.

Playerrrrr commented 6 months ago

Can't we have something like automatic RoPE scaling, as in alpindale's Aphrodite Engine? @WoosukKwon
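As far as I know, vLLM does not rescale RoPE automatically; you have to opt in. A hedged sketch, assuming your vLLM build exposes the `rope_scaling` engine argument (check `vllm serve --help` or the EngineArgs docs for your version); if it doesn't, the same dict can be placed in the model's `config.json` under `"rope_scaling"`:

```python
# Hedged sketch: explicitly enabling RoPE scaling to run Llama-2 past its
# native 4096-token context. The rope_scaling dict follows the Hugging Face
# config format; the key may be "type" or "rope_type" depending on your
# transformers version.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    max_model_len=8192,                                # beyond the native 4096
    rope_scaling={"type": "dynamic", "factor": 2.0},   # dynamic NTK scaling
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(out[0].outputs[0].text)
```

Without such an override, vLLM derives the context limit from the Hugging Face config, which is why `max_model_len` cannot exceed `max_position_embeddings` by default.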

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!