Can you show the server-side stack trace?
I did; it only shows "Internal Server Error". As I mentioned, there's literally nothing else in the server trace except that, no stack trace or anything.
To better debug the issue, can you use guided decoding in offline inference via the LLM.chat method? That should show the full stack trace.
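For reference, a minimal sketch of such an offline reproduction, assuming a vLLM build that exposes GuidedDecodingParams in vllm.sampling_params (recent releases do; the exact offline guided-decoding API may differ in the version used here) and an illustrative JSON schema and prompt:

from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(
    model="/root/models/mistralai/Pixtral-12B-2409",
    tokenizer_mode="mistral",
    guided_decoding_backend="outlines",
)

# Toy schema used only for illustration.
schema = {"type": "object", "properties": {"city": {"type": "string"}}}

params = SamplingParams(
    max_tokens=64,
    guided_decoding=GuidedDecodingParams(json=schema),
)

# Any failure here should surface a full Python stack trace instead of a
# bare "Internal Server Error" from the API server.
outputs = llm.chat(
    [{"role": "user", "content": "Return a JSON object with a city name."}],
    sampling_params=params,
)
print(outputs[0].outputs[0].text)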
I have the same issue. In the environment I have access to, the following just hangs :( and the API server throws an internal server error.
from vllm import LLM

llm = LLM(
    model="/root/models/mistralai/Pixtral-12B-2409",
    tokenizer_mode="mistral",
    served_model_name="mistralai/Pixtral-12B-2409",
    max_model_len=5 * 4096,
    guided_decoding_backend="outlines",
    limit_mm_per_prompt={"image": 5},
    tensor_parallel_size=4,
)
I don't think guided decoding with the outlines backend officially supports the Mistral tokenizer (we still need to double-check this), and I don't think it's really vLLM's responsibility to make sure they work with each other if they don't. However, if they are indeed incompatible, then we should disable guided decoding when the Mistral tokenizer is present.
Perhaps @patrickvonplaten you might have some thoughts on this?
For now, can we raise a NotImplementedError with an error message that asks for a contribution if people are interested in this feature?
Yea, I think that's a good idea and something rather straightforward to do!
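A hypothetical sketch of what such a guard could look like (the function and argument names here are illustrative, not vLLM's actual internals):

def check_guided_decoding_support(guided_decoding_backend: str,
                                  tokenizer_mode: str) -> None:
    # Hypothetical validation helper; names are illustrative, not vLLM code.
    if tokenizer_mode == "mistral" and guided_decoding_backend == "outlines":
        raise NotImplementedError(
            "Guided decoding with the 'outlines' backend is not supported "
            "with the Mistral tokenizer yet. Contributions to add support "
            "are welcome!")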
The latest code of lm-format-enforcer should now be compatible with the MistralTokenizer. There is no release yet, but installing the library from main should do the trick:
pip install git+https://github.com/noamgat/lm-format-enforcer.git --force-reinstall
@stikkireddy your code should run now if you switch the guided_decoding_backend to lm-format-enforcer:
from vllm import LLM

llm = LLM(
    model="/root/models/mistralai/Pixtral-12B-2409",
    tokenizer_mode="mistral",
    served_model_name="mistralai/Pixtral-12B-2409",
    max_model_len=5 * 4096,
    guided_decoding_backend="lm-format-enforcer",
    limit_mm_per_prompt={"image": 5},
    tensor_parallel_size=4,
)
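To then exercise guided decoding, a request along the following lines could be added on top of the llm above (a sketch only; it assumes this vLLM version exposes GuidedDecodingParams, and the prompt and choices are illustrative):

from vllm import SamplingParams
from vllm.sampling_params import GuidedDecodingParams

# Constrain the output to a fixed set of choices (illustrative values).
params = SamplingParams(
    max_tokens=16,
    guided_decoding=GuidedDecodingParams(choice=["cat", "dog", "other"]),
)

outputs = llm.chat(
    [{"role": "user", "content": "Is the animal in the image a cat, a dog, or other?"}],
    sampling_params=params,
)
print(outputs[0].outputs[0].text)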
Your current environment
All seems to work when sending an image query, but as soon as I try any simple guided_json or guided_choice request, it always fails.
The request just gives an "Internal Server Error" response, and vLLM shows nothing else in its server log (no stack trace).
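For reference, a simple guided_choice request against the vLLM OpenAI-compatible server generally takes this shape (a sketch using the official openai client; the base URL, model name, prompt, and choices are illustrative, not the exact request from this report):

from openai import OpenAI

# Illustrative values; adjust the base URL and model name to your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="mistralai/Pixtral-12B-2409",
    messages=[{"role": "user", "content": "Answer with 'cat' or 'dog'."}],
    # vLLM-specific extension passed through the extra body of the request.
    extra_body={"guided_choice": ["cat", "dog"]},
)
print(completion.choices[0].message.content)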
Model Input Dumps
No response
🐛 Describe the bug
See above.