sfbemerk opened 3 months ago
I also encounter this error with Yi-34B.
cc @andoorve
How about using a single node with pipeline-parallel size 2 (so that it uses the multiprocessing backend)? Does it still have this issue?
I just tested it, but the bug appears there as well:
Setup C) one docker container with 2 GPUs (pipeline-parallel, no Ray)
docker run ... vllm/vllm-openai:v0.5.4 --model "meta-llama/Meta-Llama-3.1-70B-Instruct" --tensor-parallel-size 1 --pipeline-parallel-size 2 --max-model-len=4096
Again, the incomplete response:
```json
{
  "id": "chat-28622b0c3d074caead37fb5441ca2407",
  "object": "chat.completion",
  "created": 1722964614,
  "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "",
        "tool_calls": [
          {
            "id": "chatcmpl-tool-9dfa147ec16245d7a8efd67ddfc6012f",
            "type": "function",
            "function": {
              "name": "get_current_weather",
              "arguments": "{\"location"
            }
          }
        ]
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 251,
    "total_tokens": 254,
    "completion_tokens": 3
  }
}
```
I solved this: you only need to add "min_tokens": **. I set it to 50 based on my completion length.
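For reference, a minimal sketch of that workaround against vLLM's OpenAI-compatible server, assuming the standard Python openai client; min_tokens is a vLLM-specific sampling parameter, so it is passed through extra_body, and the base URL, model name, and tool schema below are only placeholders:

```python
# Hedged sketch: forcing a minimum completion length on vLLM's
# OpenAI-compatible server. `min_tokens` is vLLM-specific, so it is passed
# via `extra_body`; endpoint, model, and tool schema are illustrative only.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    }],
    extra_body={"min_tokens": 50},  # the workaround described above
)
print(response.choices[0].message.tool_calls)
```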
I did some investigation; the problem is quite deep. In the case of guided decoding we are using stateful logits processors. However, with PP, the logits processors get cloned when they are sent to workers other than the driver worker. This is fine for the TP use case, since there the logits processor is not cloned and its state lives as long as the sequence group does. It is an issue beyond PP, however, and will affect anything SPMD. For those cases, one solution might be to have the logits processor live on the worker. Pinging @njhill, as this is quite similar to the seed issue and he might have some suggestions on the best way to resolve it.
MP GPU Executor, [<vllm.model_executor.guided_decoding.outlines_logits_processors.JSONLogitsProcessor object at 0x7f57d577eaf0>]
Rank: 0, [<vllm.model_executor.guided_decoding.outlines_logits_processors.JSONLogitsProcessor object at 0x7f57d577eaf0>]
(VllmWorkerProcess pid=2199329) Rank: 1, [<vllm.model_executor.guided_decoding.outlines_logits_processors.JSONLogitsProcessor object at 0x7f5811ac5ac0>]
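To illustrate the failure mode (this is not vLLM code; all names are hypothetical): once a stateful processor is deep-copied to another rank, only the driver's copy advances its internal state, so the ranks drift apart.

```python
# Toy illustration of why cloning a *stateful* logits processor to another
# worker breaks guided decoding: each process ends up with an independent
# copy whose FSM state no longer tracks the tokens actually sampled.
import copy


class ToyGuidedLogitsProcessor:
    """Hypothetical stand-in for a JSONLogitsProcessor-style object."""

    def __init__(self):
        self.fsm_state = 0  # advanced once per generated token

    def __call__(self, generated_token_ids, logits):
        # A real processor would mask `logits` based on self.fsm_state here.
        self.fsm_state += 1
        return logits


driver_proc = ToyGuidedLogitsProcessor()
worker_proc = copy.deepcopy(driver_proc)  # what effectively happens on rank 1

for token_id in [101, 102, 103]:
    driver_proc([token_id], logits=None)  # only the driver's copy is updated

print(driver_proc.fsm_state)  # 3
print(worker_proc.fsm_state)  # 0 -> the ranks disagree on the decoding state
```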
I think we need to overhaul the logits processor part. It should not be part of the sampling parameters.
The sampling parameters should just store the constraint, like a regex or JSON schema, and the constraint-decoding state (i.e. the FSM state) should live in each sequence. The FSM itself, with masks for every state, should be retrieved from some cache that maps from regex to FSM.
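A rough sketch of what that split could look like, assuming the constraint is a plain hashable value and the compiled FSM is cached; all names here are hypothetical, not vLLM APIs:

```python
# Hypothetical sketch of the proposed split: sampling params carry only the
# constraint, the compiled FSM is looked up in a shared cache keyed by that
# constraint, and the mutable FSM state lives on each sequence.
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class GuidedConstraint:
    """Stateless description stored in the sampling parameters."""
    kind: str   # e.g. "regex" or "json_schema"
    spec: str   # the regex / schema string itself


@lru_cache(maxsize=128)
def compile_fsm(constraint: GuidedConstraint):
    """Expensive compilation (constraint -> FSM with per-state masks), cached."""
    return {"states": {0: "..."}, "constraint": constraint}  # placeholder FSM


@dataclass
class SequenceGuidedState:
    """Mutable decoding state owned by each sequence."""
    constraint: GuidedConstraint
    fsm_state: int = 0

    def allowed_token_mask(self):
        fsm = compile_fsm(self.constraint)        # shared, cached object
        return fsm["states"].get(self.fsm_state)  # per-sequence lookup

    def advance(self, token_id: int):
        self.fsm_state += 1  # placeholder transition
```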
Yes exactly
> sampling parameter should just store the constraint, like regex or json schema.

Yup, agreed.

> and the constraint decoding state (i.e. FSM state) should live in each sequence. The FSM itself, with masks for every state, should be retrieved from some cache that maps from regex to FSM.

We can keep the FSM as part of the logits processor, but map from sequence to FSM in the worker, similar to the torch Generator. Ideally this is done in a clean and future-proof way.
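A hedged sketch of that alternative: a worker-owned registry mapping sequence id to its processor, in the same spirit as the per-request torch.Generator handling on the worker; the names are hypothetical.

```python
# Hypothetical worker-side registry: each worker instantiates and owns the
# stateful processor for a sequence, so nothing stateful has to be shipped
# from the driver.
from typing import Callable, Dict


class WorkerGuidedDecodingRegistry:
    def __init__(self):
        # seq_id -> stateful logits processor built *on this worker*
        self._processors: Dict[int, object] = {}

    def get_or_create(self, seq_id: int, factory: Callable[[], object]):
        if seq_id not in self._processors:
            self._processors[seq_id] = factory()
        return self._processors[seq_id]

    def free(self, seq_id: int):
        # called when the sequence group finishes, like generator cleanup
        self._processors.pop(seq_id, None)
```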
@rkooo567 You mention support for guided decoding in https://github.com/vllm-project/vllm/pull/7109. Is this something you're looking into?
> @rkooo567 You mention support for guided decoding in https://github.com/vllm-project/vllm/pull/7109. Is this something you're looking into?
Not exactly, but to make it work with SPMD, I believe we need exactly the same solution, so I may look into it soon (https://github.com/vllm-project/vllm/pull/7109).
Also, I totally agree with @youkaichao. I think that's the direction we should go (and what we wanted to achieve with https://github.com/vllm-project/vllm/issues/5423).
Yes @rkooo567, I think a lot of the work for SPMD applies basically directly to PP.
There is already a PR to make the stateful logits processor shareable:
https://github.com/vllm-project/vllm/pull/5329
The idea is as described above: pass around logits-processor factories instead of instances, and then each Sequence (which lives on each worker) gets its own instance of the logits processor.
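A minimal sketch of that factory idea, with hypothetical names (the actual interfaces in #5329 may differ): only a cheap, picklable factory travels with the request, and each Sequence builds its own stateful processor locally.

```python
# Hypothetical sketch of the factory approach: the factory is stateless and
# safe to broadcast to every worker; the stateful processor is created
# per Sequence, wherever that Sequence lives.
from dataclasses import dataclass


class ToyStatefulProcessor:
    def __init__(self, json_schema: str):
        self.json_schema = json_schema
        self.fsm_state = 0  # mutable state, private to one Sequence

    def __call__(self, generated_token_ids, logits):
        self.fsm_state += 1
        return logits


@dataclass(frozen=True)
class JSONLogitsProcessorFactory:
    """Picklable description; safe to send to every worker."""
    json_schema: str

    def build(self) -> ToyStatefulProcessor:
        # In real code this would construct an outlines-based processor.
        return ToyStatefulProcessor(self.json_schema)


class Sequence:
    """Each Sequence (living on a worker) instantiates its own processor."""

    def __init__(self, seq_id: int, factory: JSONLogitsProcessorFactory):
        self.seq_id = seq_id
        self.logits_processor = factory.build()
```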
@maxdebayser are you going to continue your work in https://github.com/vllm-project/vllm/pull/5329?
To make https://github.com/vllm-project/vllm/pull/5329 work with https://github.com/vllm-project/vllm/pull/7109, I think the SequenceDataDelta will leave SequenceData.logit_processors untouched, and the logit_processor will obtain the state from SequenceData as usual once it has been updated with SequenceDataDelta. Meaning to say, probably minimal changes are required.
> I think we need to overhaul logits processor part. it should not be a part of sampling parameter.
> sampling parameter should just store the constraint, like regex or json schema. and the constraint decoding state (i.e. FSM state) should live in each sequence. The FSM itself, with masks for every state, should be retrieved from some cache that maps from regex to FSM.
Absolutely, this has been planned for some time: a larger overhaul of how logits processors work is planned (e.g. https://github.com/vllm-project/vllm/issues/5423), but it is stalled a bit given how it needs to fit with other changes. I feel we can make some incremental improvements in the meantime, potentially starting with #5329 that @jon-chuang referenced.
Re needing the state to be in the worker for PP etc., I know it's also required for SPMD, but IMO it should just be part of making the batches stateful (same for the torch.Generators).
@jon-chuang, I've been focusing on other issues in the past couple of weeks, but if this can be an incremental step toward the broader refactoring @njhill mentioned, I'll sync this PR with main again.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
Your current environment
vllm v0.5.4
Setup A) single docker container with vllm, no pipeline-parallelism
Setup B) two docker containers with ray + vllm (pipeline parallelism)
The issue does not depend on the model; e.g., it also appears with meta-llama/Meta-Llama-3-70B-Instruct instead of Llama-3.1.

🐛 Describe the bug
Without pipeline-parallelism, a request with tool calling gets a correct response containing a valid tool call. With pipeline-parallelism, the same request gets an incomplete tool call (just a few tokens long, but still status 200).
Example request:
Response for setup A (no pipeline-parallelism)
The response for setup B (pipeline-parallelism with ray)