sunggg opened 10 months ago
How is the user-side cancellation triggered? When I tried Ctrl-C on a running curl command, I could see the cancellation getting processed.
script:

```sh
payload='{
"model": "llama-2",
"messages": [
{
"role": "user",
"content": "Hello! what is the answer to life, the universe, and everything? give me a long answer"
}
],
"max_tokens": 1000,
"stream": true,
"temperature": 1.0,
"top_p": 1,
"presence_penalty": 0,
"frequency_penalty": 0
}'
echo "======="
echo "Request"
echo "======="
echo "$payload" | jq
echo "========"
echo "Response"
echo "========"
curl -s -X 'POST' \
'http://127.0.0.1:8000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-H "Authorization: Bearer abc" \
-d "$payload"
```

log:

```
2024-01-31 20:58:40 [info ] StagingInferenceEngine.add [mlc_serve.engine.staging_engine] func_name=add lineno=106 pathname=/opt/dlami/nvme/liteye/mlc-llm/serve/mlc_serve/engine/staging_engine.py process=2803754 requests=[Request(request_id='cmpl-71e9e27ce9f842108e3e820b1b6d63c8', messages=[ChatMessage(role='user', content='Hello! what is the answer to life, the universe, and everything? give me a long answer')], num_sequences=1, best_of=1, sampling_params=SamplingParams(presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, logit_bias=None, appeared_tokens_freq={}, logit_bias_index=None, logit_bias_value=None, logprobs=False, top_logprobs=0), stopping_criteria=StoppingCriteria(max_tokens=1000, stop_sequences=[]), debug_options=DebugOptions(ignore_eos=False, prompt=None, prompt_token_ids=None), validate_tokens=None, contextvars={})]
2024-01-31 20:58:40 [info ] AsyncEngineConnector.generate iterator cancelled. [mlc_serve.engine.async_connector] func_name=generate lineno=90 pathname=/opt/dlami/nvme/liteye/mlc-llm/serve/mlc_serve/engine/async_connector.py process=2803754 request_id=cmpl-71e9e27ce9f842108e3e820b1b6d63c8
2024-01-31 20:58:40 [info ] StagingInferenceEngine.cancel [mlc_serve.engine.staging_engine] func_name=cancel lineno=133 pathname=/opt/dlami/nvme/liteye/mlc-llm/serve/mlc_serve/engine/staging_engine.py process=2803754 request_id=cmpl-71e9e27ce9f842108e3e820b1b6d63c8
2024-01-31 20:58:40 [info ] AsyncEngineConnector.generate request sucessfully cancelled. [mlc_serve.engine.async_connector] func_name=generate lineno=93 pathname=/opt/dlami/nvme/liteye/mlc-llm/serve/mlc_serve/engine/async_connector.py process=2803754 request_id=cmpl-71e9e27ce9f842108e3e820b1b6d63c8
2024-01-31 20:58:40 [info ] AsyncEngineConnector.generate removing request from result queue. [mlc_serve.engine.async_connector] func_name=generate lineno=98 pathname=/opt/dlami/nvme/liteye/mlc-llm/serve/mlc_serve/engine/async_connector.py process=2803754 request_id=cmpl-71e9e27ce9f842108e3e820b1b6d63c8
```
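If you want a reproduction that doesn't depend on timing a manual Ctrl-C, you can drop the connection programmatically. Here is a minimal sketch (assuming the `requests` package; the endpoint, token, and model name are taken from the script above) that aborts the stream after a few chunks, which is effectively what Ctrl-C on curl does:

```python
# Programmatic reproduction sketch: abort the streaming response mid-flight.
import requests

payload = {
    "model": "llama-2",
    "messages": [{"role": "user", "content": "give me a long answer"}],
    "max_tokens": 1000,
    "stream": True,
}

with requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    headers={"Authorization": "Bearer abc"},
    json=payload,
    stream=True,
) as resp:
    for i, line in enumerate(resp.iter_lines()):
        print(line)
        if i >= 5:
            break  # exiting the `with` closes the socket -> server sees a disconnect
```

When the socket closes mid-stream, the server should cancel the handler task, which is what drives AsyncEngineConnector.generate into the cancellation path shown in the log.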
Hmm, interesting. That is pretty much what I did. I was printing all the token_ids and saw that it kept printing new tokens even after cancellation. Is it possible that the request is cancelled correctly but somehow keeps printing from a buffer?
> Is it possible that the request is cancelled correctly but somehow keeps printing from a buffer?
No, it's not. If the request were cancelled correctly, it shouldn't be able to print new tokens.
Can you show me your steps to trigger the problem? Then I can try to reproduce it on my side.
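For reference, here is a minimal asyncio sketch (not the actual mlc-serve code) of why a correctly cancelled stream cannot keep producing tokens: the cancellation is raised inside the generator's `await`, so the generator frame is closed rather than left running with a buffered backlog:

```python
import asyncio

async def generate(request_id: str):
    """Stand-in for AsyncEngineConnector.generate: yields tokens until cancelled."""
    try:
        for i in range(1000):
            await asyncio.sleep(0.1)  # stands in for awaiting the result queue
            yield f"token-{i}"
    finally:
        # Cleanup runs as the CancelledError unwinds the generator; in mlc-serve
        # this is roughly where StagingInferenceEngine.cancel would be invoked.
        print(f"cancel({request_id})")

async def consume():
    async for tok in generate("cmpl-demo"):
        print(tok)

async def main():
    task = asyncio.create_task(consume())
    await asyncio.sleep(0.35)  # let a few tokens stream out
    task.cancel()              # models the client disconnecting
    try:
        await task
    except asyncio.CancelledError:
        pass
    # No more tokens can appear now: the generator frame is closed, so there
    # is no buffer left that could keep "printing" after cancellation.

asyncio.run(main())
```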
User-side cancellation does not take effect. We also need to log properly when a request has been cancelled.
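For the logging part, a minimal sketch of the kind of explicit cancellation log this asks for, assuming the structlog-style logging visible in the output above (the class shell and the `_do_cancel` helper are placeholders, not the real internals):

```python
import structlog

logger = structlog.get_logger("mlc_serve.engine.staging_engine")

class StagingInferenceEngine:  # placeholder shell, not the real class
    def _do_cancel(self, request_id: str) -> bool:
        # Hypothetical stand-in for forwarding the cancel to the worker process.
        return True

    def cancel(self, request_id: str) -> None:
        logger.info("cancel requested", request_id=request_id)
        ok = self._do_cancel(request_id)
        # Log the outcome explicitly so a cancelled (or failed-to-cancel)
        # request is unambiguous when reading the logs.
        logger.info("cancel finished", request_id=request_id, cancelled=ok)
```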