mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Question] Why does the server ignore my request under speculative decoding? #2852

Closed Erxl closed 1 week ago

Erxl commented 3 weeks ago

❓ General Questions

main model: mistral-large-instruct-2407-q4f16_1
draft model: Mistral-7B-Instruct-v0.3-q4f16_1-MLC

I cannot use speculative decoding on my AMD GPU server. The server is running, but it never responds to any chat request, and there is no error output. Nothing like `INFO: 192.168.1.4:34425 - "POST /v1/chat/completions HTTP/1.1" 200 OK` is printed. I have already updated ROCm to 6.2 and installed the latest pre-built mlc-llm Python package.
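
For reference, here is roughly the kind of request that hangs (a minimal sketch; the default host/port of `mlc_llm serve` and the exact payload are assumptions, not copied from my setup):

```python
# Minimal reproduction sketch: POST an OpenAI-style chat request to the
# running server. Assumes `mlc_llm serve` is listening on 127.0.0.1:8000.
import requests

payload = {
    "model": "mistral-large-instruct-2407-q4f16_1",
    "messages": [{"role": "user", "content": "Hello"}],
}

# With speculative decoding enabled, this call never returns, and the
# server prints no log line for the request.
resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json=payload,
    timeout=120,
)
print(resp.json())
```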

MasterJH5574 commented 2 weeks ago

> but there is no response to any chat requests

Hi @Erxl, do you mind providing a bit more context? In particular, it would be helpful if you could share some example code for which you see no response. Also, does the server work well when you don't use speculative decoding?

Erxl commented 1 week ago

@MasterJH5574 I have solved this problem: using server mode instead of the default local mode fixes it.
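
For anyone who hits the same issue, this is roughly what the working setup looks like through the Python API (a sketch; the `mode` argument and the `EngineConfig` fields shown are assumptions to verify against the current mlc_llm documentation):

```python
# Sketch of the fix: run the engine in "server" mode rather than the
# default "local" mode, with the draft model attached for speculative
# decoding. Argument and field names are assumptions; check the current
# mlc_llm API docs before relying on them.
from mlc_llm import MLCEngine
from mlc_llm.serve import EngineConfig

engine = MLCEngine(
    model="mistral-large-instruct-2407-q4f16_1",
    mode="server",  # the key change: "server" instead of the default "local"
    engine_config=EngineConfig(
        additional_models=["Mistral-7B-Instruct-v0.3-q4f16_1-MLC"],
        speculative_mode="small_draft",  # draft-model speculative decoding
    ),
)

response = engine.chat.completions.create(
    model="mistral-large-instruct-2407-q4f16_1",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
engine.terminate()
```

The equivalent change on the CLI should be the serve mode option (e.g. `--mode server` for `mlc_llm serve`), assuming the flag name matches the current release.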

MasterJH5574 commented 1 week ago

@Erxl Thanks for the update. Glad that it works out. Yes, local mode uses a limited max batch size setting, so speculative decoding cannot be enabled very effectively there.
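
To make the batch-size point concrete (a sketch; the `max_num_sequence` field and the values here are assumptions to check against `mlc_llm.serve.EngineConfig`):

```python
# Hypothetical illustration of why "local" mode limits speculative
# decoding: the mode preset picks a small max batch size, while "server"
# mode sizes it for throughput. An explicit config could instead request
# a larger batch. Field name and values are assumptions, not confirmed.
from mlc_llm.serve import EngineConfig

config = EngineConfig(
    max_num_sequence=32,  # larger than the small default "local" would pick
    additional_models=["Mistral-7B-Instruct-v0.3-q4f16_1-MLC"],
    speculative_mode="small_draft",
)
```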