octoml / mlc-llm

Enable everyone to develop, optimize and deploy AI models natively on everyone's devices.
https://mlc.ai/mlc-llm
Apache License 2.0

[Bug] Recovering logic of a long evicted request is broken #163

Open masahi opened 8 months ago

masahi commented 8 months ago

https://github.com/octoml/mlc-llm/blob/batch-serving/serve/mlc_serve/engine/engine_common.py#L385-L399

For the streaming case, we cannot clamp the generated tokens and recompute them. Moreover, since the clamping is done in the worker but not in the main process, a discrepancy arises between the main process and the worker. See https://github.com/octoml/mlc-llm/pull/158 and https://github.com/octoml/mlc-llm/issues/164.
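
To make the problem concrete, here is a rough sketch of the recovery path linked above. The names (`restore_evicted_request`, `state.prompt_token_ids`, `gen_seq.generated_token_ids`) are assumptions for illustration, not the actual `engine_common.py` code:

```python
# Illustrative sketch only; approximates the recovery logic referenced above.
# Names and signatures are assumed, not taken from engine_common.py.

def restore_evicted_request(state, gen_seq, max_num_batched_tokens):
    # Tokens that must be re-prefilled: prompt plus everything generated so far.
    token_ids = state.prompt_token_ids + gen_seq.generated_token_ids

    if len(token_ids) > max_num_batched_tokens:
        # Current behavior: clamp the tail and recompute it later.
        # Problem 1: for a streaming request the clamped tokens have already
        # been sent to the client, so recomputing them may yield different tokens.
        # Problem 2: if this clamping happens only in the worker, the main
        # process still tracks the longer sequence, so the two processes
        # disagree about the request's length.
        num_clamped = len(token_ids) - max_num_batched_tokens
        gen_seq.generated_token_ids = gen_seq.generated_token_ids[:-num_clamped]
```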

We need to either

@elvin-n @sunggg

masahi commented 7 months ago

@elvin-n After https://github.com/octoml/mlc-llm/pull/157 lands, you can follow a similar strategy: use multiple EvalMultiQueryRequest instances to split the restoration of a long request into several batches, each of which fits within max_num_batched_tokens. A sketch of that chunking idea follows.
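
A minimal sketch, assuming the evicted request's full token list is available and that each chunk can be wrapped in its own EvalMultiQueryRequest; the helper `make_eval_multi_query_request` below is hypothetical and the real constructor may differ:

```python
# Sketch of splitting a long restore into several batches, each fitting
# within max_num_batched_tokens. Helper names here are hypothetical.

def split_restore_batches(token_ids, max_num_batched_tokens):
    """Chunk the tokens of a long evicted request so that each chunk can be
    re-evaluated with its own EvalMultiQueryRequest-style request."""
    return [
        token_ids[i : i + max_num_batched_tokens]
        for i in range(0, len(token_ids), max_num_batched_tokens)
    ]

# Usage: restore the KV cache chunk by chunk instead of in one oversized batch.
# chunks = split_restore_batches(prompt_ids + generated_ids, max_num_batched_tokens)
# for chunk in chunks:
#     requests.append(make_eval_multi_query_request(seq_id, chunk))  # hypothetical
```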