masahi opened this issue 8 months ago:
@elvin-n After https://github.com/octoml/mlc-llm/pull/157 lands, you can follow a similar strategy: use multiple `EvalMultiQueryRequest`s to split the restoration of a long request into several batches, each of which fits into `max_num_batched_tokens`.
https://github.com/octoml/mlc-llm/blob/batch-serving/serve/mlc_serve/engine/engine_common.py#L385-L399
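To make the splitting concrete, here is a minimal sketch of the chunking step. `EvalMultiQueryRequest` and `max_num_batched_tokens` are the names from the PRs above; the helper itself and its exact shape are hypothetical, not the engine's actual code:

```python
from typing import List

def split_restore_batches(
    token_ids: List[int], max_num_batched_tokens: int
) -> List[List[int]]:
    """Split the tokens of a long evicted request into chunks, each of
    which fits into max_num_batched_tokens, so restoration can be spread
    over several EvalMultiQueryRequest batches instead of one big prefill."""
    return [
        token_ids[i : i + max_num_batched_tokens]
        for i in range(0, len(token_ids), max_num_batched_tokens)
    ]

# Hypothetical usage: one EvalMultiQueryRequest per chunk, issued over
# successive engine steps so no single batch exceeds the token budget.
chunks = split_restore_batches(list(range(10_000)), max_num_batched_tokens=4096)
assert all(len(chunk) <= 4096 for chunk in chunks)
```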
For the streaming case, we cannot clamp the generated tokens and recompute them. Moreover, since the clamping logic is done in the worker but not in the main process, a discrepancy arises between the main process and the worker process. See https://github.com/octoml/mlc-llm/pull/158 and https://github.com/octoml/mlc-llm/issues/164.
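To illustrate the discrepancy with a toy stand-in (not the engine's actual bookkeeping): if the worker clamps a request's tokens on its own, the main process still believes the request holds the full sequence, and anything keyed on the token count diverges between the two.

```python
# Toy illustration of the main/worker divergence; not real engine state.
MAX_NUM_BATCHED_TOKENS = 8

generated = list(range(12))
main_view = generated                             # main process keeps all 12 tokens
worker_view = generated[:MAX_NUM_BATCHED_TOKENS]  # worker clamps to 8

# The two processes now disagree on the request length, so logic keyed on
# it (KV cache slots, stopping criteria, streamed offsets) can diverge.
assert len(main_view) != len(worker_view)
```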
We need to either lift the `max_num_batched_tokens` restriction for this case, or restore such requests incrementally using the `evaluate_multi_query` function from https://github.com/octoml/mlc-llm/pull/156 (a sketch of the latter is below).

@elvin-n @sunggg
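A rough sketch of the incremental restore; the `evaluate_multi_query` signature below is assumed for illustration, and https://github.com/octoml/mlc-llm/pull/156 defines the real one:

```python
from typing import List

def restore_in_chunks(
    model, request_id: int, token_ids: List[int], max_num_batched_tokens: int
) -> None:
    """Hypothetical driver loop: feed an evicted request back to the model
    in chunks so each evaluate_multi_query call stays within the budget."""
    for start in range(0, len(token_ids), max_num_batched_tokens):
        chunk = token_ids[start : start + max_num_batched_tokens]
        # Assumed signature; see the PR above for the actual function.
        model.evaluate_multi_query(request_id, chunk)
```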