masahi opened this issue 8 months ago:
@elvin-n After https://github.com/octoml/mlc-llm/pull/157 lands, you can follow a similar strategy: use multiple `EvalMultiQueryRequest`s to split the restoration of a long request into several batches, each of which fits into `max_num_batched_tokens`.
https://github.com/octoml/mlc-llm/blob/batch-serving/serve/mlc_serve/engine/engine_common.py#L385-L399
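To make the splitting concrete, here is a minimal sketch of the chunking step. `EvalMultiQueryRequest` and `max_num_batched_tokens` are the names from the PRs above; the helper itself and its exact shape are hypothetical, not the engine's actual code:

```python
from typing import List

def split_restore_batches(
    token_ids: List[int], max_num_batched_tokens: int
) -> List[List[int]]:
    """Split the tokens of a long evicted request into chunks, each of
    which fits into max_num_batched_tokens, so restoration can be spread
    over several EvalMultiQueryRequest batches instead of one big prefill."""
    return [
        token_ids[i : i + max_num_batched_tokens]
        for i in range(0, len(token_ids), max_num_batched_tokens)
    ]

# Hypothetical usage: one EvalMultiQueryRequest per chunk, issued over
# successive engine steps so no single batch exceeds the token budget.
chunks = split_restore_batches(list(range(10_000)), max_num_batched_tokens=4096)
assert all(len(chunk) <= 4096 for chunk in chunks)
```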
For the streaming case, we cannot clamp the generated tokens and recompute them. Moreover, since the clamping logic is done in the worker but not in the main process, a discrepancy arises between the main process and the worker process. See https://github.com/octoml/mlc-llm/pull/158 and https://github.com/octoml/mlc-llm/issues/164.
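To illustrate the discrepancy with a toy stand-in (not the engine's actual bookkeeping): if the worker clamps a request's tokens on its own, the main process still believes the request holds the full sequence, and anything keyed on the token count diverges between the two.

```python
# Toy illustration of the main/worker divergence; not real engine state.
MAX_NUM_BATCHED_TOKENS = 8

generated = list(range(12))
main_view = generated                             # main process keeps all 12 tokens
worker_view = generated[:MAX_NUM_BATCHED_TOKENS]  # worker clamps to 8

# The two processes now disagree on the request length, so logic keyed on
# it (KV cache slots, stopping criteria, streamed offsets) can diverge.
assert len(main_view) != len(worker_view)
```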
We need to either lift the `max_num_batched_tokens` restriction for this case, or restore such requests incrementally using the `evaluate_multi_query` function from https://github.com/octoml/mlc-llm/pull/156 (a sketch of the latter is below).

@elvin-n @sunggg
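A rough sketch of the incremental restore; the `evaluate_multi_query` signature below is assumed for illustration, and https://github.com/octoml/mlc-llm/pull/156 defines the real one:

```python
from typing import List

def restore_in_chunks(
    model, request_id: int, token_ids: List[int], max_num_batched_tokens: int
) -> None:
    """Hypothetical driver loop: feed an evicted request back to the model
    in chunks so each evaluate_multi_query call stays within the budget."""
    for start in range(0, len(token_ids), max_num_batched_tokens):
        chunk = token_ids[start : start + max_num_batched_tokens]
        # Assumed signature; see the PR above for the actual function.
        model.evaluate_multi_query(request_id, chunk)
```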