octoml / mlc-llm

Enable everyone to develop, optimize and deploy AI models natively on everyone's devices.
https://mlc.ai/mlc-llm

[Bug] Nondeterministic Batch Formation #208

Open sunggg opened 7 months ago

sunggg commented 7 months ago

The default run of serve/tests/test_engine.py first adds 4 requests and then starts the engine. I expected this to form a single prefill batch of 4 requests. However, the behavior is nondeterministic: sometimes it forms two prefill batches of 2 requests each, and sometimes two prefill batches of 1 and 3 requests. The debug log does not provide any useful information.

sunggg commented 7 months ago

@elvin-n will take a look.

masahi commented 7 months ago

I think this is inherent to async request add / batch creation in staging engine. Not sure if we can make it deterministic @yelite

yelite commented 7 months ago

> I think this is inherent to async request add / batch creation in staging engine. Not sure if we can make it deterministic @yelite

But that test adds requests one by one in a for loop, and the engine pushes requests to queues before they reach the worker batch. I don't see anything obvious in that code path that would create nondeterministic behavior.

masahi commented 7 months ago

By "async" I meant that the request arrives at the worker via AddRequestsCommand, and I think that's done asynchronously with worker.step()?

yelite commented 7 months ago

> By "async" I meant that the request arrives at the worker via AddRequestsCommand, and I think that's done asynchronously with worker.step()?

Yes, that's async and very likely the cause of the nondeterministic batches here. I don't have a good way to fix this without impacting performance.
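To illustrate the race, here is a toy model in plain Python, with a thread and a queue standing in for AddRequestsCommand and worker.step(); none of this is the actual engine code:

```python
import queue
import threading

# A producer thread submits requests to a queue (standing in for
# AddRequestsCommand) while the worker loop drains whatever has
# arrived at each step. How the 4 requests split into batches
# depends on thread scheduling, so it varies from run to run.

request_queue: "queue.Queue[int]" = queue.Queue()
batches: list[list[int]] = []

def producer() -> None:
    for i in range(4):
        request_queue.put(i)  # async add, racing with worker steps

def worker_step() -> None:
    batch = []
    while True:
        try:
            batch.append(request_queue.get_nowait())
        except queue.Empty:
            break
    if batch:
        batches.append(batch)  # one "prefill batch"

t = threading.Thread(target=producer)
t.start()
while sum(len(b) for b in batches) < 4:
    worker_step()
t.join()

print([len(b) for b in batches])  # e.g. [4], [2, 2], or [1, 3]
```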

In #193 I plan to remove the sync engine but add some flags to the staging engine to run things in a more synchronous way, so we can get more deterministic behavior in unit tests when necessary.
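Roughly, the idea is something like this toy (the sync_for_tests flag is invented here for illustration, not the actual option in #193): waiting for all adds to land before the worker's first step makes batch formation deterministic.

```python
import queue
import threading

def run(sync_for_tests: bool) -> list[int]:
    q: "queue.Queue[int]" = queue.Queue()
    batches: list[list[int]] = []
    t = threading.Thread(target=lambda: [q.put(i) for i in range(4)])
    t.start()
    if sync_for_tests:
        t.join()  # barrier: all adds are visible before stepping
    while sum(len(b) for b in batches) < 4:
        batch = []
        while not q.empty():
            batch.append(q.get_nowait())
        if batch:
            batches.append(batch)
    t.join()
    return [len(b) for b in batches]

print(run(sync_for_tests=True))   # always [4]
print(run(sync_for_tests=False))  # may be [4], [2, 2], [1, 3], ...
```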

elvin-n commented 7 months ago

There are three use cases:

  1. Explicit creation of the SynchronousInferenceEngine and direct use of its add/step methods. This has completely deterministic behaviour; I have not seen serve/tests/test_engine.py process a number of requests other than 4 this way.
  2. Use of AsyncEngineConnector. This introduces an async point, so even with the SynchronousInferenceEngine the behaviour cannot be deterministic.
  3. Use of the StagingInferenceEngine in any mode, which adds one more async point for submitting requests to the second process.

The async points in 2 and 3 were added by design; they help in real situations and cannot be removed. The only predictable flow is the sync engine with explicit calls to add/step, as in the sketch below.
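For reference, a sketch of use case 1 with a stand-in class (not the real SynchronousInferenceEngine API): add() and step() run in one thread, so all 4 requests are visible before the first step and always form a single prefill batch.

```python
class ToySyncEngine:
    def __init__(self) -> None:
        self.pending: list[str] = []
        self.batches: list[list[str]] = []

    def add(self, request: str) -> None:
        self.pending.append(request)

    def step(self) -> None:
        # Everything added so far forms one deterministic batch.
        if self.pending:
            self.batches.append(self.pending)
            self.pending = []

engine = ToySyncEngine()
for i in range(4):
    engine.add(f"req-{i}")
engine.step()
print([len(b) for b in engine.batches])  # always [4]
```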