neuralmagic / deepsparse

Sparsity-aware deep learning inference runtime for CPUs
https://neuralmagic.com/deepsparse/

[ContinuousBatching] ContinuousBatchingScheduler Implementation #1375

Closed bfineran closed 10 months ago

bfineran commented 10 months ago

Uses the utils in #1373 and #1374 to implement a scheduler for continuous batching. The ContinuousBatchingScheduler tracks various EngineOperators and manages their input queues with ContinuousBatchingQueues. The scheduler also runs multiple ContinuousBatchingExecutorThreads in parallel. These threads consume the queues, run the multi-batch engine, and resolve the corresponding futures returned by the scheduler's submit.
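As a rough illustration of the pattern described above (a submit call that returns a future, a shared input queue, and worker threads that group pending requests into batches), here is a minimal standard-library sketch. It is not the deepsparse implementation: `ToyBatchingScheduler`, `batch_fn`, `max_batch_size`, and `num_workers` are hypothetical names, and the sketch handles a single operator with one queue rather than per-operator queues.

```python
import queue
import threading
from concurrent.futures import Future


class ToyBatchingScheduler:
    """Illustrative only; not the deepsparse ContinuousBatchingScheduler."""

    def __init__(self, batch_fn, max_batch_size=4, num_workers=2):
        # batch_fn (hypothetical) maps a list of inputs to a list of outputs
        self._batch_fn = batch_fn
        self._max_batch_size = max_batch_size
        self._queue = queue.Queue()
        self._workers = [
            threading.Thread(target=self._work, daemon=True)
            for _ in range(num_workers)
        ]
        for worker in self._workers:
            worker.start()

    def submit(self, value):
        # Enqueue the input and hand the caller a future for its result
        future = Future()
        self._queue.put((value, future))
        return future

    def shutdown(self):
        # One sentinel per worker, then wait for all of them to exit
        for _ in self._workers:
            self._queue.put(None)
        for worker in self._workers:
            worker.join()

    def _work(self):
        while True:
            item = self._queue.get()  # block until at least one request
            if item is None:
                return
            batch = [item]
            # Opportunistically add whatever else is already waiting
            while len(batch) < self._max_batch_size:
                try:
                    extra = self._queue.get_nowait()
                except queue.Empty:
                    break
                if extra is None:
                    self._queue.put(None)  # keep the sentinel for later
                    break
                batch.append(extra)
            inputs = [value for value, _ in batch]
            try:
                outputs = self._batch_fn(inputs)
            except Exception as exc:
                for _, future in batch:
                    future.set_exception(exc)
            else:
                # Route each output back to the future of its own request
                for (_, future), output in zip(batch, outputs):
                    future.set_result(output)


if __name__ == "__main__":
    scheduler = ToyBatchingScheduler(batch_fn=lambda xs: [x * 2 for x in xs])
    futures = [scheduler.submit(i) for i in range(10)]
    print([f.result(timeout=5) for f in futures])
    scheduler.shutdown()
```

How many requests end up in one batch depends on timing and load, which is why the test plan calls for sufficient load to exercise multi-batch execution.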

next steps include:

test_plan: a simple single-execution unit test is included. Further tests should cover multiple engines, operators, and batch sizes with sufficient load to trigger multi-batch execution. Note that unit tests for multi-batch are handled with the helpers.

dsikka commented 10 months ago

One more question: in terms of next steps, you had written down "integration with KV Cache engine mode for text gen". Any reason this can't work with the NLEngineOperator as it currently is? The NLEngineOperator inherits from the EngineOperator.

@bfineran

bfineran commented 10 months ago

yeah a few things here:

  1. the schemas need to implement the split/join since they don't inherit
  2. I think we'll need to update the way the engine kwargs are passed so the shared create_engine function sets the right internal/external kv cache mode
  3. the run function needs to get updated to accept an engine to be swapped out like we do in EngineOperator
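Points 1 and 3 can be pictured with a small sketch. This is only a guess at the shape of the change, not code from the PR: `ToyInput`, its `rows` field, `ToyOperator`, and the `engine` keyword are hypothetical, and the real split/join signatures in deepsparse are not shown in this thread.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ToyInput:
    """Hypothetical schema holding a batch of rows."""

    rows: List[List[float]]

    def split(self) -> List["ToyInput"]:
        # One single-row input per batch element
        return [ToyInput(rows=[row]) for row in self.rows]

    @classmethod
    def join(cls, parts: List["ToyInput"]) -> "ToyInput":
        # Inverse of split: concatenate the rows in order
        return cls(rows=[row for part in parts for row in part.rows])


class ToyOperator:
    """Hypothetical operator whose run() accepts a replacement engine."""

    def __init__(self, engine: Callable[[List[List[float]]], List[List[float]]]):
        self.engine = engine

    def run(self, inp: ToyInput, engine: Optional[Callable] = None) -> ToyInput:
        # Use the engine passed in by a scheduler if given, else the default
        active = engine if engine is not None else self.engine
        return ToyInput(rows=active(inp.rows))
```

With this shape, a scheduler could split each request, batch the pieces across requests, call `run` with its own multi-batch engine, and join the outputs back per request.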