janhq / cortex.tensorrt-llm

Cortex.Tensorrt-LLM is a C++ inference library that can be loaded by any server at runtime. It includes NVIDIA's TensorRT-LLM as a submodule for GPU-accelerated inference on NVIDIA GPUs.
https://cortex.jan.ai/docs/cortex-tensorrt-llm
Apache License 2.0

feat: TensorRT-LLM Inflight batching #29

Open tikikun opened 7 months ago

tikikun commented 7 months ago

Relevant docs can be found here:

https://nvidia.github.io/TensorRT-LLM/batch_manager.html#get-and-send-callbacks

In-flight batching is currently one of the most beneficial features for LLM inference on CUDA systems; it enables very high throughput.
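
For reference, the linked "get and send callbacks" section describes the host-side contract: the batch manager pulls new requests via a get callback each generation step and returns tokens via a send callback, interleaving context and generation phases across requests. Below is a minimal sketch of wiring those callbacks into `GptManager`, assuming the batch manager API as documented around this TensorRT-LLM release; header paths, enum values, and exact signatures vary across versions, and `pollRequestQueue` is a hypothetical server-side queue, not part of the library.

```cpp
#include <list>
#include <memory>
#include <string>

#include "tensorrt_llm/batch_manager/GptManager.h"
#include "tensorrt_llm/batch_manager/inferenceRequest.h"
#include "tensorrt_llm/batch_manager/namedTensor.h"

using namespace tensorrt_llm::batch_manager;

// Hypothetical host-side queue: how requests arrive (HTTP, gRPC, ...) is up
// to the serving layer, not the batch manager.
std::list<std::shared_ptr<InferenceRequest>> pollRequestQueue(int32_t maxNumRequests);

int main()
{
    // Get callback: invoked by GptManager each generation step to admit new
    // work into the in-flight batch; an empty list means no new requests.
    GetInferenceRequestsCallback getRequests =
        [](int32_t maxNumRequests) { return pollRequestQueue(maxNumRequests); };

    // Send callback: invoked with output tensors for a request id as tokens
    // become available; the host forwards them to the waiting client.
    SendResponseCallback sendResponse =
        [](uint64_t requestId, std::list<NamedTensor> const& tensors,
           bool isFinal, std::string const& errMsg)
    {
        // ... serialize tensors and reply to the client owning requestId ...
    };

    // InflightBatching lets new requests join as soon as slots free up,
    // instead of waiting for the whole batch to finish.
    GptManager manager(
        "/path/to/engine_dir",              // compiled TensorRT engine
        TrtGptModelType::InflightBatching,
        /*maxBeamWidth=*/1,
        batch_scheduler::SchedulerPolicy::MAX_UTILIZATION,
        getRequests, sendResponse);

    // GptManager drives its own loop on a background thread; the host
    // process just stays alive.
    manager.waitUntilTerminate();
    return 0;
}
```

In this shape, integrating the feature into Cortex would mostly be a matter of backing the get callback with the server's request queue and streaming the send callback's tensors back out.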

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.