janhq / cortex.tensorrt-llm

Cortex.Tensorrt-LLM is a C++ inference library that can be loaded by any server at runtime. It includes NVIDIA’s TensorRT-LLM as a submodule for GPU-accelerated inference on NVIDIA GPUs.
https://cortex.jan.ai/docs/cortex-tensorrt-llm
Apache License 2.0
40 stars 2 forks

feat: Utilize `free_gpu_memory_fraction` to control max VRAM consumption #25

Closed hiro-v closed 3 months ago

hiro-v commented 8 months ago

Currently, tensorrt_llm tries to allocate as much VRAM as possible. Its VRAM consumption is made up of 3 portions: https://nvidia.github.io/TensorRT-LLM/memory.html

Please add `free_gpu_memory_fraction` (https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/runtime/model_runner_cpp.py#L169-L172) as a nitro parameter so that we can control it. Of course, the machine still has to have enough VRAM to load the weights, but we can reduce the VRAM used by the other portions.

This would let more people with limited GPU VRAM use tensorrt_llm.
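
To illustrate the request, here is a minimal sketch of how such a fraction could bound the non-weight portion of VRAM. It assumes that `free_gpu_memory_fraction` is applied to the memory that is still free after the weights are loaded, and that the default is 0.9; both are assumptions about TensorRT-LLM's behavior and are not confirmed in this thread. The helper `kv_cache_budget_bytes` and the sizes used are hypothetical and do not call TensorRT-LLM.

```python
GIB = 1024 ** 3


def kv_cache_budget_bytes(
    total_vram: int,
    used_after_load: int,
    free_gpu_memory_fraction: float = 0.9,  # assumed default, not confirmed here
) -> int:
    """Return the bytes the runtime would be allowed to claim from free VRAM.

    Hypothetical helper: models the fraction as a share of the VRAM that is
    still free once the weights are resident.
    """
    if not 0.0 < free_gpu_memory_fraction <= 1.0:
        raise ValueError("free_gpu_memory_fraction must be in (0, 1]")
    free = total_vram - used_after_load
    if free <= 0:
        raise ValueError("not enough VRAM to load the weights")
    return int(free * free_gpu_memory_fraction)


if __name__ == "__main__":
    total = 24 * GIB   # illustrative card size
    used = 14 * GIB    # illustrative footprint after loading weights
    for fraction in (0.9, 0.5, 0.2):
        budget = kv_cache_budget_bytes(total, used, fraction)
        print(f"fraction={fraction:.1f} -> budget {budget / GIB:.1f} GiB")
```

Under these assumptions, lowering the fraction leaves the weight footprint unchanged and only shrinks what the runtime reserves on top of it, which is the trade-off the request describes.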

tikikun commented 8 months ago
image

Relevant document

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.