Open hiro-v opened 4 months ago
Currently, tensorrt_llm tries to allocate as much VRAM as possible, split across 3 portions: https://nvidia.github.io/TensorRT-LLM/memory.html
Please add free_gpu_memory_fraction (https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/runtime/model_runner_cpp.py#L169-L172) as a nitro parameter so that we can control it. Of course, the machine still has to have enough VRAM to load the weights, but we can reduce the VRAM used by the other portions.
This would let more people with limited GPU VRAM use tensorrt_llm.
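To illustrate what the parameter would control, here is a rough sketch of the arithmetic, not the actual TensorRT-LLM or nitro code. It assumes the fraction is applied to whatever VRAM is still free after the weights and activation buffers are allocated, and that the default is 0.9 as described in the memory doc linked above. The function name and the example sizes are made up for illustration.

```python
GIB = 1024 ** 3


def kv_cache_budget_bytes(total_vram, weights, activations,
                          free_gpu_memory_fraction=0.9):
    """Estimate the bytes the KV cache would be allowed to take.

    Hypothetical helper: mirrors the idea of free_gpu_memory_fraction,
    it does not call into TensorRT-LLM.
    """
    if not 0.0 < free_gpu_memory_fraction <= 1.0:
        raise ValueError("free_gpu_memory_fraction must be in (0, 1]")

    # Weights and activations are not controlled by the fraction;
    # they have to fit no matter what.
    free_after_load = total_vram - weights - activations
    if free_after_load <= 0:
        raise MemoryError("not enough VRAM for weights and activations")

    # Only the remaining free memory is scaled by the fraction.
    return int(free_after_load * free_gpu_memory_fraction)


# Example: 8 GiB card, 4 GiB of weights, 1 GiB of activations (made-up sizes).
budget = kv_cache_budget_bytes(8 * GIB, 4 * GIB, 1 * GIB,
                               free_gpu_memory_fraction=0.5)
print(budget / GIB)  # 1.5
```

With a lower fraction the KV cache takes less of the leftover VRAM, which is the knob this issue asks nitro to expose. It does not help if the weights alone do not fit.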