Closed: OKC13 closed this 5 months ago
To effectively utilize multiple GPUs and manage memory constraints, you can implement Tensor Parallelism (TP) using NVIDIA's TensorRT LLM library. Here's a concise guide to get you started:
1. **Configure Tensor Parallelism:** In your model's configuration, set the `tensor_parallel` parameter within `builder_config` to enable TP. This parameter dictates how the model's layers are distributed across GPUs.
2. **Set TP Size:** Set the `tp_size` parameter to the number of GPUs you wish to use. For instance, if you're aiming to use 2 GPUs, set `tp_size` to 2. This distributes the model's computation across the GPUs, helping to alleviate memory pressure.
3. **Ensure the Runtime Configuration Matches:** The runtime configuration must reflect the correct `world_size`, which equals `tp_size` (Pipeline Parallelism isn't supported here, so `pp_size` is 1). This ensures the model is distributed correctly.
4. **Model Loading and Execution:** The runtime automatically assigns the CUDA device based on the runtime rank and the number of GPUs per node, so each model shard runs on its designated GPU.
5. **Be Aware of Limitations:** While TP reduces the memory load per GPU, it introduces communication overhead. Its effectiveness varies with model architecture, model size, and GPU interconnect bandwidth.
6. **Check Dependencies:** Make sure all libraries needed for distributed execution are installed, including MPI libraries and the NVIDIA drivers/SDKs required by TensorRT LLM.

By setting `tp_size` to match your available GPUs and ensuring your environment is correctly set up for TP, you can distribute the model across multiple GPUs to mitigate out-of-memory errors. For detailed implementation, refer to llama-index-legacy/llama_index/legacy/llms/nvidia_tensorrt.py in the LlamaIndex repository.
@OKC13 It's not really possible right now, or at least not with llama-index alone.
You'd have to load the model and tokenizer outside of llama-index, using Hugging Face and PyTorch, such that they are spread across multiple GPUs (they have docs on this), and then pass them in:
llm = HuggingFaceLLM(model=model, tokenizer=tokenizer, ...)
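Spelled out, that approach looks roughly like the following. `device_map="auto"` is Hugging Face's built-in way (via `accelerate`) to shard a model's layers across all visible GPUs; the import path assumes the `llama-index-llms-huggingface` package, and the model name is only an example.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from llama_index.llms.huggingface import HuggingFaceLLM

model_name = "meta-llama/Llama-2-7b-chat-hf"  # example model, swap in your own

# device_map="auto" lets accelerate split layers across every visible GPU,
# so a model too large for one 24 GiB card can span two of them.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Pass the already-loaded, multi-GPU model straight to llama-index.
llm = HuggingFaceLLM(model=model, tokenizer=tokenizer)
response = llm.complete("Hello")
```

Because llama-index only receives the already-placed `model` object, it never has to know about the GPU topology; all device placement happens in the `from_pretrained` call.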
Question Validation
Question
My GPU has 24 GB of memory, so is it possible to spread the model across 2 GPUs?
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 0 has a total capacity of 23.50 GiB of which 7.19 MiB is free. Including non-PyTorch memory, this process has 23.48 GiB memory in use. Of the allocated memory 23.26 GiB is allocated by PyTorch, and 1.81 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
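As the traceback itself suggests, fragmentation can be mitigated with PyTorch's expandable-segments allocator; the snippet below shows how to set it (the script name is a placeholder). Note, though, that in this particular traceback only about 1.81 MiB is reserved but unallocated, so fragmentation is not the problem here: the model simply doesn't fit on one 24 GiB card, and multi-GPU loading is the real fix.

```shell
# Opt into the expandable-segments CUDA allocator to reduce fragmentation.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
python your_script.py  # placeholder for your actual entry point
```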