run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: how to get started with multi GPU #12834

Closed OKC13 closed 5 months ago

OKC13 commented 5 months ago

Question Validation

Question

My GPU has 24 GB of memory, so is it possible to spread the model across 2 GPUs?

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 0 has a total capacity of 23.50 GiB of which 7.19 MiB is free. Including non-PyTorch memory, this process has 23.48 GiB memory in use. Of the allocated memory 23.26 GiB is allocated by PyTorch, and 1.81 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

dosubot[bot] commented 5 months ago

To effectively utilize multiple GPUs and work within your memory constraints, you can use Tensor Parallelism (TP) with NVIDIA's TensorRT-LLM library. Here's a concise guide to get you started:

  1. Configure Tensor Parallelism: In your model's configuration file, set the tensor_parallel parameter within builder_config to enable TP. This parameter dictates how the model's layers are distributed across GPUs.

  2. Set TP Size: Adjust the tp_size parameter to the number of GPUs you wish to use. For instance, if you're aiming to use 2 GPUs, set tp_size to 2. This distributes the model's computation across the GPUs, helping to alleviate memory issues.

  3. Ensure Runtime Configuration Matches: The runtime configuration must reflect the correct world_size, which equals tp_size (as Pipeline Parallelism isn't supported, pp_size is 1). This ensures the model is distributed correctly.

  4. Model Loading and Execution: The system automatically assigns the CUDA device based on the runtime rank and the number of GPUs per node. This ensures each model segment runs on its designated GPU.

  5. Be Aware of Limitations: While TP can reduce the memory load per GPU, it introduces communication overhead. The effectiveness of TP will vary based on the model architecture, size, and GPU interconnect bandwidth.

  6. Check Dependencies: Make sure all necessary libraries for distributed computing and NVIDIA TensorRT LLM are installed, including MPI libraries and NVIDIA drivers or SDKs.

By adjusting tp_size to match your available GPUs and ensuring your environment is correctly set up for TP, you can distribute the model across multiple GPUs to mitigate out-of-memory errors. For implementation details, refer to llama-index-legacy/llama_index/legacy/llms/nvidia_tensorrt.py in the LlamaIndex repository.
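As a rough illustration of steps 1-3, here is a minimal sketch of what the builder_config section might look like. This is an assumption based on the parameter names described above (tensor_parallel, tp_size, pp_size); the exact schema and key names depend on your TensorRT-LLM version, so treat the keys as illustrative rather than definitive.

```python
# Hypothetical sketch of a builder_config fragment for 2-way tensor parallelism.
# Key names follow the description above and are not taken from a specific
# TensorRT-LLM release; check your version's config schema before using them.
import json

config = {
    "builder_config": {
        "tensor_parallel": 2,  # step 1: enable TP and shard layers across GPUs
        "tp_size": 2,          # step 2: number of GPUs to split the model over
        "pp_size": 1,          # step 3: pipeline parallelism unsupported, keep at 1
    }
}

# Step 3: the runtime world_size must equal tp_size * pp_size (here 2 * 1 = 2).
world_size = config["builder_config"]["tp_size"] * config["builder_config"]["pp_size"]

print(json.dumps(config, indent=2))
print("expected world_size:", world_size)
```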

logan-markewich commented 5 months ago

@OKC13 It's not really possible right now, or at least not with llama-index alone.

You'd have to load the model and tokenizer outside of llama-index using HuggingFace and PyTorch so that they are multi-GPU (they have docs on this), and then pass them in:

llm = HuggingFaceLLM(model=model, tokenizer=tokenizer, ...)
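For reference, here is a minimal sketch of that approach, assuming the transformers `device_map="auto"` path (which requires the `accelerate` package) to shard the model across both GPUs. The model name is a placeholder, and the HuggingFaceLLM import path may differ depending on your llama-index version.

```python
# Sketch: load a model sharded across both GPUs with HuggingFace/PyTorch,
# then hand the pre-loaded objects to llama-index.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from llama_index.llms.huggingface import HuggingFaceLLM  # import path may vary by version

model_name = "meta-llama/Llama-2-13b-chat-hf"  # placeholder; use your own model

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",          # let accelerate split layers across GPU 0 and GPU 1
    torch_dtype=torch.float16,  # half precision roughly halves memory vs. float32
)

# Pass the already-sharded model and tokenizer into llama-index.
llm = HuggingFaceLLM(model=model, tokenizer=tokenizer)
```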