[Performance]: Splitting model across GPUs with varying vRAM

Proposal to improve performance

I have 8 GPUs across two nodes (4 and 4): four 3090s on one node, and three 3090s plus a 3080 on the other. 3090s have 24 GB of VRAM while 3080s only have 12 GB. Thus, when loading a large model such as Llama 3 70B, which is split evenly so that each GPU holds ~16 GB, I get an OOM error. A slightly smaller model works as an example too, as long as its even split comes to more than ~12 GB per GPU.

I have found a few ways to navigate this, and thought it would be interesting to bring up and see if it could/would ever be implemented in the future.

Using `accelerate`

Accelerate has a utility called `get_balanced_memory`, which computes a `max_memory` dictionary for `infer_auto_device_map` when loading a model for inference. This automatically works out how to split the model across multiple GPUs with varying amounts of VRAM. The dictionary can also be set manually.

Manually getting GPU memory

import torch
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_properties(i).total_memory)

This list can then be used to set a custom `device_map`, or as input to the following alternative to accelerate.
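As a rough sketch of how the per-GPU totals could feed a manual memory budget: the helper below turns a list of total bytes per GPU into a dictionary keyed by GPU index. The name `build_max_memory` and the `headroom_fraction` parameter are hypothetical, and the totals are made-up values rather than measurements from my machines.

```python
GIB = 1024 ** 3


def build_max_memory(total_bytes_per_gpu, headroom_fraction=0.1):
    """Map GPU index -> usable bytes, leaving a fraction of each device free.

    Hypothetical helper: the headroom is an assumed safety margin for
    activations and other runtime allocations, not a measured value.
    """
    if not 0 <= headroom_fraction < 1:
        raise ValueError("headroom_fraction must be in [0, 1)")
    return {
        index: int(total * (1 - headroom_fraction))
        for index, total in enumerate(total_bytes_per_gpu)
    }


# Made-up totals: three 24 GiB cards and one 12 GiB card.
totals = [24 * GIB, 24 * GIB, 24 * GIB, 12 * GIB]
max_memory = build_max_memory(totals, headroom_fraction=0.25)
for index, usable in max_memory.items():
    print(f"cuda:{index}: {usable / GIB:.1f} GiB usable")
```

In a real run, `totals` would come from the `torch.cuda.get_device_properties(i).total_memory` loop above, and the resulting dictionary could be passed as the `max_memory` argument of `infer_auto_device_map`.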

Using `torch`

You can count the layers in a model and use the GPU memory list like so (this is pretty rough, just an MVP):

import torch
import torch.nn as nn
from transformers import AutoConfig, AutoTokenizer
from transformers.models.mistral.modeling_mistral import MistralForCausalLM

# Step 1: Calculate the memory of each GPU
def get_gpu_memory():
    gpu_memory = []
    num_gpus = torch.cuda.device_count()
    for i in range(num_gpus):
        props = torch.cuda.get_device_properties(i)
        gpu_memory.append(props.total_memory)
    return gpu_memory

# Step 2: Manually split the model layers across GPUs
class DistributedModel(nn.Module):
    def __init__(self, model, gpu_memory):
        super(DistributedModel, self).__init__()
        self.gpu_layers = nn.ModuleList()
        total_memory = sum(gpu_memory)
        proportions = [mem / total_memory for mem in gpu_memory]

        layers = list(model.model.layers.children())
        num_layers = len(layers)
        layers_per_gpu = [int(p * num_layers) for p in proportions]

        # Adjust to make sure the total layers assigned equals num_layers
        diff = num_layers - sum(layers_per_gpu)
        for i in range(diff):
            layers_per_gpu[i % len(layers_per_gpu)] += 1

        # Allocate layers to GPUs
        current_layer = 0
        for i, num in enumerate(layers_per_gpu):
            device = torch.device(f'cuda:{i}')
            gpu_layers = layers[current_layer:current_layer + num]
            self.gpu_layers.append(nn.Sequential(*gpu_layers).to(device))
            current_layer += num

        self.embedding = model.model.embed_tokens.to(torch.device('cuda:0'))
        self.ln_f = model.model.norm.to(torch.device(f'cuda:{len(gpu_memory) - 1}'))
        self.head = model.lm_head.to(torch.device(f'cuda:{len(gpu_memory) - 1}'))

    def forward(self, input_ids):
        x = self.embedding(input_ids)
        for i, layers in enumerate(self.gpu_layers):
            device = torch.device(f'cuda:{i}')
            x = x.to(device)
            for layer in layers:
                x = layer(x)
                if isinstance(x, tuple):
                    x = x[0]  # Ensure we are working with tensors, not tuples
        x = self.ln_f(x.to(torch.device(f'cuda:{len(self.gpu_layers) - 1}')))
        logits = self.head(x)
        return logits

# Load configuration and tokenizer
model_name = "unsloth/Phi-3-mini-4k-instruct"
config = AutoConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

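# Initialize the model from the config alone (random weights, no checkpoint loaded).
# Note: this builds the Mistral architecture from a Phi-3 config, so the output is not meaningful text.
model = MistralForCausalLM(config)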

# Get GPU memory and split the model
gpu_memory = get_gpu_memory()
distributed_model = DistributedModel(model, gpu_memory)

# Inputs go to the first GPU, where the embedding layer lives
device = torch.device('cuda:0')

# Run inference
input_text = "Your input text here"
inputs = tokenizer(input_text, return_tensors="pt")
input_ids = inputs['input_ids'].to(device)

with torch.no_grad():
    outputs = distributed_model(input_ids)

# Decode the output
decoded_output = tokenizer.decode(outputs[0].argmax(dim=-1).tolist(), skip_special_tokens=True)
print(decoded_output)
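To make the layer split above easier to check in isolation, here is the same proportional allocation pulled out into a plain function that needs no GPUs. The name `split_layers` is hypothetical, and the example assumes 80 decoder layers (the Llama 3 70B layer count) and memory given in whole GB. It only balances layer counts; it does not account for the embedding, the LM head, or KV cache.

```python
def split_layers(num_layers, gpu_memory):
    """Assign layer counts to GPUs in proportion to their memory.

    Mirrors the MVP above: floor each GPU's proportional share, then hand
    the leftover layers out one at a time starting from GPU 0.
    gpu_memory is a list of integers in any consistent unit (bytes, GB, ...).
    """
    total = sum(gpu_memory)
    if num_layers < 0 or total <= 0:
        raise ValueError("need num_layers >= 0 and a positive total memory")

    # Integer arithmetic avoids float rounding surprises.
    counts = [mem * num_layers // total for mem in gpu_memory]

    # Distribute whatever the flooring left unassigned.
    leftover = num_layers - sum(counts)
    for i in range(leftover):
        counts[i % len(counts)] += 1
    return counts


# Seven 24 GB cards and one 12 GB card, 80 layers.
layout = split_layers(80, [24] * 7 + [12])
print(layout)  # [11, 11, 11, 11, 11, 10, 10, 5]
```

With this layout the 12 GB card gets 5 layers instead of an even share of 10, which is the effect I am after. A fairer variant might hand leftover layers to the GPUs with the largest fractional remainder rather than always starting at GPU 0.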

I was curious whether something like this could be implemented in the future, especially given vLLM's awesome ability to scale up from a single GPU.

Report of performance regression

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

No response
