unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Is it possible for Unsloth to support naive model parallelism? #1305

Open Songjw133 opened 3 days ago

Songjw133 commented 3 days ago

I have two GPUs with 24 GB of VRAM each. By manually configuring the device_map, I can enable naive model parallelism and fine-tune a 4-bit quantized 72B model with QLoRA on a dataset of short texts.

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'

import torch
from transformers import AutoModelForCausalLM

# Split the model across the two GPUs: embeddings and the first 41 decoder
# layers on GPU 0; the remaining layers, final norm, lm_head and rotary
# embeddings on GPU 1.
device_map = {}
device_map['model.embed_tokens'] = 0
for layer_idx in range(41):
    device_map[f'model.layers.{layer_idx}'] = 0
for layer_idx in range(41, 80):
    device_map[f'model.layers.{layer_idx}'] = 1
device_map['lm_head.weight'] = 1
device_map['model.norm.weight'] = 1
device_map['model.rotary_emb'] = 1
model = AutoModelForCausalLM.from_pretrained('./Qwen2-72B-Instruct-bnb-4bit',
                                             trust_remote_code=True,
                                             torch_dtype=torch.bfloat16,
                                             device_map=device_map)

However, as soon as the texts get slightly longer, I run into OOM errors.
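For reference, the training side is roughly the sketch below, on top of the model loaded above. The LoRA settings, the use of trl's SFTTrainer, and the dataset variable are illustrative placeholders rather than my exact script; even with gradient checkpointing and a batch size of 1, longer sequences exceed the 2x24 GB budget.

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig

# Gradient checkpointing trades compute for memory on the 4-bit base model.
model.config.use_cache = False
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

# Illustrative LoRA hyperparameters, not the exact ones from this issue.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'],
    task_type='CAUSAL_LM',
)
model = get_peft_model(model, lora_config)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,              # placeholder: a dataset of short texts
    args=SFTConfig(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        max_seq_length=1024,            # longer sequences trigger the OOM described above
        bf16=True,
        output_dir='outputs',
    ),
)
trainer.train()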

I tried using Unsloth, but it currently doesn't support multi-GPU setups. It would be great if Unsloth could support naive model parallelism!
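To make the request concrete, a call along these lines would cover my use case. The device_map argument here is purely hypothetical: FastLanguageModel.from_pretrained does not accept it today, which is exactly the gap this issue is about.

from unsloth import FastLanguageModel

# Hypothetical usage: device_map is NOT a real Unsloth parameter today.
# The idea is to reuse the manual layer split from the snippet above so a
# 4-bit 72B model can be fine-tuned across two 24 GB GPUs.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name='./Qwen2-72B-Instruct-bnb-4bit',
    max_seq_length=2048,
    load_in_4bit=True,
    device_map=device_map,   # hypothetical argument, reusing the map defined earlier
)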