I have two GPUs with 24 GB of VRAM each. By manually configuring the device_map, I can use naive model parallelism to fine-tune a 72B quantized model with QLoRA on a dataset of short texts.
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'

import torch
from transformers import AutoModelForCausalLM

# Embeddings and the first 41 decoder layers go on GPU 0;
# the remaining 39 layers, final norm, and LM head go on GPU 1.
device_map = {}
device_map['model.embed_tokens'] = 0
for layer_idx in range(41):
    device_map[f'model.layers.{layer_idx}'] = 0
for layer_idx in range(41, 80):
    device_map[f'model.layers.{layer_idx}'] = 1
device_map['lm_head.weight'] = 1
device_map['model.norm.weight'] = 1
device_map['model.rotary_emb'] = 1

model = AutoModelForCausalLM.from_pretrained(
    './Qwen2-72B-Instruct-bnb-4bit',
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map=device_map,
)
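For reference, the QLoRA part on top of this is just the standard PEFT recipe; the rank and target modules below are illustrative, not anything specific to this setup:

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare the 4-bit base model for training (casts norms to fp32, enables input grads).
model = prepare_model_for_kbit_training(model)

# Illustrative LoRA hyperparameters and target modules.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM',
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()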
However, when dealing with slightly longer texts, I encounter OOM issues.
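For reference, the usual single-process memory savers look roughly like this (argument names are the standard transformers ones; the values are illustrative, not my exact configuration):

# Gradient checkpointing trades recompute for activation memory.
model.gradient_checkpointing_enable()
model.config.use_cache = False  # the KV cache is incompatible with checkpointing during training

from transformers import TrainingArguments

# Illustrative values only; output_dir is a placeholder.
training_args = TrainingArguments(
    output_dir='qlora-72b-out',
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    optim='paged_adamw_8bit',  # paged optimizer smooths optimizer-state memory spikes
    bf16=True,
    logging_steps=10,
)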
I tried using Unsloth, but it currently doesn't support multi-GPU setups. It would be great if Unsloth added support for naive model parallelism!