huggingface / peft

🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
https://huggingface.co/docs/peft
Apache License 2.0

Merge LoRA into 405B #2065

Closed junzhang-zj closed 2 weeks ago

junzhang-zj commented 2 months ago

Env: 8x A100-80G GPUs, transformers 4.43.3, torch 2.4.0+cu121

Goal: Get a merged INT8 405B

Path 1: 1) Load INT8 405B & BF16 LoRA --> Merge --> Save

Path 2: 1) Load BF16 405B & BF16 LoRA --> Merge --> Save 2) Load INT8 merged 405B --> Save

When I tried to merge the LoRA into LLaMA-3.1 405B loaded with load_in_8bit=True, I got an error inside bitsandbytes; when I instead tried to do the merge in BF16 on CPU, the process was suddenly killed.
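For reference, step 2 of Path 2 could look roughly like the sketch below once the merged BF16 checkpoint from step 1 exists on disk. The paths are placeholders, and it assumes the installed transformers/bitsandbytes versions support saving bnb 8-bit weights.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

merged_bf16_path = "merged-405b-bf16"  # placeholder: output of Path 2, step 1
int8_output_path = "merged-405b-int8"  # placeholder

# Re-load the merged BF16 checkpoint directly in 8-bit via bitsandbytes.
model = AutoModelForCausalLM.from_pretrained(
    merged_bf16_path,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# Saving bnb 8-bit weights requires a recent bitsandbytes release.
model.save_pretrained(int8_output_path)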

BenjaminBossan commented 2 months ago

Could you please provide more information: What code did you use to merge? What is the full error message that you got for both tests you did?

junzhang-zj commented 2 months ago

Thanks for your prompt reply!

For case 2, no error message was printed; the process was simply killed.

Guessed reason: the process was killed because system memory was exhausted. Sometimes it was killed during the merge, sometimes during the save, and only part of the weights were saved.
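One way to confirm the OOM hypothesis is to log resident memory from a background thread while merging and saving; a minimal sketch, assuming psutil is installed:

import threading, time
import psutil

def log_rss(interval=10):
    proc = psutil.Process()
    while True:
        rss_gib = proc.memory_info().rss / 1024**3
        print(f"[memory] RSS: {rss_gib:.1f} GiB", flush=True)
        time.sleep(interval)

# Start this before merge_and_unload(); the last value printed before the
# process disappears shows how close it was to the machine's RAM limit.
threading.Thread(target=log_rss, daemon=True).start()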

My test code:

import os

# Hide all GPUs so the whole merge runs on CPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import torch
from peft import PeftModel
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name_or_path = "llama3/Meta-Llama-3.1-405B"
lora_path = ""
output_path = ""

# Load the full BF16 base model into CPU RAM.
base_model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    load_in_8bit=False,
    torch_dtype=torch.bfloat16,
    device_map={"": "cpu"}
)

# Attach the BF16 LoRA adapter on CPU.
lora_model = PeftModel.from_pretrained(
    base_model,
    lora_path,
    torch_dtype=torch.bfloat16,
    device_map={"": "cpu"}
)

# Merge the adapter into the base weights and save the merged checkpoint.
model = lora_model.merge_and_unload(progressbar=True)
model.save_pretrained(output_path)
base_tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
base_tokenizer.save_pretrained(output_path)
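For scale, a back-of-the-envelope estimate of the CPU RAM this path needs just for the base weights (a sketch; the LoRA weights plus merge and save temporaries come on top of this):

n_params = 405e9     # Llama 3.1 405B
bytes_per_param = 2  # bfloat16
print(f"~{n_params * bytes_per_param / 1024**3:.0f} GiB")  # ~754 GiB
# If the machine has less RAM than this plus headroom, the kernel OOM killer
# sends SIGKILL with no Python traceback, which matches the silent kill
# described above.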

junzhang-zj commented 2 months ago

For case 1, I made a device_map allocation in advance.

Return: Error an illegal memory access was encountered at line 524 in file /mmfs1/gscratch/zlab/timdettmers/git/bitsandbytes/csrc/ops.cu

Test Script:

import torch
from collections import OrderedDict
from peft import PeftModel
from transformers import AutoTokenizer, AutoModelForCausalLM

num_gpus = 8
num_layers = 126  # Llama 3.1 405B has 126 decoder layers

# Hand-built device_map that spreads the decoder layers across 8 GPUs, with
# fewer layers on GPU 0 to leave room for the embeddings and lm_head.
device_map = OrderedDict()

# Specify how many layers GPU 0 should handle
layers_for_gpu0 = num_layers // num_gpus // 2
remaining_layers = num_layers - layers_for_gpu0
layers_per_remaining_gpu = remaining_layers // (num_gpus - 1)
extra_layers = remaining_layers % (num_gpus - 1)

# Assign layers to GPU 0
for layer in range(layers_for_gpu0):
    device_map[f'model.layers.{layer}'] = 0

# Distribute remaining layers across the other 7 GPUs
for i in range(1, num_gpus):
    start_layer = layers_for_gpu0 + (i - 1) * layers_per_remaining_gpu + min(i - 1, extra_layers)
    end_layer = layers_for_gpu0 + i * layers_per_remaining_gpu + min(i, extra_layers)
    for layer in range(start_layer, end_layer):
        device_map[f'model.layers.{layer}'] = i

        # Ensure LoRA modules for each layer are on the same GPU
        # (in Llama models the attention projections live under 'self_attn.')
        for lora_module in ['self_attn.q_proj', 'self_attn.k_proj', 'self_attn.v_proj', 'self_attn.o_proj', 'mlp.gate_proj', 'mlp.up_proj', 'mlp.down_proj']:
            device_map[f'model.layers.{layer}.{lora_module}.lora_A'] = i
            device_map[f'model.layers.{layer}.{lora_module}.lora_B'] = i
            device_map[f'model.layers.{layer}.{lora_module}.lora_embedding_A'] = i
            device_map[f'model.layers.{layer}.{lora_module}.lora_embedding_B'] = i

device_map['model.embed_tokens'] = 0
device_map['model.norm'] = 0
device_map['lm_head'] = 0
device_map['model.lm_head.lora_A'] = 0
device_map['model.lm_head.lora_B'] = 0
device_map['model.lm_head.lora_embedding_A'] = 0
device_map['model.lm_head.lora_embedding_B'] = 0

model_name_or_path = "llama3/Meta-Llama-3.1-405B"
lora_path = ""
output_path = ""

base_model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    load_in_8bit=True,
    torch_dtype=torch.bfloat16,
    device_map=device_map
)

lora_model = PeftModel.from_pretrained(
    base_model,
    lora_path,
    torch_dtype=torch.bfloat16,
    device_map=device_map
)

model = lora_model.merge_and_unload(progressbar=True)
model.save_pretrained(output_path) 
base_tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
base_tokenizer.save_pretrained(output_path)
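As an alternative to hand-building the map, accelerate can also compute a balanced placement from per-GPU memory caps. A sketch (the 70GiB budget per A100-80G is an assumption, not a tested value):

base_model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    load_in_8bit=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    # Leave headroom on every GPU for the temporaries created while merging.
    max_memory={i: "70GiB" for i in range(8)},
)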

BenjaminBossan commented 2 months ago

Thanks for the additional information. Could you please paste the full error message (stack trace) for case 1? Also, if you monitor memory, do you observe that there might not be enough memory for the operation? Merging incurs a slight memory overhead.

Unfortunately, this is not my area of expertise and I don't have a setup to test this. Pinging @matthewdouglas and @Titus-von-Koeller in case they can help with this.
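To answer the memory question, something like the following (a sketch using torch's built-in memory query) can be printed periodically during the merge:

import torch

def report_gpu_memory():
    # Free/total memory per visible GPU as seen by the CUDA driver.
    for i in range(torch.cuda.device_count()):
        free, total = torch.cuda.mem_get_info(i)
        print(f"GPU {i}: {free / 1024**3:.1f} GiB free of {total / 1024**3:.1f} GiB")

report_gpu_memory()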

junzhang-zj commented 2 months ago

@BenjaminBossan Thanks, this is the full error message:


Unloading and merging model:   0%|          | 0/2654 [00:00<?, ?it/s]/home/oppoer/.local/lib/python3.10/site-packages/peft/tuners/lora/bnb.py:83: UserWarning: Merge lora module to 8-bit linear may get different generations due to rounding errors.
  warnings.warn(

Unloading and merging model:   0%|          | 7/2654 [00:02<12:46,  3.45it/s]
Unloading and merging model:   0%|          | 9/2654 [00:02<11:27,  3.85it/s]
Unloading and merging model:   0%|          | 11/2654 [00:02<10:29,  4.20it/s]
Unloading and merging model:   0%|          | 13/2654 [00:04<18:46,  2.34it/s]
Unloading and merging model:   1%|          | 17/2654 [00:10<39:43,  1.11it/s]
Unloading and merging model:   1%|          | 19/2654 [00:15<56:25,  1.28s/it]
Unloading and merging model:   1%|          | 20/2654 [00:19<42:35,  1.03it/s]
Error during merge: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

When passing CUDA_LAUNCH_BLOCKING=1, the output is:

Unloading and merging model:   0%|          | 7/2654 [00:02<14:22,  3.07it/s]
Unloading and merging model:   0%|          | 9/2654 [00:02<12:45,  3.46it/s]
Unloading and merging model:   0%|          | 11/2654 [00:03<11:36,  3.80it/s]
Unloading and merging model:   0%|          | 13/2654 [00:05<21:23,  2.06it/s]
Unloading and merging model:   1%|          | 17/2654 [00:11<42:34,  1.03it/s]
Unloading and merging model:   1%|          | 19/2654 [00:16<1:01:40,  1.40s/it]
Error an illegal memory access was encountered at line 529 in file /mmfs1/gscratch/zlab/timdettmers/git/bitsandbytes/csrc/ops.cu

junzhang-zj commented 2 months ago

Error location: in peft/utils/integrations.py, the failing line is
im, imt, SCim, SCimt, coo_tensorim = bnb.functional.double_quant(im)
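To check whether the crash comes from bitsandbytes itself rather than from PEFT's merge logic, that call can be reproduced in isolation; a sketch, assuming double_quant is available in the installed bitsandbytes and using Llama 3.1 405B's hidden size:

import torch
import bitsandbytes as bnb

# PEFT builds an identity matrix of the layer's input dimension and passes it
# through double_quant while dequantizing the 8-bit weight for the merge.
hidden_size = 16384   # Llama 3.1 405B hidden size
device = "cuda:0"     # placeholder: use the GPU that held the failing layer
im = torch.eye(hidden_size).contiguous().half().to(device)
im_q, imt, SCim, SCimt, coo_tensorim = bnb.functional.double_quant(im)
print(im_q.shape, im_q.dtype)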

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.