huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Model loading is uneven on GPUs with AutoModelForCausalLM #32199

Closed abpani closed 1 month ago

abpani commented 3 months ago

System Info

python 3.10.10
torch 2.3.1
transformers 4.43.2
optimum 1.17.1
auto_gptq 0.7.1
bitsandbytes 0.43.2
accelerate 0.33.0

Llama 3.1 8B Instruct gets loaded unevenly across the GPUs (see the two attached screenshots from 2024-07-24), so I can't even go above a batch size of 1 while fine-tuning.

Who can help?

No response

Information

Tasks

Reproduction

import sys, gc, torch, random, os
import numpy as np
import pandas as pd
import time
from datasets import load_dataset, Dataset, DatasetDict
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    GPTQConfig,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
    BitsAndBytesConfig,
)
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model
from trl import SFTTrainer
import wandb
wandb.init(mode = 'disabled')

CONTEXT_LENGTH = 8192
output_dir = "outputs_mi"

model_id = "./llama_models/Meta-Llama-3.1-8B-Instruct-gptq-4bit/"

tokenizer = AutoTokenizer.from_pretrained(model_id, model_max_length = CONTEXT_LENGTH)  # model_max_length is the kwarg the tokenizer actually honors
tokenizer.add_eos_token = True
tokenizer.padding_side = 'right'
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_id, device_map = 'auto')

model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

config = LoraConfig(
    r = 64,
    lora_alpha = 64,
    target_modules=["k_proj","o_proj","q_proj","v_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout = 0,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # already prints the trainable/total parameter counts

data_files = {"train": "full_label_train_data.csv", "test":"full_label_test_data.csv"}
dataset = load_dataset("csv", data_files = data_files)

print(dataset)

training_arguments = TrainingArguments(
    output_dir = output_dir,
    num_train_epochs = 100,
    overwrite_output_dir = True,
    per_device_train_batch_size = 1,
    per_device_eval_batch_size = 1,
    gradient_accumulation_steps = 4,
    optim = "paged_adamw_8bit",
    save_strategy = 'epoch',
    # save_steps = 500,
    warmup_ratio = 0.2,
    logging_steps = 2,
    learning_rate = 4e-4,
    # gradient_checkpointing=True,
    # gradient_checkpointing_kwargs={"use_reentrant": True},
    weight_decay = 0.001,
    fp16 = False,
    bf16 = True,
    max_steps= -1,
    max_grad_norm = 0.3, 
    group_by_length = True,
    lr_scheduler_type= "linear",
    use_cpu = False,
    report_to = "tensorboard",
    eval_strategy = "epoch"    
)
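The resulting placement can then be inspected directly; a minimal check (assuming two or more visible GPUs):

# Inspect how the modules were dispatched and how much memory each GPU ended up holding.
print(model.hf_device_map)  # set by from_pretrained whenever a device_map is used
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i} allocated: {torch.cuda.memory_allocated(i) / 1024**3:.2f} GiB")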

Expected behavior

I would like the model to be loaded evenly across the GPUs so that I can fine-tune with a larger batch size.

LysandreJik commented 3 months ago

Have you tried playing with different parameters of the device_map?

You can read more about it and about customizing it here: https://huggingface.co/docs/transformers/big_models#accelerates-big-model-inference

cc @SunMarc. I'm trying to find a doc that dives into the different attributes device_map can accept, but I'm not finding any such page in the transformers docs.
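For example, something like this (a rough sketch only; model_id is the path from your reproduction, the max_memory values are placeholders you would tune to your GPUs, and an explicit module-to-device dict can be passed instead of a string):

from transformers import AutoModelForCausalLM

# Sketch: keep device_map="auto" but cap how much of each GPU the planner may use.
# The limits below are placeholders, not recommendations.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    max_memory={0: "12GiB", 1: "12GiB", 2: "12GiB", 3: "12GiB"},
)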

abpani commented 3 months ago

Still the same issue. It shows different errors, such as tensors being loaded on different devices (cuda:0 and cuda:1). This is the device_map I passed: device_map = {'model.embed_tokens': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 0, 'model.layers.9': 0, 'model.layers.10': 0, 'model.layers.11': 1, 'model.layers.12': 1, 'model.layers.13': 1, 'model.layers.14': 1, 'model.layers.15': 1, 'model.layers.16': 1, 'model.layers.17': 1, 'model.layers.18': 1, 'model.layers.19': 2, 'model.layers.20': 2, 'model.layers.21': 2, 'model.layers.22': 2, 'model.layers.23': 2, 'model.layers.24': 2, 'model.layers.25': 2, 'model.layers.26': 2, 'model.layers.27': 3, 'model.layers.28': 3, 'model.layers.29': 3, 'model.layers.30': 3, 'model.layers.31': 3, 'model.norm': 3, 'lm_head': 3}

abpani commented 3 months ago

@LysandreJik You can find the details about the device map here: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/blob/main/model.safetensors.index.json
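To see the per-module sizes the device_map planner works with, here is a sketch using accelerate's compute_module_sizes on an empty (meta-device) model; the sizes are for the unquantized architecture, so treat them only as a rough guide for the GPTQ checkpoint:

from accelerate import init_empty_weights
from accelerate.utils import compute_module_sizes
from transformers import AutoConfig, AutoModelForCausalLM

# Build the architecture without materializing any weights, then measure each module.
# The gated hub repo is used here; the local GPTQ path from the reproduction works too.
config = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

sizes = compute_module_sizes(empty_model)
print(sizes["model.embed_tokens"], sizes["lm_head"], sizes["model.layers.0"])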

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

abpani commented 2 months ago

@LysandreJik I tried what you suggested; it's still the same issue in a multi-GPU environment.

SunMarc commented 2 months ago

Hey @abpani, the final allocation looks very strange indeed. Can you try device_map = "sequential" and set max_memory? Also, what do you mean by "it shows different errors like loaded in different devices"? Could you share the traceback? Thanks!
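Something along these lines (a sketch only; model_id is the path from your reproduction and the limits are placeholders for your actual cards):

from transformers import AutoModelForCausalLM

# Sketch: fill GPU 0 first, then GPU 1, and so on, up to the per-device limits.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="sequential",
    max_memory={0: "18GiB", 1: "18GiB", 2: "18GiB", 3: "18GiB"},  # placeholders
)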

abpani commented 2 months ago

@SunMarc The funny thing is that it does not happen with Mistral models; the loading is balanced for those. But with Qwen, Phi, and Llama it's still the same issue.

abpani commented 2 months ago

> Hey @abpani, the final allocation looks very strange indeed. Can you try device_map = "sequential" and set max_memory? Also, what do you mean by "it shows different errors like loaded in different devices"? Could you share the traceback? Thanks!

I don't have the traceback at hand right now, but the auto device_map should still work fine, since it works perfectly with all Mistral models.

ArthurZucker commented 2 months ago

Might just be the _no_split_modules setting, or simply the sizes of the models.
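To check that, a rough sketch with accelerate directly: it shows which module classes are kept whole and lets you compute (and tweak) the map yourself before loading. The max_memory values are placeholders, and the sizes are measured on the unquantized architecture.

from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained(model_id)  # model_id from the reproduction
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

print(empty_model._no_split_modules)  # e.g. ["LlamaDecoderLayer"]

device_map = infer_auto_device_map(
    empty_model,
    max_memory={0: "18GiB", 1: "18GiB", 2: "18GiB", 3: "18GiB"},  # placeholders
    no_split_module_classes=empty_model._no_split_modules,
)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device_map)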

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

ArthurZucker commented 1 month ago

Closing, as I believe you have the balanced option 🤗 Updating _no_split_modules is also possible. You can never split completely evenly, since the lm_head on its own is a lot bigger than, say, a single MLP.
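For reference, a minimal sketch of the balanced options (same model_id as in the reproduction):

from transformers import AutoModelForCausalLM

# Ask accelerate to spread the weights as evenly as it can across the visible GPUs.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="balanced")

# "balanced_low_0" keeps GPU 0 as free as possible, useful when GPU 0 also has to
# hold optimizer state or generation buffers during fine-tuning.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="balanced_low_0")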