huggingface / peft

đŸ¤— PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
https://huggingface.co/docs/peft
Apache License 2.0

GPU Allocation Issue (QLoRa + Llama3-8B-IT) #1716

Closed DONGRYEOLLEE1 closed 5 months ago

DONGRYEOLLEE1 commented 6 months ago

System Info

peft: 0.10.1.dev0
accelerate: 0.30.0
bitsandbytes: 0.43.1
transformers: 4.39.3
GPU: A6000 * 2 (96GB)
nvidia-driver version: 535.171.04
cuda: 11.8

Who can help?

No response


Reproduction

I was training a Llama3-8B-IT model with QLoRA. The training itself succeeded, but GPU memory wasn't evenly allocated across the two GPUs. Is this a version issue with peft or transformers? Or a version issue with the graphics driver? On a previous A100*8 server the memory was allocated evenly, so I don't know what is causing the imbalance in this case.

This is my script.

import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# 4-bit NF4 quantization with bf16 compute (QLoRA)
quantization_config = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_compute_dtype = torch.bfloat16,
    bnb_4bit_quant_type = "nf4",
    bnb_4bit_use_double_quant = True
)

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

tok = AutoTokenizer.from_pretrained(MODEL_ID)
tok.pad_token_id = tok.eos_token_id
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config = quantization_config,
    device_map = 'auto'
)

data = load_dataset("...")

proc_data = data.map(process, remove_columns = data['train'].column_names)

tokenized_proc_data = proc_data.map(lambda x: tok(x['text'], truncation = True, max_length = 2048), batched = True)
tokenized_proc_data = tokenized_proc_data.remove_columns("text")

lora_config = LoraConfig(
    r = 16,
    lora_alpha = 32,
    lora_dropout = 0.01,
    target_modules = "all-linear"
)

model = get_peft_model(model, lora_config)

train_args_trainer = TrainingArguments(
    num_train_epochs = 3,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 2,
    learning_rate = 2e-8,
    logging_steps = 100,
    warmup_steps = 100,
    save_total_limit = 3,
    output_dir = "llama3-7b-4bit-lora-test2",
    optim = "paged_adamw_32bit",
    bf16 = True,
    report_to = "wandb",
    run_name = "llama3-7b-4bit-lora-test2",
    remove_unused_columns=False
)

model.is_parallelizable = True
model.model_parallel = True

trainer = Trainer(
    model = model,
    tokenizer = tok,
    args = train_args_trainer,
    train_dataset = tokenized_proc_data['train'],
    data_collator = DataCollatorForLanguageModeling(tok, mlm = False)
)

trainer.train()

nvidia-smi output during training:

Wed May  8 06:54:12 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:1F:00.0 Off |                  Off |
| 30%   58C    P2             145W / 300W |  13224MiB / 49140MiB |     40%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:8B:00.0 Off |                  Off |
| 47%   71C    P2             221W / 300W |  32908MiB / 49140MiB |     73%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    207219      C   /data/envs/tt/bin/python                  13090MiB |
|    1   N/A  N/A    207219      C   /data/envs/tt/bin/python                  32774MiB |
+---------------------------------------------------------------------------------------+
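
For reference, a minimal sketch of how the placement could be inspected right after from_pretrained (hf_device_map is the per-module map that accelerate records when device_map = 'auto' is used; I have not included its output here):

import torch

# how accelerate split the model across the two GPUs
print(model.hf_device_map)

# PyTorch-allocated memory per GPU (in MiB), to compare against nvidia-smi
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.memory_allocated(i) // 2**20, "MiB")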

Expected behavior

I want the GPUs to be evenly allocated.
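
One workaround I have been considering is to pass an explicit per-GPU budget via max_memory so that device_map = 'auto' is forced to spread the weights more evenly. A minimal sketch (not verified on this setup; the limits are placeholders, not tuned values):

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config = quantization_config,
    device_map = 'auto',
    # placeholder caps: leave headroom on each 48GB card for activations and optimizer state
    max_memory = {0: "22GiB", 1: "22GiB"}
)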

BenjaminBossan commented 6 months ago

Hmm, hard to say, and I can't easily reproduce this. Do you already see strange behavior right after loading the model, before training starts? If you try without PEFT, do you see the same issue? (If the full model doesn't fit into memory without PEFT, you could e.g. turn off autograd on most of the layers to "simulate" parameter-efficient fine-tuning, as in the sketch below.) If you do, this could be an accelerate issue.
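
Something along these lines would be a rough sketch (assuming the model is loaded in bf16 without quantization; the module names below are those of LlamaForCausalLM):

import torch
from transformers import AutoModelForCausalLM

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

# full model in bf16, no PEFT, split across both GPUs
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype = torch.bfloat16,
    device_map = "auto"
)

# freeze everything ...
for param in model.parameters():
    param.requires_grad_(False)

# ... then unfreeze only a small subset (here: the last decoder block),
# so the trainable footprint is roughly comparable to a LoRA run
for param in model.model.layers[-1].parameters():
    param.requires_grad_(True)

The same Trainer setup as in the script above can then be reused to check whether the memory imbalance is still there.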