huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

Uneven model loading #2079

Open abpani opened 2 weeks ago

abpani commented 2 weeks ago

System Info

Hello, I am trying to load Mistral-Nemo-Instruct-2407 in bnb 4-bit on 4 A10 GPUs on an EC2 instance. I upgraded all the packages, but I still hit a CUDA out-of-memory error whenever the train batch size is more than 1; I can't fine-tune the model even with a batch size of 2. The model gets loaded with AutoModelForCausalLM as shown in the first screenshot [screenshot: layer-to-GPU device map]; the second screenshot shows GPU usage when the batch size is 1 [screenshot: GPU usage].

Information

Tasks

Reproduction

import sys, gc, torch, random, os
import numpy as np
import pandas as pd
import time
from datasets import load_dataset, Dataset, DatasetDict
from transformers import AutoTokenizer, AutoModelForCausalLM, GPTQConfig
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling, BitsAndBytesConfig, Qwen2ForCausalLM
from trl import SFTConfig, SFTTrainer
from prepare_data import PrepareData

CONTEXT_LENGTH = 4096
output_dir = "outputs_mi"

model_id = "Nemo-Instruct"
if torch.cuda.get_device_capability()[0] >= 8:
    torch_dtype = torch.bfloat16
    attn_implementation = "flash_attention_2"
else:
    torch_dtype = torch.float16
    attn_implementation = "eager"

tokenizer = AutoTokenizer.from_pretrained(model_id, max_seq_length=CONTEXT_LENGTH)
tokenizer.padding_side = 'right'
tokenizer.pad_token = tokenizer.eos_token

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch_dtype,
)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config = bnb_config, device_map = 'auto', attn_implementation=attn_implementation)

model = prepare_model_for_kbit_training(model)

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['up_proj', 'down_proj', 'gate_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj'],
)

prepare_data = PrepareData(json_file="less_freq_fund_data_train.jsonl")
dataset = prepare_data.prepare_sft_chat_data()
print(dataset)

training_arguments = SFTConfig(
    output_dir=output_dir,
    dataset_text_field="text",
    max_seq_length=CONTEXT_LENGTH,
    num_train_epochs=10,
    overwrite_output_dir=True,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=2,
    optim="paged_adamw_8bit",
    save_strategy='epoch',
    save_steps=500,
    bf16=True,
    warmup_ratio=0.3,
    logging_steps=1,
    learning_rate=4e-5,
    gradient_checkpointing=True,
    weight_decay=0.01,
    max_steps=-1,
    max_grad_norm=1,
    group_by_length=True,
    lr_scheduler_type="linear",
    use_cpu=False,
    report_to="tensorboard",
    eval_strategy="epoch",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    peft_config=peft_config,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
)

trainer.train()
trainer.save_model(output_dir)

# Flush memory
del trainer, model
gc.collect()
gc.collect()
torch.cuda.empty_cache()

Expected behavior

With 96 GB of total GPU memory I should be able to fine-tune with a batch size greater than 1.

qgallouedec commented 2 weeks ago

Does the OOM error occur when loading the model? Having the full traceback always helps.

A 24B-parameter model requires 48 GB of memory to load in bfloat16. Unless you shard the model, you need this amount of memory on each of your GPUs (yours currently have only 24 GB each). You probably need to distribute the weights across GPUs; see DeepSpeed.
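As a quick sketch of that arithmetic (the bytes-per-parameter figures are the usual rule of thumb, and the helper below is only illustrative):

# back-of-the-envelope weight memory: parameters * bytes per parameter
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    return n_params * bytes_per_param / 1e9

print(weight_memory_gb(24e9, 2.0))  # bfloat16: ~48 GB of weights
print(weight_memory_gb(24e9, 0.5))  # 4-bit NF4: ~12 GB, plus quantization overhead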

abpani commented 2 weeks ago

@qgallouedec No, it does not. I am using bnb 4-bit to load the model; as you can see above, it only takes about 9 GB and is distributed across the 4 GPUs.
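A minimal sketch (reusing the variables from the reproduction script above) for checking how device_map='auto' spread the quantized weights and how much memory each card holds:

# layer -> device assignment produced by device_map='auto'
print(model.hf_device_map)

# memory currently allocated by PyTorch on each GPU
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.memory_allocated(i) / 1e9:.1f} GB allocated")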

qgallouedec commented 2 weeks ago

Can you please provide your system info and the full traceback?

abpani commented 2 weeks ago

accelerate                0.34.2
bitsandbytes              0.43.3
flash-attn                2.6.1
nvidia-cublas-cu12        12.1.3.1
nvidia-cuda-cupti-cu12    12.1.105
nvidia-cuda-nvrtc-cu12    12.1.105
nvidia-cuda-runtime-cu12  12.1.105
nvidia-cudnn-cu12         9.1.0.70
nvidia-cufft-cu12         11.0.2.54
nvidia-curand-cu12        10.3.2.106
nvidia-cusolver-cu12      11.4.5.107
nvidia-cusparse-cu12      12.1.0.106
nvidia-nccl-cu12          2.20.5
nvidia-nvjitlink-cu12     12.6.68
nvidia-nvtx-cu12          12.1.105
torch                     2.4.1
trl                       0.10.1
transformers              4.44.2

How can I get the traceback? Can you please help me with that? I have never had this issue before. I think it happens only with models whose vocab size is around 130k-131k.

It does not happen with Mistral v0.2 or v0.3, since those have a 32k vocab size.
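A rough sketch of why the vocab size matters: the logits tensor grows with batch_size * seq_len * vocab_size, and the Mistral modeling code upcasts it to float32 before computing the loss, so a ~131k vocab needs roughly four times the logits memory of a 32k vocab (the sizes below are illustrative, using the 4096-token context from the script above):

# float32 logits memory in GB: batch * seq_len * vocab * 4 bytes
def logits_memory_gb(batch_size: int, seq_len: int, vocab_size: int) -> float:
    return batch_size * seq_len * vocab_size * 4 / 1e9

print(logits_memory_gb(2, 4096, 131072))  # ~4.3 GB with a ~131k vocab
print(logits_memory_gb(2, 4096, 32768))   # ~1.1 GB with a 32k vocab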

qgallouedec commented 2 weeks ago

The traceback is the full error message. For example:

1/0

traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ZeroDivisionError: division by zero
abpani commented 2 weeks ago

I don't get any errors, though. There is no error with a batch size of 1; I only get the OOM error when I go above 1. Please let me know if you still need it.

abpani commented 2 weeks ago

I also tried with DeepSpeed and Accelerate, and I still can't go above a batch size of 1.

qgallouedec commented 2 weeks ago

Thanks for these details. I do need the traceback to see where the OOM error occurs; it's always good practice to include it.

abpani commented 2 weeks ago

[Screenshot 2024-09-19 at 9:24 AM: GPU usage] Now I can run with a batch size of 2, but the last GPU is almost full. It may still hit OOM on a specific sample during fine-tuning; I will share that when I get it. Thank you.
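One thing worth trying (a sketch, not something verified in this thread) is passing max_memory so that device_map='auto' leaves headroom for activations on every card instead of filling the last GPUs; the 16GiB cap below is illustrative:

# cap how much of each 24 GB A10 the weights may occupy, leaving room for activations
max_memory = {i: "16GiB" for i in range(torch.cuda.device_count())}

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    max_memory=max_memory,
    attn_implementation=attn_implementation,
)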

abpani commented 2 weeks ago

You suggested trying DeepSpeed. I tried, but accelerator.process_index only ever shows GPU 0, so the model gets loaded onto GPU 0 only, and it then raises an OOM error with a batch size of more than 1.

DEEPSPEED_CONFIG = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": False},
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "gather_16bit_weights_on_model_save": True,
        "round_robin_gradients": True
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 10,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": False
}
deepspeed_plugin = DeepSpeedPlugin(hf_ds_config=DEEPSPEED_CONFIG)

accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin, mixed_precision="fp16")

# each GPU creates a string
message = [f"Hello this is GPU {accelerator.process_index}"]

# collect the messages from all GPUs
messages = gather_object(message)

# output the messages only on the main process with accelerator.print()
accelerator.print(messages)

['Hello this is GPU 0']
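Note that accelerator.process_index only varies when the script is started with a distributed launcher (e.g. accelerate launch --num_processes 4 script.py, or torchrun); under a plain python script.py there is a single process, so it always reports GPU 0. A minimal check, assuming the same environment:

from accelerate import Accelerator

# run with: accelerate launch --num_processes 4 <this script>
accelerator = Accelerator()
print(f"process {accelerator.process_index} of {accelerator.num_processes}")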

abpani commented 2 weeks ago

This is the traceback when I use a batch size of 3.

Traceback (most recent call last):
  File "/home/ubuntu/abpani/FundName/llama3_8b_qlora-bnb.py", line 94, in <module>
    trainer.train()
  File "/home/ubuntu/abpani/FundName/myenv/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 450, in train
    output = super().train(*args, **kwargs)
  File "/home/ubuntu/abpani/FundName/myenv/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
  File "/home/ubuntu/abpani/FundName/myenv/lib/python3.10/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ubuntu/abpani/FundName/myenv/lib/python3.10/site-packages/transformers/trainer.py", line 3318, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/ubuntu/abpani/FundName/myenv/lib/python3.10/site-packages/transformers/trainer.py", line 3363, in compute_loss
    outputs = model(**inputs)
  File "/home/ubuntu/abpani/FundName/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/abpani/FundName/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/abpani/FundName/myenv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 820, in forward
    return model_forward(*args, **kwargs)
  File "/home/ubuntu/abpani/FundName/myenv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 808, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/ubuntu/abpani/FundName/myenv/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 43, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/ubuntu/abpani/FundName/myenv/lib/python3.10/site-packages/peft/peft_model.py", line 1577, in forward
    return self.base_model(
  File "/home/ubuntu/abpani/FundName/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/abpani/FundName/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/abpani/FundName/myenv/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 188, in forward
    return self.model.forward(*args, **kwargs)
  File "/home/ubuntu/abpani/FundName/myenv/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/ubuntu/abpani/FundName/myenv/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 1053, in forward
    shift_logits = logits[..., :-1, :].contiguous()
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.00 GiB. GPU 3 has a total capacity of 22.19 GiB of which 4.05 GiB is free. Including non-PyTorch memory, this process has 18.13 GiB memory in use. Of the allocated memory 13.80 GiB is allocated by PyTorch, and 4.02 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
  0%|          | 0/4120 [00:54<?, ?it/s]
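For reference, the 6.00 GiB request at shift_logits matches the float32 logits tensor for this batch (a rough check, assuming the full 4096-token context and the ~131k vocabulary discussed above):

# (batch=3, seq=4096, vocab=131072) float32 logits copied by .contiguous()
print(3 * 4096 * 131072 * 4 / 1024**3)  # ~6.0 GiB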

abpani commented 2 weeks ago

Now I just tried Mistral v0.3 Instruct with:

per_device_train_batch_size = 6,
per_device_eval_batch_size = 6,
gradient_accumulation_steps = 8,

[Screenshot 2024-09-19 at 10:17 AM: GPU usage]

It is working great.