abpani opened this issue 2 months ago
Does the OOM error occur when loading the model? Having the full traceback always helps.
A 24B-parameter model requires 48 GB of memory to load in bfloat16. Unless you use memory sharding, you need this amount of memory on each of your GPUs (you currently only have 24 GB per GPU). You probably need to distribute the weights, see DeepSpeed.
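For reference, a minimal sketch of what distributing the weights could look like through Accelerate's `DeepSpeedPlugin` with ZeRO stage 3 (this is an illustration, not code from the thread; the stage and precision values are assumptions to adapt):

```python
# Illustrative only: shard parameters, gradients and optimizer state across
# GPUs with ZeRO stage 3 instead of keeping a full model copy on every device.
from accelerate import Accelerator, DeepSpeedPlugin

ds_plugin = DeepSpeedPlugin(zero_stage=3, gradient_accumulation_steps=1)  # assumed values
accelerator = Accelerator(deepspeed_plugin=ds_plugin, mixed_precision="bf16")
# model, optimizer and dataloaders would then go through accelerator.prepare(...)
```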
@qgallouedec No, it does not. I am using bnb 4-bit to load the model. As you can see above, it only takes 9 GB and is distributed among 4 GPUs.
Can you please provide your system info and the full traceback?
```
accelerate                0.34.2
bitsandbytes              0.43.3
flash-attn                2.6.1
nvidia-cublas-cu12        12.1.3.1
nvidia-cuda-cupti-cu12    12.1.105
nvidia-cuda-nvrtc-cu12    12.1.105
nvidia-cuda-runtime-cu12  12.1.105
nvidia-cudnn-cu12         9.1.0.70
nvidia-cufft-cu12         11.0.2.54
nvidia-curand-cu12        10.3.2.106
nvidia-cusolver-cu12      11.4.5.107
nvidia-cusparse-cu12      12.1.0.106
nvidia-nccl-cu12          2.20.5
nvidia-nvjitlink-cu12     12.6.68
nvidia-nvtx-cu12          12.1.105
torch                     2.4.1
trl                       0.10.1
transformers              4.44.2
```
How can I get a traceback? Can you please help me with that? I have never had this issue before. I think it happens only with models whose vocab size is around 130k-131k.
It does not happen with Mistral v0.2 or v0.3, as they have a 32k vocab size.
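As a side note, an illustrative back-of-the-envelope calculation (not a measurement from this setup, and the sequence length is an assumed value): the logits tensor of a causal LM has shape `(batch_size, seq_len, vocab_size)` and is typically upcast to float32 for the loss, so its memory grows linearly with both the batch size and the vocab size.

```python
# Rough estimate of the memory taken by the fp32 logits tensor alone.
def logits_gib(batch_size, seq_len, vocab_size, bytes_per_elem=4):
    return batch_size * seq_len * vocab_size * bytes_per_elem / 1024**3

for vocab_size in (32_768, 131_072):   # roughly Mistral v0.3 vs Mistral-Nemo vocab
    for batch_size in (1, 2):
        print(f"vocab={vocab_size:>7} batch={batch_size}: "
              f"{logits_gib(batch_size, 8192, vocab_size):.1f} GiB")
```

At an 8192-token sequence length this tensor alone grows from ~2 GiB to ~8 GiB when going from a 32k to a 131k vocab at batch size 2, which would be consistent with the OOM showing up only for the larger-vocab models.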
The traceback is the error message. Example:

```python
1/0
```

traceback:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ZeroDivisionError: division by zero
```
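If the error only shows up mid-training, one way to make sure the full traceback is captured is to wrap the training call. This is just a suggestion (not an official TRL recipe), assuming the `trainer` object from the reproduction script below:

```python
# Wrap the training call so the full stack trace and allocator stats are
# printed even if the surrounding logs get truncated.
import traceback
import torch

try:
    trainer.train()
except torch.cuda.OutOfMemoryError:
    traceback.print_exc()                # shows exactly which op raised the OOM
    print(torch.cuda.memory_summary())   # per-device allocator statistics
    raise
```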
I don't get any errors, though. There is no error with a batch size of 1; I only get the OOM error when I use more than 1. Please let me know if you need that.
Even when I tried with DeepSpeed and Accelerate, I can't go above a batch size of 1.
Thanks for these elements. I do need the traceback to see where the OOM error occurs. It's always good practice to include it.
Now I can use a batch size of 2, but the last GPU is almost full. It may give an OOM error on a specific sample while fine-tuning. I will share that when I get it. Thank you.
You suggested trying DeepSpeed. I tried, but my accelerator.process_index only shows GPU 0, so the model gets loaded onto GPU 0 only, and then it raises an OOM error with a batch size of more than 1.

```python
from accelerate import Accelerator, DeepSpeedPlugin
from accelerate.utils import gather_object

DEEPSPEED_CONFIG = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": False},
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "gather_16bit_weights_on_model_save": True,
        "round_robin_gradients": True,
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 10,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": False,
}

deepspeed_plugin = DeepSpeedPlugin(hf_ds_config=DEEPSPEED_CONFIG)
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin, mixed_precision="fp16")

# each GPU creates a string
message = [f"Hello this is GPU {accelerator.process_index}"]

# collect the messages from all GPUs
messages = gather_object(message)

# output the messages only on the main process with accelerator.print()
accelerator.print(messages)
```

Output:

```
['Hello this is GPU 0']
```
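Seeing only GPU 0 here usually means the script was started with plain `python`, so Accelerate runs a single process and DeepSpeed never shards anything across the 4 GPUs. A quick sketch to check this (`train.py` is a placeholder name, not a file from the thread):

```python
# Launched as `python train.py`                              -> 1 process, process_index is always 0
# Launched as `accelerate launch --num_processes 4 train.py` -> 4 processes, process_index 0..3
from accelerate import Accelerator

accelerator = Accelerator()
print(f"process {accelerator.process_index} of {accelerator.num_processes}")
```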
```
Traceback (most recent call last):
  File "/home/ubuntu/abpani/FundName/llama3_8b_qlora-bnb.py", line 94, in
```
Now I just tried Mistral v0.3 Instruct with

```python
per_device_train_batch_size = 6,
per_device_eval_batch_size = 6,
gradient_accumulation_steps = 8,
```

and it is working great.
I can't reproduce. This is the GPU usage I get:
GPU Memory used ~12GB
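For comparing numbers across setups, a small helper like the one below (an addition of mine, not part of the original script) prints per-GPU allocator usage from inside the training script instead of relying on `nvidia-smi` snapshots:

```python
import torch

def report_gpu_memory():
    # PyTorch allocator view: memory actually allocated vs reserved per device.
    for i in range(torch.cuda.device_count()):
        allocated = torch.cuda.memory_allocated(i) / 1024**3
        reserved = torch.cuda.memory_reserved(i) / 1024**3
        print(f"cuda:{i}: allocated={allocated:.1f} GiB, reserved={reserved:.1f} GiB")
```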
@qgallouedec I am trying with an 8192 context length. I tried installing a lot of package combinations and still get the same issue with models that have a large vocab size.
> I tried installing a lot of package combinations
Thanks for contributing to this thread, but you have to understand that if we are to find an explanation and solve this problem, we need to be able to reproduce it on our side. Please provide a minimal example code, with the minimal combination of packages installed in their most recent versions. For the moment it's still impossible for me to reproduce.
System Info
Hello, I am trying to load Mistral-Nemo-Instruct-2407 in bnb 4-bit on 4 A10 GPUs on an EC2 instance. I upgraded all the packages, but I still face a CUDA out-of-memory error when the train batch size is more than 1; I can't fine-tune the model with even a batch size of 2. The model gets loaded as below with AutoModelForCausalLM; this is the GPU usage when the batch size is 1:
Information

Tasks

Reproduction
```python
import sys, gc, torch, random, os
import numpy as np
import pandas as pd
import time
from datasets import load_dataset, Dataset, DatasetDict
from transformers import AutoTokenizer, AutoModelForCausalLM, GPTQConfig
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling, BitsAndBytesConfig, Qwen2ForCausalLM
from trl import SFTConfig, SFTTrainer
from prepare_data import PrepareData

CONTEXT_LENGTH = 4096
output_dir = "outputs_mi"

model_id = "Nemo-Instruct"
if torch.cuda.get_device_capability()[0] >= 8:
    torch_dtype = torch.bfloat16
    attn_implementation = "flash_attention_2"
else:
    torch_dtype = torch.float16
    attn_implementation = "eager"

tokenizer = AutoTokenizer.from_pretrained(model_id, max_seq_length=CONTEXT_LENGTH)
tokenizer.padding_side = "right"
tokenizer.pad_token = tokenizer.eos_token

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch_dtype,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation=attn_implementation,
)
model = prepare_model_for_kbit_training(model)

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["up_proj", "down_proj", "gate_proj", "k_proj", "q_proj", "v_proj", "o_proj"],
)

prepare_data = PrepareData(json_file="less_freq_fund_data_train.jsonl")
dataset = prepare_data.prepare_sft_chat_data()
print(dataset)

training_arguments = SFTConfig(
    output_dir=output_dir,
    dataset_text_field="text",
    max_seq_length=CONTEXT_LENGTH,
    num_train_epochs=10,
    overwrite_output_dir=True,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=2,
    optim="paged_adamw_8bit",
    save_strategy="epoch",
    save_steps=500,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    peft_config=peft_config,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
)

trainer.train()
trainer.save_model(output_dir)

# Flush memory
del trainer, model
gc.collect()
gc.collect()
torch.cuda.empty_cache()
```
Expected behavior
With 96 GB of total GPU memory, I should be able to fine-tune with a batch size larger than 1.