smreddy05 commented 7 months ago

I am trying to fine-tune gemma-7b on 4 A100 80 GB GPUs using 4-bit quantization.

model_id = "google/gemma-7b"

# BitsAndBytesConfig int-4 config

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

print("initiating model download")

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    use_cache=False,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    token=access_token,
)

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

# prepare model for training

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=15,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    logging_steps=100,
    save_strategy="epoch",
    learning_rate=2e-4,
    bf16=True,
    tf32=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    seed=42,
    eval_steps=100,
    lr_scheduler_type="cosine",
    evaluation_strategy='epoch',
    disable_tqdm=False,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    report_to="wandb",
    run_name=run_name,  # disable tqdm since with packing values are incorrect
)

from trl import SFTTrainer

max_seq_length = 2048  # max sequence length for model and packing of the dataset

trainer = SFTTrainer(
    model=model,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    formatting_func=generate_prompt,  # this will apply the create_prompt mapping to all training and test dataset
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)

trainer.train()

This is throwing "ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example `device_map={'':torch.cuda.current_device()}` or `device_map={'':torch.xpu.current_device()}`"

The same script works for other models such as llama2.

Versions used: transformers 4.38.1, trl 0.7.11

younesbelkada commented 7 months ago

Hi @smreddy05 Thanks for the issue ! Can you try out the solution proposed here: https://github.com/huggingface/trl/issues/1348#issuecomment-1959028364
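For readers following along, the error message itself hints at one fix: load the whole quantized model onto the device of the current process instead of using device_map="auto". The sketch below is a minimal illustration of that hint, not a copy of the linked comment. The helper name current_device_map is hypothetical, and reading LOCAL_RANK assumes the script is started with one process per GPU (for example via torchrun or accelerate launch).

```python
import os


def current_device_map(local_rank=None):
    """Return a device_map that places the entire model on a single GPU."""
    if local_rank is None:
        # LOCAL_RANK is set by launchers such as torchrun and accelerate launch.
        # Falling back to 0 for a single-process run is an assumption.
        local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    # The empty-string key stands for "the whole model".
    return {"": local_rank}


device_map = current_device_map()
print(device_map)

# Hypothetical usage, replacing device_map="auto" in the script above:
# model = AutoModelForCausalLM.from_pretrained(
#     model_id, quantization_config=bnb_config, device_map=device_map
# )
```

With this layout each GPU holds a full copy of the 4-bit model and training runs data-parallel, so per-GPU memory use is higher than when device_map="auto" shards the layers across GPUs.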

smreddy05 commented 7 months ago

@younesbelkada thanks for your suggestion. I am now hitting a new issue: "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.86 GiB. GPU 0 has a total capacity of 79.15 GiB of which 5.08 GiB is free. Process 73494 has 74.06 GiB memory in use. Of the allocated memory 69.76 GiB is allocated by PyTorch, and 2.78 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF"

younesbelkada commented 7 months ago

@smreddy05 now you're facing a CUDA OOM issue. Can you try to use Flash Attention 2 or decrease max_seq_length / batch size?
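To see why batch size and sequence length matter so much here, it can help to estimate the size of the output logits tensor, which grows with batch size × sequence length × vocabulary size. The sketch below is a back-of-the-envelope estimate only. The vocabulary sizes (256,000 for gemma-7b, 32,000 for llama2) come from the published model configs rather than from this thread, and the 4 bytes per value assumes the logits are held in float32 at some point. Whether this is the actual cause of the OOM reported above is not confirmed.

```python
GIB = 1024 ** 3


def logits_gib(batch_size, seq_len, vocab_size, bytes_per_value=4):
    """Approximate size in GiB of one [batch, seq, vocab] logits tensor."""
    return batch_size * seq_len * vocab_size * bytes_per_value / GIB


# Settings from the script above: per-device batch size 8, max_seq_length 2048.
# Vocabulary sizes are assumptions taken from the model configs.
gemma = logits_gib(batch_size=8, seq_len=2048, vocab_size=256_000)
llama2 = logits_gib(batch_size=8, seq_len=2048, vocab_size=32_000)

print(f"gemma-7b logits: {gemma:.2f} GiB")
print(f"llama2 logits:   {llama2:.2f} GiB")
```

Under these assumptions a single logits tensor is about 15.6 GiB for gemma-7b versus about 2.0 GiB for llama2, which is one plausible reason the same settings can fit for one model and not the other. Halving the batch size or the sequence length halves this number.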

smreddy05 commented 7 months ago

Hey @younesbelkada, I have been using Flash Attention since I first hit the 8-bit precision error. I also tried reducing batch_size, but I am still hitting the same issue, and the same code works for llama2. Not sure what's wrong here. I will try previous versions of trl and accelerate. Also, I am using 4-bit quantization, but the error talks about 8-bit precision. Am I missing something? Can you please share your thoughts on this? I really appreciate your help.

younesbelkada commented 7 months ago

I suspect the reason it worked for llama-2 is that llama has 6.74B parameters, whereas gemma-7b in reality has ~8.5B parameters.

You can also use gradient accumulation with a very small batch size. For the error you are getting, you need to update accelerate: pip install -U accelerate
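The gradient accumulation suggestion can be made concrete with a little arithmetic: shrink per_device_train_batch_size and raise gradient_accumulation_steps so that their product, the effective per-device batch, stays the same. The sketch below assumes the original settings from the script (8 × 2 = 16). The helper name accumulation_steps is hypothetical.

```python
def accumulation_steps(effective_batch, per_device_batch):
    """Accumulation steps needed to keep the same effective per-device batch."""
    if effective_batch % per_device_batch != 0:
        raise ValueError("effective batch must be a multiple of the micro-batch")
    return effective_batch // per_device_batch


# Original script: per_device_train_batch_size=8, gradient_accumulation_steps=2.
original_effective = 8 * 2

for micro_batch in (4, 2, 1):
    steps = accumulation_steps(original_effective, micro_batch)
    print(f"per_device_train_batch_size={micro_batch}, "
          f"gradient_accumulation_steps={steps}")
```

Peak activation memory follows the micro-batch, not the effective batch, so a micro-batch of 1 with 16 accumulation steps should need far less memory per step while stepping the optimizer on the same number of sequences.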

smreddy05 commented 7 months ago

@younesbelkada sorry for not being clear, I was referring to the llama2-70B model. As of now I am on accelerate 0.27.2 and trl 0.7.10, and I was using gradient_accumulation_steps=2.

github-actions[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

VIS-WA commented 5 months ago

Hi @smreddy05! Were you able to find a fix for the OutOfMemoryError? I have encountered a similar error where I am able to fine-tune llama2 13B but not gemma 7B (although I was using Trainer from the Transformers 4.41 library). The error occurs only when evaluation is enabled (do_eval=True); setting it to False makes everything work like a charm.

smreddy05 commented 5 months ago

@VIS-WA, sorry, I haven't spent time on this. But if we set do_eval=False, we cannot run any evaluation on the validation set, which makes it tricky to judge how good the fine-tuned model is.
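A possible middle ground between disabling evaluation and running out of memory is to keep evaluation on but shrink its memory footprint. TrainingArguments has a separate per_device_eval_batch_size (default 8, independent of the training batch size) and eval_accumulation_steps (how many prediction steps to accumulate on the GPU before moving outputs to the CPU). The values below are illustrative starting points, not settings verified against this issue.

```python
# Illustrative overrides for evaluation memory; the values are assumptions.
eval_memory_overrides = {
    # Evaluation uses its own batch size, which defaults to 8.
    "per_device_eval_batch_size": 1,
    # Move accumulated outputs off the GPU after every prediction step.
    "eval_accumulation_steps": 1,
}

print(eval_memory_overrides)

# Hypothetical usage alongside the other arguments from the script above:
# args = TrainingArguments(output_dir=output_dir, **eval_memory_overrides)
```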

huggingface / trl

8-bit precision error with fine tuning of gemma #1355
