rabiulcste opened this issue 5 months ago
cc @pacman100 @muellerzr, as the error appears to be Trainer + QLoRA related.
same error
I found a solution: remove `torch_dtype`, and it should work fine!
```python
from transformers import Idefics2ForConditionalGeneration

# bnb_config, args, and USE_QLORA come from the surrounding training script
model = Idefics2ForConditionalGeneration.from_pretrained(
    args.model_name,
    device_map="auto",
    low_cpu_mem_usage=True,
    quantization_config=bnb_config if USE_QLORA else None,
)
```
I'm facing the same issue with `torch_dtype=torch.float16`.
> I found a solution: remove `torch_dtype`, and it should work fine!
If `torch_dtype=torch.float16` is removed, the model weights take double the memory to load. Is there any way to train with fp16 weights and LoRA?
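For what it's worth, one commonly suggested workaround for this error is to keep the base weights in fp16 but upcast only the trainable LoRA parameters to fp32, so the grad scaler can unscale their gradients. A minimal sketch (the checkpoint name and LoRA hyperparameters here are illustrative, not from this thread):

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import Idefics2ForConditionalGeneration

model = Idefics2ForConditionalGeneration.from_pretrained(
    "HuggingFaceM4/idefics2-8b",   # assumed checkpoint name
    torch_dtype=torch.float16,     # base weights stay in fp16 (half the memory)
    device_map="auto",
    low_cpu_mem_usage=True,
)

lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)

# Upcast only the trainable (LoRA) parameters to fp32; fp16 gradients on
# trainable parameters are what makes GradScaler.unscale_() raise.
for param in model.parameters():
    if param.requires_grad:
        param.data = param.data.to(torch.float32)
```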
cc @muellerzr @SunMarc
Another ping @muellerzr @SunMarc
System Info
`transformers` version: 4.40.0.dev0

Who can help?
@amyeroberts
Information
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
During the training loop, when `accelerator.clip_grad_norm_()` is called, it triggers an unscale operation that fails because the gradients are in FP16. This points to an issue in how gradient scaling is handled under mixed-precision settings.
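For context, the unscale failure can be reproduced outside the Trainer. A minimal sketch of the failure mode (my own illustration, not the actual training loop; needs a CUDA GPU):

```python
import torch

model = torch.nn.Linear(4, 4).cuda().half()              # fp16 parameters
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(2, 4, device="cuda", dtype=torch.float16)
scaler.scale(model(x).sum()).backward()                  # fp16 gradients

# Accelerate's clip_grad_norm_ calls unscale_ internally; with fp16
# gradients this raises "ValueError: Attempting to unscale FP16 gradients."
scaler.unscale_(optimizer)
```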
Expected behavior
This doesn't happen with QLoRA set to True. I'd expect the model to fine-tune without error.