Error on fine tuning paligemma for object detection

hadariru commented 1 month ago

System Info

transformers version: 4.41.2
Platform: Linux-5.15.0-91-generic-x86_64-with-glibc2.31
Python version: 3.11.9
Huggingface_hub version: 0.23.3
Safetensors version: 0.4.3
Accelerate version: 0.31.0
Accelerate config: not found
PyTorch version (GPU?): 2.3.1+cu121 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: yes
Using distributed or parallel set-up in script?: no

Who can help?

@muellerzr @SunMarc @amyeroberts

Information

[ ] The official example scripts
[X] My own modified scripts

Tasks

[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[X] My own task or dataset (give details below)

Reproduction

1. Copy the fine-tuning script for PaliGemma on text (https://huggingface.co/blog/paligemma)
2. Create a detection dataset for the PaliGemma model (https://github.com/ariG23498/ft-pali-gemma)
3. Train the PaliGemma detection model on the detection dataset
4. Add the evaluation settings to the training args

Adding these extra arguments in step 4 causes evaluation to be performed:

    args = TrainingArguments(
        eval_accumulation_steps=4,
        per_device_eval_batch_size=1,
        eval_steps=10,
        eval_strategy=IntervalStrategy.STEPS,
    )

get_eval_dataloader is overridden with this:

    def get_eval_dataloader(self, eval_dataset: Optional[Dataset] = None) -> DataLoader:
        """
        Returns the evaluation [`~torch.utils.data.DataLoader`].

        Subclass and override this method if you want to inject some custom behavior.

        Args:
            eval_dataset (`torch.utils.data.Dataset`, *optional*):
                If provided, will override `self.eval_dataset`. If it is a [`~datasets.Dataset`], columns not accepted
                by the `model.forward()` method are automatically removed. It must implement `__len__`.
        """
        if eval_dataset is None and self.eval_dataset is None:
            raise ValueError("Trainer: evaluation requires an eval_dataset.")

        # If we have persistent workers, don't do a fork bomb especially as eval datasets
        # don't change during training
        if hasattr(self, "_eval_dataloader") and self.args.dataloader_persistent_workers:
            return self.accelerator.prepare(self._eval_dataloader)
        eval_dataset = eval_dataset if eval_dataset is not None else self.eval_dataset
        data_collator = partial(self.data_collator, train=False)

        if is_datasets_available() and isinstance(eval_dataset, datasets.Dataset):
            eval_dataset = self._remove_unused_columns(eval_dataset, description="evaluation")
        else:
            data_collator = self._get_collator_with_removed_columns(data_collator, description="evaluation")

        dataloader_params = {
            "batch_size": self.args.eval_batch_size,
            "collate_fn": data_collator,
            "num_workers": self.args.dataloader_num_workers,
            "pin_memory": self.args.dataloader_pin_memory,
            "persistent_workers": self.args.dataloader_persistent_workers,
        }

        if not isinstance(eval_dataset, torch.utils.data.IterableDataset):
            dataloader_params["sampler"] = self._get_eval_sampler(eval_dataset)
            dataloader_params["drop_last"] = self.args.dataloader_drop_last
            dataloader_params["prefetch_factor"] = self.args.dataloader_prefetch_factor

        # accelerator.free_memory() will destroy the references, so
        # we need to store the non-prepared version
        eval_dataloader = DataLoader(eval_dataset, **dataloader_params)
        if self.args.dataloader_persistent_workers:
            self._eval_dataloader = eval_dataloader

        return self.accelerator.prepare(eval_dataloader)

The data_collator is adapted from step 2:

    def collate_fn(examples, image_title, prompt, suffix_title, processor, device, train):
        images = [example[image_title].convert("RGB") for example in examples]

        prompt = [prompt for _ in examples]
        if train:
            suffix = [example[suffix_title] for example in examples]
        else:
            suffix = None

        # Help from: https://github.com/huggingface/transformers/issues/30987
        inputs = processor(
            images=images,
            text=prompt,
            suffix=suffix,
            return_tensors="pt",
            padding="longest",
        )

        inputs = inputs.to(torch.bfloat16).to(device)
        return inputs

    collate_fn_trainer = partial(
        collate_fn,
        image_title="image",
        prompt="detect table",
        suffix_title="paligemma_label",
        processor=processor,
        device=device,
    )
    trainer = PaliGemmaImageTrainer(
        model=model,
        train_dataset=train_dataset,
        eval_dataset=validation_dataset,
        data_collator=collate_fn_trainer,
        args=args,
        compute_metrics=compute_metrics,
    )

Error logs:

{'loss': 6.2432, 'grad_norm': 3.953125, 'learning_rate': 1.99046483909416e-05, 'epoch': 0.12}
1%|▌ | 10/1680 [00:22<1:03:16, 2.27s/it
 | 3/75 [00:00<00:04, 17.52it/s]
Traceback (most recent call last):
  File "/share/personal/darwin/repos/python/trainer/object_detection_ft.py", line 360, in <module>
    trainer.train()
  File "xxx/lib/python3.11/site-packages/transformers/trainer.py", line 1885, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "xxx/lib/python3.11/site-packages/transformers/trainer.py", line 2291, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
  File "xxx/lib/python3.11/site-packages/transformers/trainer.py", line 2721, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "xxx/lib/python3.11/site-packages/transformers/trainer.py", line 3572, in evaluate
    output = eval_loop(
             ^^^^^^^^^^
  File "xxx/lib/python3.11/site-packages/transformers/trainer.py", line 3812, in evaluation_loop
    del losses, logits, labels, inputs
        ^^^^^^
UnboundLocalError: cannot access local variable 'losses' where it is not associated with a value
1%| | 10/1680 [00:24<1:07:22, 2.42s/it]

Expected behavior

No error on evaluation (losses should exist, I think)

amyeroberts commented 1 month ago

cc @molbap

molbap commented 1 month ago

Thanks for the issue @hadariru. Just one note: it looks like the fine-tuning itself is working (i.e. the loss goes down if you don't add eval), and it's the evaluation part in Trainer that has an issue? It seems the only way for losses to be inaccessible would be prediction_step failing. cc @muellerzr in case you are familiar; I will take a look at this soon.

hadariru commented 1 month ago

@molbap Yes, the evaluation part is giving me an error. Training itself is working fine, and I can see that fine-tuning is working okay (I checked by running prediction on the training data).

SangbumChoi commented 1 month ago

@muellerzr @molbap @hadariru I think this happens because the Trainer accepts the case where the loss is None.

https://github.com/huggingface/transformers/blob/ab0f050b42d903f34d6eb97f3f8c0c07f0517ad2/src/transformers/trainer.py#L3765

When the loss is None and you want to compute metrics, losses is never defined, because there is no point in calling the gather function on None in a multi-GPU setup. So you cannot del the losses variable, since it has not been defined.

SangbumChoi commented 1 month ago

I think there are two ways to make this work:

1. @hadariru Make sure that PaliGemma returns an appropriate loss value (check that you set the appropriate arguments).
2. @muellerzr Or we can add an if/else check to the Trainer to test whether that value can be deleted.
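The failure mode described above, and the kind of guard suggested in the second option, can be sketched with a toy loop. This is only an illustration in plain Python: `eval_loop_unguarded` and `eval_loop_guarded` are hypothetical names, not functions from transformers, and the real `evaluation_loop` does much more than this.

```python
def eval_loop_unguarded(step_losses):
    """Toy loop: `losses` is only bound when a step produced a loss."""
    steps = 0
    for loss in step_losses:
        if loss is not None:
            losses = [loss]  # bound only on this branch
        steps += 1
        del losses  # raises UnboundLocalError when `losses` is unbound
    return steps


def eval_loop_guarded(step_losses):
    """Same loop, but the name is always bound before cleanup."""
    steps = 0
    for loss in step_losses:
        losses = None  # always bind the name, even without a loss
        if loss is not None:
            losses = [loss]
        steps += 1
        del losses  # safe: `losses` exists on every iteration
    return steps


if __name__ == "__main__":
    print(eval_loop_guarded([None, 0.5, None]))
    try:
        eval_loop_unguarded([None])
    except UnboundLocalError as err:
        print("unguarded loop failed:", type(err).__name__)
```

Binding the name unconditionally is one possible guard; checking whether the loss was None before the `del` would work equally well in this toy version.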

hadariru commented 1 month ago

@SangbumChoi this is the model that I used

    model = PaliGemmaForConditionalGeneration.from_pretrained(
        object_detection_config.MODEL_ID,
        torch_dtype=object_detection_config.MODEL_DTYPE,
        device_map=device,
        revision=object_detection_config.MODEL_REVISION,
    )

I tried to trace back why the loss is None. I found that, during evaluation, self.label_names is [] and loss_without_labels is False.

I am not sure what value to give label_names or how to set it on the trainer.

hadariru commented 1 month ago

Changing data_collator = partial(self.data_collator, train=False) to data_collator = partial(self.data_collator, train=True) in get_eval_dataloader

gives me this error:

Traceback (most recent call last):
  File "xxx", line 361, in <module>
    trainer.train()
  File "xxxlib/python3.11/site-packages/transformers/trainer.py", line 1885, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "xxxlib/python3.11/site-packages/transformers/trainer.py", line 2291, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
  File "xxxlib/python3.11/site-packages/transformers/trainer.py", line 2721, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "xxxlib/python3.11/site-packages/transformers/trainer.py", line 3572, in evaluate
    output = eval_loop(
             ^^^^^^^^^^
  File "xxxlib/python3.11/site-packages/transformers/trainer.py", line 3780, in evaluation_loop
    all_preds.add(logits)
  File "xxxlib/python3.11/site-packages/transformers/trainer_pt_utils.py", line 326, in add
    self.tensors = nested_concat(self.tensors, tensors, padding_index=self.padding_index)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "xxxlib/python3.11/site-packages/transformers/trainer_pt_utils.py", line 138, in nested_concat
    return type(tensors)(nested_concat(t, n, padding_index=padding_index) for t, n in zip(tensors, new_tensors))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "xxxlib/python3.11/site-packages/transformers/trainer_pt_utils.py", line 138, in <genexpr>
    return type(tensors)(nested_concat(t, n, padding_index=padding_index) for t, n in zip(tensors, new_tensors))
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "xxxlib/python3.11/site-packages/transformers/trainer_pt_utils.py", line 138, in nested_concat
    return type(tensors)(nested_concat(t, n, padding_index=padding_index) for t, n in zip(tensors, new_tensors))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "xxxlib/python3.11/site-packages/transformers/trainer_pt_utils.py", line 138, in <genexpr>
    return type(tensors)(nested_concat(t, n, padding_index=padding_index) for t, n in zip(tensors, new_tensors))
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "xxxlib/python3.11/site-packages/transformers/trainer_pt_utils.py", line 138, in nested_concat
    return type(tensors)(nested_concat(t, n, padding_index=padding_index) for t, n in zip(tensors, new_tensors))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "xxxlib/python3.11/site-packages/transformers/trainer_pt_utils.py", line 138, in <genexpr>
    return type(tensors)(nested_concat(t, n, padding_index=padding_index) for t, n in zip(tensors, new_tensors))
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "xxxlib/python3.11/site-packages/transformers/trainer_pt_utils.py", line 140, in nested_concat
    return torch_pad_and_concatenate(tensors, new_tensors, padding_index=padding_index)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "xxxlib/python3.11/site-packages/transformers/trainer_pt_utils.py", line 99, in torch_pad_and_concatenate
    return torch.cat((tensor1, tensor2), dim=0)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 454 but got size 482 for tensor number 1 in the list.
  0%|          | 10/24240 [00:27<18:35:03,  2.76s/it]

SangbumChoi commented 1 month ago

@hadariru

"I found out that self.label_names and loss_without_labels when it is evaluating is [] and False"

Usually label_names for bounding boxes should be 'labels', but it depends on your dataset.
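To make the role of label_names concrete, here is a simplified sketch of the check that, as far as I understand it, decides whether a loss is computed during evaluation. The function `has_labels` and the toy batches are hypothetical and written for illustration; this is a paraphrase under that assumption, not the library's code.

```python
def has_labels(label_names, inputs):
    """Simplified paraphrase (an assumption): a batch counts as labelled only
    if label_names is non-empty and every named key is present in the batch."""
    if len(label_names) == 0:
        return False
    return all(inputs.get(name) is not None for name in label_names)


# Toy batches standing in for the collator output.
# With suffix=None (train=False in the collator above) there is no "labels" key.
batch_without_suffix = {"input_ids": [[1, 2, 3]], "pixel_values": [[0.0]]}
batch_with_suffix = {**batch_without_suffix, "labels": [[-100, 2, 3]]}

print(has_labels([], batch_with_suffix))             # False
print(has_labels(["labels"], batch_without_suffix))  # False
print(has_labels(["labels"], batch_with_suffix))     # True
```

Under this reading, two things would have to hold for the evaluation loss to exist: label_names must contain 'labels' (it can be passed explicitly as `label_names=["labels"]` in TrainingArguments), and the eval collator must actually produce a labels entry, which the collator in this issue does not do when train=False.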

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
