huggingface / peft

🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
https://huggingface.co/docs/peft
Apache License 2.0

Example semantic_segmentation_peft_lora - better results #1927

Open ibayer opened 1 month ago

ibayer commented 1 month ago

System Info

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 24.04 LTS
Release:    24.04
Codename:   noble
$python --version
Python 3.11.9

Who can help?

@BenjaminBossan

Information

Tasks

Reproduction

Modestly increase the number of samples in the official semantic_segmentation_peft_lora.ipynb example:

from datasets import load_dataset
# ds = load_dataset("scene_parse_150", split="train[:150]")
ds = load_dataset("scene_parse_150", split="train[:1500]")
RuntimeError: [enforce fail at alloc_cpu.cpp:117] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 23592960000 bytes. Error code 12 (Cannot allocate memory)

full log:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[20], line 27
      3 training_args = TrainingArguments(
      4     output_dir=f"{model_name}-scene-parse-150-lora",
      5     learning_rate=5e-4,
   (...)
     16     label_names=["labels"],
     17 )
     19 trainer = Trainer(
     20     model=lora_model,
     21     args=training_args,
   (...)
     24     compute_metrics=compute_metrics,
     25 )
---> 27 trainer.train()

File ~/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/transformers/trainer.py:1932, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1930         hf_hub_utils.enable_progress_bars()
   1931 else:
-> 1932     return inner_training_loop(
   1933         args=args,
   1934         resume_from_checkpoint=resume_from_checkpoint,
   1935         trial=trial,
   1936         ignore_keys_for_eval=ignore_keys_for_eval,
   1937     )

File ~/xx/lib/python3.11/site-packages/transformers/trainer.py:2365, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   2362     self.control.should_training_stop = True
   2364 self.control = self.callback_handler.on_epoch_end(args, self.state, self.control)
-> 2365 self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
   2367 if DebugOption.TPU_METRICS_DEBUG in self.args.debug:
   2368     if is_torch_xla_available():
   2369         # tpu-comment: Logging debug metrics for PyTorch/XLA (compile, execute times, ops, etc.)

File ~/xx/lib/python3.11/site-packages/transformers/trainer.py:2793, in Trainer._maybe_log_save_evaluate(self, tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
   2791 metrics = None
   2792 if self.control.should_evaluate:
-> 2793     metrics = self._evaluate(trial, ignore_keys_for_eval)
   2795 if self.control.should_save:
   2796     self._save_checkpoint(model, trial, metrics=metrics)

File ~/xx/lib/python3.11/site-packages/transformers/trainer.py:2750, in Trainer._evaluate(self, trial, ignore_keys_for_eval, skip_scheduler)
   2749 def _evaluate(self, trial, ignore_keys_for_eval, skip_scheduler=False):
-> 2750     metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
   2751     self._report_to_hp_search(trial, self.state.global_step, metrics)
   2753     # Run delayed LR scheduler now that metrics are populated

File ~/xx/lib/python3.11/site-packages/transformers/trainer.py:3641, in Trainer.evaluate(self, eval_dataset, ignore_keys, metric_key_prefix)
   3638 start_time = time.time()
   3640 eval_loop = self.prediction_loop if self.args.use_legacy_prediction_loop else self.evaluation_loop
-> 3641 output = eval_loop(
   3642     eval_dataloader,
   3643     description="Evaluation",
   3644     # No point gathering the predictions if there are no metrics, otherwise we defer to
   3645     # self.args.prediction_loss_only
   3646     prediction_loss_only=True if self.compute_metrics is None else None,
   3647     ignore_keys=ignore_keys,
   3648     metric_key_prefix=metric_key_prefix,
   3649 )
   3651 total_batch_size = self.args.eval_batch_size * self.args.world_size
   3652 if f"{metric_key_prefix}_jit_compilation_time" in output.metrics:

File ~/xx/lib/python3.11/site-packages/transformers/trainer.py:3923, in Trainer.evaluation_loop(self, dataloader, description, prediction_loss_only, ignore_keys, metric_key_prefix)
   3919         metrics = self.compute_metrics(
   3920             EvalPrediction(predictions=all_preds, label_ids=all_labels, inputs=all_inputs)
   3921         )
   3922     else:
-> 3923         metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
   3924 elif metrics is None:
   3925     metrics = {}

Cell In[15], line 13, in compute_metrics(eval_pred)
     11 logits_tensor = torch.from_numpy(logits)
     12 # scale the logits to the size of the label
---> 13 logits_tensor = nn.functional.interpolate(
     14     logits_tensor,
     15     size=labels.shape[-2:],
     16     mode="bilinear",
     17     align_corners=False,
     18 ).argmax(dim=1)
     20 pred_labels = logits_tensor.detach().cpu().numpy()
     21 # currently using _compute instead of compute
     22 # see this issue for more info: https://github.com/huggingface/evaluate/pull/328#issuecomment-1286866576

File ~/xx/lib/python3.11/site-packages/torch/nn/functional.py:4065, in interpolate(input, size, scale_factor, mode, align_corners, recompute_scale_factor, antialias)
   4059         if torch.are_deterministic_algorithms_enabled() and input.is_cuda:
   4060             # Use slow decomp whose backward will be in terms of index_put
   4061             # importlib is required because the import cannot be top level
   4062             # (cycle) and cannot be nested (TS doesn't support)
   4063             return importlib.import_module('torch._decomp.decompositions')._upsample_linear_vec(
   4064                 input, output_size, align_corners, scale_factors)
-> 4065     return torch._C._nn.upsample_bilinear2d(input, output_size, align_corners, scale_factors)
   4066 if input.dim() == 5 and mode == "trilinear":
   4067     assert align_corners is not None

RuntimeError: [enforce fail at alloc_cpu.cpp:117] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 23592960000 bytes. Error code 12 (Cannot allocate memory)

Note, I also had to disable jitter

def train_transforms(example_batch):
    # images = [jitter(handle_grayscale_image(x)) for x in example_batch["image"]]
    images = [handle_grayscale_image(x) for x in example_batch["image"]]
    labels = [x for x in example_batch["annotation"]]
    inputs = image_processor(images, labels)
    return inputs

and restrict to a single GPU to get the example nb to work with the default configuration.
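
For reference, restricting to a single GPU can be done roughly like this (a minimal sketch; the environment variable has to be set before torch is first imported):

import os

# hide all but the first GPU; must run before the first torch/transformers import
os.environ["CUDA_VISIBLE_DEVICES"] = "0"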

Expected behavior

The eval memory requirement should be roughly independent of the training set size and should not require an excessive amount of RAM for a small subset of the dataset.

ibayer commented 1 month ago

Thanks for this great library!

I'm currently exploring whether PEFT allows me to use LoRA for semantic segmentation.

The example nb states:

The results are definitely not as expected and as mentioned above, this example is not meant to provide a state-of-the-art model. It exists to familiarize you with the end-to-end workflow.

and indeed the provided settings yield visual and metric results that are so far off from the expected benchmark results that it's hard to judge whether the model is doing any meaningful learning.

As a first step to improve the results, I'm following the suggestions from the nb:

Here are some things that you can try to get better results:

  • Increase the number of training samples.
  • Try a larger SegFormer model variant (know about the available model variants here).
  • Try different values for the arguments available in LoraConfig.
  • Tune the learning rate and batch size.

My previous comment describes the first problem I ran into when increasing the number of training examples. Any suggestion on how to make the helper function compute_metrics() more memory efficient, or to reduce the memory consumption in some other way, is very much appreciated.

BenjaminBossan commented 1 month ago

Hi Immanuel, thanks for reporting the issue.

So I think the issue is not strictly with PEFT, but rather that the custom compute_metrics function defined in the notebook does not perform any batching, as you correctly identified. Therefore, the whole eval dataset is processed at once, which can easily result in an OOM error.

I don't know much about the evaluate package, but I think this updated version should work without OOM:

def compute_metrics(eval_pred, batch_size=16):
    logits, labels = eval_pred
    total_metrics = {}
    with torch.no_grad():
        for i in range(0, len(logits), batch_size):
            logits_tensor = torch.from_numpy(logits[i:i+batch_size])
            # scale the logits to the size of the label
            logits_tensor = nn.functional.interpolate(
                logits_tensor,
                size=labels.shape[-2:],
                mode="bilinear",
                align_corners=False,
            ).argmax(dim=1)

            pred_labels = logits_tensor.detach().cpu().numpy()
            # currently using _compute instead of compute
            # see this issue for more info:
            metrics = metric._compute(
                predictions=pred_labels,
                references=labels[i:i+batch_size],  # use the label slice matching this batch of predictions
                num_labels=len(id2label),
                ignore_index=0,
                reduce_labels=image_processor.do_reduce_labels,
            )

            # add per category metrics as individual key-value pairs
            per_category_accuracy = metrics.pop("per_category_accuracy").tolist()
            per_category_iou = metrics.pop("per_category_iou").tolist()

            metrics.update({f"accuracy_{id2label[i]}": v for i, v in enumerate(per_category_accuracy)})
            metrics.update({f"iou_{id2label[i]}": v for i, v in enumerate(per_category_iou)})

            for k, v in metrics.items():
                total_metrics[k] = total_metrics.get(k, 0) + v

    # average over batches; the last batch can be smaller, so a weighted average (np.average) would be more precise
    num_batches = (len(logits) + batch_size - 1) // batch_size
    total_metrics = {k: v / num_batches for k, v in total_metrics.items()}
    return total_metrics

Note, I also had to disable jitter

What problem did you encounter?

and restrict to a single GPU to get the example nb to work with the default configuration.

What was the error, did it happen to be

RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR

ibayer commented 1 month ago

What was the error, did it happen to be

RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR

No, it was a bit more obscure (see below), but I have only 2 GPUs so I'm losing less than 2x. :)

this works: flags + single GPU

no flags + 2 GPUs fails with:

File ~/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/accelerate/state.py:292, in PartialState.__init__(self, cpu, **kwargs)
    290     if self.device.type == "cuda" and not check_cuda_p2p_ib_support():
    291         if "NCCL_P2P_DISABLE" not in os.environ or "NCCL_IB_DISABLE" not in os.environ:
--> 292             raise NotImplementedError(
    293                 "Using RTX 4000 series doesn't support faster communication broadband via P2P or IB. "
    294                 'Please set `NCCL_P2P_DISABLE="1"` and `NCCL_IB_DISABLE="1" or use `accelerate launch` which '
    295                 "will do this automatically."
    296             )
    297 # Important: This should be the *only* code outside of `self.initialized!`
    298 self.fork_launched = parse_flag_from_env("FORK_LAUNCHED", 0)

NotImplementedError: Using RTX 4000 series doesn't support faster communication broadband via P2P or IB. Please set `NCCL_P2P_DISABLE="1"` and `NCCL_IB_DISABLE="1" or use `accelerate launch` which will do this automatically.

flags + 2 GPUs fails with:

Click me ```bash --------------------------------------------------------------------------- RuntimeError Traceback (most recent call last) Cell In[21], line 27 3 training_args = TrainingArguments( 4 output_dir=f"{model_name}-scene-parse-150-lora", 5 learning_rate=5e-4, (...) 16 label_names=["labels"], 17 ) 19 trainer = Trainer( 20 model=lora_model, 21 args=training_args, (...) 24 compute_metrics=compute_metrics, 25 ) ---> 27 trainer.train() File ~/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/transformers/trainer.py:1932, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs) 1930 hf_hub_utils.enable_progress_bars() 1931 else: -> 1932 return inner_training_loop( 1933 args=args, 1934 resume_from_checkpoint=resume_from_checkpoint, 1935 trial=trial, 1936 ignore_keys_for_eval=ignore_keys_for_eval, 1937 ) File ~/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/transformers/trainer.py:2268, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval) 2265 self.control = self.callback_handler.on_step_begin(args, self.state, self.control) 2267 with self.accelerator.accumulate(model): -> 2268 tr_loss_step = self.training_step(model, inputs) 2270 if ( 2271 args.logging_nan_inf_filter 2272 and not is_torch_xla_available() 2273 and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step)) 2274 ): 2275 # if loss is nan or inf simply add the average of previous logged losses 2276 tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged) File ~/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/transformers/trainer.py:3307, in Trainer.training_step(self, model, inputs) 3304 return loss_mb.reduce_mean().detach().to(self.args.device) 3306 with self.compute_loss_context_manager(): -> 3307 loss = self.compute_loss(model, inputs) 3309 del inputs 3311 kwargs = {} File ~/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/transformers/trainer.py:3338, in Trainer.compute_loss(self, model, inputs, return_outputs) 3336 else: 3337 labels = None -> 3338 outputs = model(**inputs) 3339 # Save past state if it exists 3340 # TODO: this needs to be fixed and made cleaner later. 3341 if self.args.past_index >= 0: File ~/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs) 1530 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc] 1531 else: -> 1532 return self._call_impl(*args, **kwargs) File ~/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs) 1536 # If we don't have any hooks, we want to skip the rest of the logic in 1537 # this function, and just call forward. 
1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks 1539 or _global_backward_pre_hooks or _global_backward_hooks 1540 or _global_forward_hooks or _global_forward_pre_hooks): -> 1541 return forward_call(*args, **kwargs) 1543 try: 1544 result = None File ~/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/torch/nn/parallel/data_parallel.py:185, in DataParallel.forward(self, *inputs, **kwargs) 183 return self.module(*inputs[0], **module_kwargs[0]) 184 replicas = self.replicate(self.module, self.device_ids[:len(inputs)]) --> 185 outputs = self.parallel_apply(replicas, inputs, module_kwargs) 186 return self.gather(outputs, self.output_device) File ~/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/torch/nn/parallel/data_parallel.py:200, in DataParallel.parallel_apply(self, replicas, inputs, kwargs) 199 def parallel_apply(self, replicas: Sequence[T], inputs: Sequence[Any], kwargs: Any) -> List[Any]: --> 200 return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)]) File ~/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/torch/nn/parallel/parallel_apply.py:108, in parallel_apply(modules, inputs, kwargs_tup, devices) 106 output = results[i] 107 if isinstance(output, ExceptionWrapper): --> 108 output.reraise() 109 outputs.append(output) 110 return outputs File ~/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/torch/_utils.py:705, in ExceptionWrapper.reraise(self) 701 except TypeError: 702 # If the exception takes multiple arguments, don't try to 703 # instantiate since we don't know how to 704 raise RuntimeError(msg) from None --> 705 raise exception RuntimeError: Caught RuntimeError in replica 0 on device 0. Original Traceback (most recent call last): File "/home/gpu/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in _worker output = module(*input, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/gpu/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl return self._call_impl(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/gpu/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/gpu/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/peft/peft_model.py", line 734, in forward return self.get_base_model()(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/gpu/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl return self._call_impl(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/gpu/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/gpu/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/transformers/models/segformer/modeling_segformer.py", line 799, in forward logits = self.decode_head(encoder_hidden_states) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/gpu/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl return self._call_impl(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File 
"/home/gpu/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/gpu/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/peft/utils/other.py", line 262, in forward return self.modules_to_save[self.active_adapter](*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/gpu/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl return self._call_impl(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/gpu/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/gpu/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/transformers/models/segformer/modeling_segformer.py", line 722, in forward hidden_states = self.linear_fuse(torch.cat(all_hidden_states[::-1], dim=1)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/gpu/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl return self._call_impl(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/gpu/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/gpu/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 460, in forward return self._conv_forward(input, self.weight, self.bias) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/gpu/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward return F.conv2d(input, weight, bias, self.stride, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RuntimeError: CUDA error: misaligned address CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. ```

edit: fixed the error log for the last case, flags + 2 GPUs (the flags need to be set before the first import to take effect...)
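
For reference, a minimal sketch of setting the flags inside the notebook (they have to come before the first torch/transformers/accelerate import):

import os

# disable P2P/IB communication for the RTX 4000 series, as the error message suggests
os.environ["NCCL_P2P_DISABLE"] = "1"
os.environ["NCCL_IB_DISABLE"] = "1"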

ibayer commented 1 month ago

I don't know much about the evaluate package, but I think this updated version should work without OOM:

Confirmed, this fixed OOM. Thanks!

ibayer commented 1 month ago
Note, I also had to disable jitter

What problem did you encounter?

I can't reproduce it anymore; I might have been unlucky with my previous torchvision version. I suspected missing range clipping. Anyway, it's gone and hopefully it won't come back. :)

ibayer commented 1 month ago

For the full dataset:

from datasets import load_dataset
# ds = load_dataset("scene_parse_150", split="train[:150]")
ds = load_dataset("scene_parse_150", split="train")

Eval still seems to eat too much memory, but on the GPU this time:

OutOfMemoryError: CUDA out of memory. Tried to allocate 8.08 GiB. GPU

Click me - full log ```bash --------------------------------------------------------------------------- OutOfMemoryError Traceback (most recent call last) Cell In[22], line 29 3 training_args = TrainingArguments( 4 output_dir=f"{model_name}-scene-parse-150-lora", 5 # learning_rate=5e-4, (...) 18 label_names=["labels"], 19 ) 21 trainer = Trainer( 22 model=lora_model, 23 args=training_args, (...) 26 compute_metrics=compute_metrics, 27 ) ---> 29 trainer.train() File ~/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/transformers/trainer.py:1932, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs) 1930 hf_hub_utils.enable_progress_bars() 1931 else: -> 1932 return inner_training_loop( 1933 args=args, 1934 resume_from_checkpoint=resume_from_checkpoint, 1935 trial=trial, 1936 ignore_keys_for_eval=ignore_keys_for_eval, 1937 ) File ~/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/transformers/trainer.py:2365, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval) 2362 self.control.should_training_stop = True 2364 self.control = self.callback_handler.on_epoch_end(args, self.state, self.control) -> 2365 self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval) 2367 if DebugOption.TPU_METRICS_DEBUG in self.args.debug: 2368 if is_torch_xla_available(): 2369 # tpu-comment: Logging debug metrics for PyTorch/XLA (compile, execute times, ops, etc.) File ~/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/transformers/trainer.py:2793, in Trainer._maybe_log_save_evaluate(self, tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval) 2791 metrics = None 2792 if self.control.should_evaluate: -> 2793 metrics = self._evaluate(trial, ignore_keys_for_eval) 2795 if self.control.should_save: 2796 self._save_checkpoint(model, trial, metrics=metrics) File ~/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/transformers/trainer.py:2750, in Trainer._evaluate(self, trial, ignore_keys_for_eval, skip_scheduler) 2749 def _evaluate(self, trial, ignore_keys_for_eval, skip_scheduler=False): -> 2750 metrics = self.evaluate(ignore_keys=ignore_keys_for_eval) 2751 self._report_to_hp_search(trial, self.state.global_step, metrics) 2753 # Run delayed LR scheduler now that metrics are populated File ~/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/transformers/trainer.py:3641, in Trainer.evaluate(self, eval_dataset, ignore_keys, metric_key_prefix) 3638 start_time = time.time() 3640 eval_loop = self.prediction_loop if self.args.use_legacy_prediction_loop else self.evaluation_loop -> 3641 output = eval_loop( 3642 eval_dataloader, 3643 description="Evaluation", 3644 # No point gathering the predictions if there are no metrics, otherwise we defer to 3645 # self.args.prediction_loss_only 3646 prediction_loss_only=True if self.compute_metrics is None else None, 3647 ignore_keys=ignore_keys, 3648 metric_key_prefix=metric_key_prefix, 3649 ) 3651 total_batch_size = self.args.eval_batch_size * self.args.world_size 3652 if f"{metric_key_prefix}_jit_compilation_time" in output.metrics: File ~/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/transformers/trainer.py:3848, in Trainer.evaluation_loop(self, dataloader, description, prediction_loss_only, ignore_keys, metric_key_prefix) 3846 logits = self.gather_function((logits)) 3847 if not self.args.batch_eval_metrics or description == "Prediction": -> 3848 
all_preds.add(logits) 3849 if labels is not None: 3850 labels = self.accelerator.pad_across_processes(labels, dim=1, pad_index=-100) File ~/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/transformers/trainer_pt_utils.py:327, in EvalLoopContainer.add(self, tensors) 325 self.tensors = tensors if self.do_nested_concat else [tensors] 326 elif self.do_nested_concat: --> 327 self.tensors = nested_concat(self.tensors, tensors, padding_index=self.padding_index) 328 else: 329 self.tensors.append(tensors) File ~/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/transformers/trainer_pt_utils.py:141, in nested_concat(tensors, new_tensors, padding_index) 139 return type(tensors)(nested_concat(t, n, padding_index=padding_index) for t, n in zip(tensors, new_tensors)) 140 elif isinstance(tensors, torch.Tensor): --> 141 return torch_pad_and_concatenate(tensors, new_tensors, padding_index=padding_index) 142 elif isinstance(tensors, Mapping): 143 return type(tensors)( 144 {k: nested_concat(t, new_tensors[k], padding_index=padding_index) for k, t in tensors.items()} 145 ) File ~/micromamba/envs/arcd2-libcheck-peft/lib/python3.11/site-packages/transformers/trainer_pt_utils.py:99, in torch_pad_and_concatenate(tensor1, tensor2, padding_index) 96 tensor2 = atleast_1d(tensor2) 98 if len(tensor1.shape) == 1 or tensor1.shape[1] == tensor2.shape[1]: ---> 99 return torch.cat((tensor1, tensor2), dim=0) 101 # Let's figure out the new shape 102 new_shape = (tensor1.shape[0] + tensor2.shape[0], max(tensor1.shape[1], tensor2.shape[1])) + tensor1.shape[2:] OutOfMemoryError: CUDA out of memory. Tried to allocate 8.08 GiB. GPU ```
BenjaminBossan commented 1 month ago
NotImplementedError: Using RTX 4000 series doesn't support faster communication broadband via P2P or IB. Please set `NCCL_P2P_DISABLE="1"` and `NCCL_IB_DISABLE="1" or use `accelerate launch` which will do this automatically.

Ah yes, this is an unfortunate annoyance with the 4000 series, see here: https://github.com/huggingface/accelerate/pull/2195.

Eval still seems to eat too much memory, but on the GPU this time

Okay, so it seems that Trainer still collects all predictions in a giant tensor, even after changing compute_metrics to use batching. There is a "correct" way of applying batching though. For this, you need to pass batch_eval_metrics=True to the TrainingArguments. We also need to rewrite compute_metrics:

class MyMetrics:
    def __init__(self):
        self.total_metrics = {}

    def __call__(self, eval_pred, compute_result=False):
        logits_tensor, labels = eval_pred
        with torch.no_grad():
            # scale the logits to the size of the label
            logits_tensor = nn.functional.interpolate(
                logits_tensor,
                size=labels.shape[-2:],
                mode="bilinear",
                align_corners=False,
            ).argmax(dim=1)

            pred_labels = logits_tensor.detach().cpu().numpy()
            # currently using _compute instead of compute
            # see this issue for more info:
            metrics = metric._compute(
                predictions=pred_labels,
                references=labels.detach().cpu().numpy(),
                num_labels=len(id2label),
                ignore_index=0,
                reduce_labels=image_processor.do_reduce_labels,
            )

            # add per category metrics as individual key-value pairs
            per_category_accuracy = metrics.pop("per_category_accuracy").tolist()
            per_category_iou = metrics.pop("per_category_iou").tolist()

            metrics.update({f"accuracy_{id2label[i]}": v for i, v in enumerate(per_category_accuracy)})
            metrics.update({f"iou_{id2label[i]}": v for i, v in enumerate(per_category_iou)})

            for k, v in metrics.items():
                self.total_metrics[k] = self.total_metrics.get(k, 0) + v

        result = -1
        if compute_result:
            result = {k: v / len(logits_tensor) for k, v in self.total_metrics.items()}
            self.total_metrics.clear()
        return result

which we then pass to Trainer as compute_metrics=MyMetrics(). For me, this resolves the memory issue. (IMHO, skorch handles this in a more user-friendly way.)
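
For completeness, the wiring would look roughly like this (a sketch; the remaining TrainingArguments and the dataset variable names train_ds/test_ds are assumed to be the ones from the notebook):

training_args = TrainingArguments(
    output_dir=f"{model_name}-scene-parse-150-lora",
    label_names=["labels"],
    batch_eval_metrics=True,  # hand the metric callable one batch at a time
    # ... the other arguments stay as in the notebook ...
)

trainer = Trainer(
    model=lora_model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    compute_metrics=MyMetrics(),  # stateful callable instead of a plain function
)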

ibayer commented 1 month ago

Thanks, I can confirm that the OOM error is gone. Do you get all nan values as well? Also, I guess the mean IoU is way too high and should increase, not decrease.

ibayer commented 1 month ago

Ah yes, this is an unfortunate annoyance with the 4000 series, see here: https://github.com/huggingface/accelerate/pull/2195.

On RTX 3090+, InfiniBand and peer-to-peer communication were removed (so the entirety of the 4000 series and beyond).

I'm not sure what the expected behavior is. Does this only mean the 4000 series can't use optimal communication, or that they can't communicate efficiently at all and parallel processing is therefore completely disabled?

BenjaminBossan commented 1 month ago

I figured out why the scores are weird. The issue is that we currently calculate the scores on a super small batch size, i.e. most labels are not present in each batch, resulting in nans. The correct way to do the evaluation without hogging memory is probably to pull the data to the CPU and store it there, then concatenate it at the end and calculate the metrics on all eval samples. I updated the code to do this and now the scores are much more reasonable:

class MyMetrics:
    def __init__(self):
        #self.total_metrics = {}
        self.preds = []
        self.labels = []

    def __call__(self, eval_pred, compute_result=False):
        logits_tensor, labels = eval_pred

        logits_tensor = nn.functional.interpolate(
            logits_tensor,
            size=labels.shape[-2:],
            mode="bilinear",
            align_corners=False,
        ).argmax(dim=1)

        self.preds.append(logits_tensor.cpu().detach().numpy())
        self.labels.append(labels.cpu().detach().numpy())
        if not compute_result:
            return

        preds = np.concatenate(self.preds)
        labels = np.concatenate(self.labels)
        metrics = metric._compute(
            predictions=preds,
            references=labels,
            num_labels=len(id2label),
            ignore_index=0,
            reduce_labels=image_processor.do_reduce_labels,
        )

        # add per category metrics as individual key-value pairs
        per_category_accuracy = metrics.pop("per_category_accuracy").tolist()
        per_category_iou = metrics.pop("per_category_iou").tolist()

        metrics.update({f"accuracy_{id2label[i]}": v for i, v in enumerate(per_category_accuracy)})
        metrics.update({f"iou_{id2label[i]}": v for i, v in enumerate(per_category_iou)})
        # reset the stored batches so the next evaluation run starts fresh
        self.preds.clear()
        self.labels.clear()
        return metrics

Overall, the scores still look quite weak, but I'm no expert in this task at all. Probably some hyper-parameter tuning could yield better results. I also tested full fine-tuning and it didn't do any better, so it's probably not a PEFT error.

One thing to note is that the chosen model is very small, so LoRA might not be the right approach here. For instance, for r=32, 13% of parameters are trainable. Normally, we see <1%. Perhaps you could try a more powerful model.
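
As a quick sanity check, the trainable-parameter fraction can be printed directly from the PEFT model:

# prints the number of trainable params, total params, and the trainable percentage
lora_model.print_trainable_parameters()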

Does this only mean 4000 series can't use optimal communication or that they can't communicate efficient at all and therefore parallel processing is completely disabled?

I don't really know how big of a difference this makes. You could try out this patched driver which enables P2P for 4090s:

https://github.com/tinygrad/open-gpu-kernel-modules

BenjaminBossan commented 1 month ago

I did some further testing with this notebook. One mistake I found is that the ignore_index should have been 255 and not 0, according to this doc. But even after changing that, the calculated scores were still super small.
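
Concretely, that means adjusting the metric call from the notebook like this (a sketch; everything else stays the same):

metrics = metric._compute(
    predictions=pred_labels,
    references=labels,
    num_labels=len(id2label),
    ignore_index=255,  # was 0; per the linked doc, 255 marks the pixels to ignore
    reduce_labels=image_processor.do_reduce_labels,
)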

Next I did something very basic, namely calculating accuracy the brute-force way: raw_accuracy = (preds[labels!=ignore_index]==labels[labels!=ignore_index]).mean(). When using fully pre-trained models, this gives values of 0.8+, much closer to expectations. I didn't dig deeper into why the evaluate metrics are so low; I suspect it does something where the accuracy/IoU for each class is calculated separately and then averaged, but the class-level scores are totally off because the "true negatives" (which are the vast majority) are all counted as incorrect. But that's just a guess.

I also found the origin of this notebook, which appears to be: https://huggingface.co/docs/transformers/v4.33.0/en/tasks/semantic_segmentation. Maybe this other notebook actually works better, but I haven't tested it: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/SegFormer/Fine_tune_SegFormer_on_custom_dataset.ipynb. In that notebook, they use metric.add_batch, which is probably the correct way to avoid memory explosion.
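
If someone wants to try that route, the add_batch pattern would look roughly like this (a sketch against the evaluate API, using the notebook's mean_iou metric and assuming per-batch pred_labels/labels arrays):

# per evaluation batch: let the metric accumulate instead of storing everything yourself
metric.add_batch(predictions=pred_labels, references=labels)

# once, after the last batch of the evaluation run:
metrics = metric.compute(
    num_labels=len(id2label),
    ignore_index=255,
    reduce_labels=image_processor.do_reduce_labels,
)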

Anyway, I attached the notebook that I used for these experiments (it still only uses a subset of the data); just rename the file ending to .ipynb. There, we can see that nvidia/mit-b0 can actually be trained with LoRA to improve its accuracy beyond random (though the loss is gigantic!). I also tested some fully pre-trained models, where there is no real improvement. But that's not surprising, as those models are already pre-trained on the same dataset.

semantic_segmentation_peft_lora.txt

ibayer commented 1 month ago

I didn't dig deeper into why the evaluate metrics are so low, I suspect that it does something where the accuracy/IOU for each class is calculated separately and then averaged, [...]

Another guess would be that the score is adjusted for the total number of classes without considering that some don't show up in the validation set. Anyway, I need to take a closer look at how the metric should be calculated.
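
To illustrate that guess with a toy example (hypothetical numbers, just a sketch): if only 3 of the 150 classes actually appear in the validation subset, averaging over all 150 drags the mean down drastically:

import numpy as np

# per-class IoU for the 3 classes present in the eval subset; the other 147 are absent
per_class_iou = np.array([0.8, 0.6, 0.7] + [np.nan] * 147)

print(np.nanmean(per_class_iou))            # ~0.70: absent classes ignored
print(np.nan_to_num(per_class_iou).mean())  # ~0.014: absent classes counted as 0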

There, we can see that nvidia/mit-b0 can actually be trained with LoRA to improve its accuracy to be better than random (though the loss is gigantic!).

For the fine-tuned model, it looks like mean IOU ~ 100 * vali_loss, but the difference for the LoRA model is much larger. Sure, we can't expect the relationship to be linear, but this does look suspicious.

I took a quick peek so far and will take a closer look in the coming days.

Thanks for all the detailed investigations!

ibayer commented 1 month ago

I spent some more time with the tutorial, but I currently have the impression that I need a setup that gives me clearer metric feedback before I can start to optimize hyper-parameters.

I think this issue could be closed, since we only uncovered issues with the example, not with the PEFT library itself.

Thanks for all the help!

BenjaminBossan commented 1 month ago

I agree, this requires either digging deeper into how evaluate calculates the metrics or just rolling a custom metric (or maybe use sklearn?).
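
If someone goes the sklearn route, a bare-bones mean IoU could look roughly like this (a sketch; preds and labels are per-pixel integer arrays, and the helper name is hypothetical):

import numpy as np
from sklearn.metrics import jaccard_score

def mean_iou_sklearn(preds, labels, ignore_index=255):
    preds = np.asarray(preds).ravel()
    labels = np.asarray(labels).ravel()
    keep = labels != ignore_index  # drop the ignored pixels
    # macro-averaged Jaccard index == mean IoU over the classes present in the data
    return jaccard_score(labels[keep], preds[keep], average="macro")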

We can keep this issue open until the stale bot closes it, as that gives it more visibility and maybe someone comes along with some good insights.

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.