huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Trainer memory leak for evaluation with `compute_metrics` with persistent workers #30943

Closed by qubvel 2 months ago

qubvel commented 4 months ago

System Info

Who can help?

@pacman100 @muellerzr

Information

Tasks

Reproduction

RAM usage increases after each evaluation stage. The training phase goes fine, but each evaluation stage with a compute_metrics function increases RAM usage. If compute_metrics is not provided, there is no leak. I used a very simple compute_metrics function that returns a constant:

def dummy_compute_metrics(evaluation_results):
    return {"loss": 1.0}
(Screenshot: RAM usage plot, 2024-05-21 at 16:31:23)

Here is the simplified script I am running to reproduce the memory leak. It performs just one training step per epoch and then goes to the validation stage.

import torch
import numpy as np

from datasets import load_dataset
from functools import partial

from transformers import (
    MaskFormerImageProcessor,
    MaskFormerForInstanceSegmentation,
    Trainer,
    TrainingArguments,
)

def transform_batch(examples, image_processor):

    batch = {
        "pixel_values": [],
        "mask_labels": [],
        "class_labels": [],
    }

    for pil_image, pil_annotation in zip(examples["image"], examples["annotation"]):

        image = np.array(pil_image)
        semantic_and_instance_masks = np.array(pil_annotation)[..., :2]
        instance_mask = semantic_and_instance_masks[..., 1]

        # Create a mapping from instance id to semantic id
        unique_semantic_id_instance_id_pairs = np.unique(semantic_and_instance_masks.reshape(-1, 2), axis=0)
        instance_id_to_semantic_id = {
            instance_id: semantic_id 
            for semantic_id, instance_id in unique_semantic_id_instance_id_pairs
        }

        # Apply the image processor transformations: resizing, rescaling, normalization
        model_inputs = image_processor(
            images=[image],
            segmentation_maps=[instance_mask],
            instance_id_to_semantic_id=instance_id_to_semantic_id,
            return_tensors="pt",
        )

        batch["pixel_values"].append(model_inputs.pixel_values[0])
        batch["mask_labels"].append(model_inputs.mask_labels[0])
        batch["class_labels"].append(model_inputs.class_labels[0])

    return batch

def collate_fn(examples):
    batch = {}
    batch["pixel_values"] = torch.stack([example["pixel_values"] for example in examples])
    batch["class_labels"] = [example["class_labels"] for example in examples]
    batch["mask_labels"] = [example["mask_labels"] for example in examples]
    if "pixel_mask" in examples[0]:
        batch["pixel_mask"] = torch.stack([example["pixel_mask"] for example in examples])
    return batch

def dummy_compute_metrics(evaluation_results):
    return {"loss": 1.0}

if __name__ == "__main__":

    checkpoint = "facebook/maskformer-swin-tiny-ade"
    dataset_name = "qubvel-hf/ade20k-mini"

    # Dataset
    dataset = load_dataset(dataset_name)
    label2id = dataset["train"][0]["semantic_class_to_id"]
    id2label = {v: k for k, v in label2id.items()}

    # Image transformations
    image_processor = MaskFormerImageProcessor.from_pretrained(
        checkpoint, size={"height": 256, "width": 256},
    )

    dataset_transform_batch = partial(transform_batch, image_processor=image_processor)
    dataset["train"] = dataset["train"].with_transform(dataset_transform_batch).select(range(8))
    dataset["validation"] = dataset["validation"].with_transform(dataset_transform_batch)

    # Model
    model = MaskFormerForInstanceSegmentation.from_pretrained(
        checkpoint,
        label2id=label2id,
        id2label=id2label,
        ignore_mismatched_sizes=True,
    )

    # Training
    args = TrainingArguments(
        output_dir="memory-leak-reproducing",
        num_train_epochs=40,
        do_train=True,
        do_eval=True,
        fp16=True,
        dataloader_num_workers=4,
        per_device_train_batch_size=8,
        dataloader_persistent_workers=True,
        remove_unused_columns=False,
        eval_do_concat_batches=False,
        eval_strategy="epoch",
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=dataset["train"],
        eval_dataset=dataset["validation"],
        data_collator=collate_fn,
        compute_metrics=dummy_compute_metrics,
    )

    trainer.train()

Expected behavior

Any ideas on how to identify what causes the memory leak and how it could be fixed?
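
As a starting point, here is a minimal sketch of one way to log process RSS after each evaluation pass (it assumes psutil is installed; MemoryLoggingCallback is just an illustrative name, not something provided by transformers):

import os

import psutil
from transformers import TrainerCallback

class MemoryLoggingCallback(TrainerCallback):
    # Illustrative helper: print the process RSS after every evaluation
    # so that growth across epochs is easy to spot.
    def __init__(self):
        self._process = psutil.Process(os.getpid())

    def on_evaluate(self, args, state, control, **kwargs):
        rss_mib = self._process.memory_info().rss / 1024**2
        print(f"[step {state.global_step}] RSS after evaluation: {rss_mib:.1f} MiB")

The callback can be attached with trainer.add_callback(MemoryLoggingCallback()) before calling trainer.train().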

muellerzr commented 3 months ago

cc @SunMarc @muellerzr

(The right muellerzr, which is how I missed this)

qubvel commented 3 months ago

@muellerzr After some investigation, I found that the leak happens if dataloader_persistent_workers=True; with dataloader_persistent_workers=False there is no leak (that parameter was missing from the reproduction script, I have added it now). It's probably not even related to compute_metrics.

(Screenshot: RAM usage plot, 2024-05-30 at 14:46:40)
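
To check whether this is independent of Trainer and compute_metrics, a minimal sketch of an isolation test with a plain PyTorch DataLoader (synthetic tensors roughly matching the reproduction shapes; names are only illustrative):

import os

import psutil
import torch
from torch.utils.data import DataLoader, TensorDataset

# 8 synthetic "images" of shape 3x256x256, loaded with persistent workers.
dataset = TensorDataset(torch.randn(8, 3, 256, 256))
loader = DataLoader(dataset, batch_size=8, num_workers=4, persistent_workers=True)

process = psutil.Process(os.getpid())
for epoch in range(40):
    for (batch,) in loader:
        pass  # iterate only; no model involved
    print(f"epoch {epoch}: RSS = {process.memory_info().rss / 1024**2:.1f} MiB")

If RSS also grows here, the leak would sit in the persistent-worker data loading path rather than in the Trainer's metric accumulation.
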
muellerzr commented 3 months ago

@qubvel are you setting pin_memory=True?

muellerzr commented 3 months ago

That’s usually required, and should’ve thrown a warning
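
For reference, a minimal sketch of the relevant TrainingArguments flags (dataloader_pin_memory already defaults to True; values here are only illustrative):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="memory-leak-reproducing",
    dataloader_num_workers=4,
    dataloader_persistent_workers=True,
    dataloader_pin_memory=True,  # default is True; shown explicitly for clarity
)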

qubvel commented 3 months ago

No, I didn't set pin_memory=True, and I didn't notice any warning, probably because the script is too verbose at startup.

github-actions[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.