Training model constantly increases memory consumption

JamesBowerXanda commented 7 months ago

System Info

transformers version: 4.39.3
Platform: macOS-14.4-arm64-arm-64bit
Python version: 3.11.8
Huggingface_hub version: 0.22.1
Safetensors version: 0.4.2
Accelerate version: 0.28.0
Accelerate config: not found
PyTorch version (GPU?): 2.2.2 (False)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: True
Using distributed or parallel set-up in script?: False

Who can help?

No response

Information

[ ] The official example scripts
[X] My own modified scripts

Tasks

[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[X] My own task or dataset (give details below)

Reproduction

I am trying to finetune the SpeechT5ForTextToSpeech model on the "lj_speech" dataset. I am using the Seq2SeqTrainer class to do this. My configuration is:

training_args = Seq2SeqTrainingArguments(
        output_dir="./speecht5_lj_speech_most_common",  # change to a repo name of your choice
        per_device_train_batch_size=2,
        gradient_accumulation_steps=2,
        learning_rate=1e-5,
        warmup_steps=100,
        max_steps=15000,
        gradient_checkpointing=False,
        fp16=False,
        evaluation_strategy="steps",
        per_device_eval_batch_size=8,
        save_steps=500,
        eval_steps=500,
        load_best_model_at_end=True,
        greater_is_better=False,
        label_names=["labels"],
        push_to_hub=False,
    )

trainer = Seq2SeqTrainer(
        args=training_args,
        model=model,
        train_dataset=data["train"],
        eval_dataset=data["test"],
        data_collator=data_collator,
        tokenizer=processor.tokenizer,
    )

trainer.train()

For some reason the memory consumption is constantly increasing throughout the training run. It starts with a memory consumption of 27GB for the first few steps of training and by step 250 it has hit 49.16GB. No evaluations have been done to this point. It is my understanding that the memory footprint should not be constantly increasing after each step. Could anyone explain to me why this is happening.

Below is a full copy of the script:

from datasets import load_dataset, Audio
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech
import os
import torch
from speechbrain.inference.speaker import EncoderClassifier
import numpy as np
from dataclasses import dataclass
from typing import Any, Dict, List, Union
from transformers import Seq2SeqTrainingArguments
from transformers import Seq2SeqTrainer

def get_data():
    dataset = load_dataset("lj_speech")
    data = dataset["train"]

    return data

def extract_speaker_id(example):
    speaker_id = example["id"].split("-")[0]
    example["speaker_id"] = speaker_id
    return example

def extract_all_chars(batch):
    all_text = " ".join(batch["normalized_text"])
    vocab = list(set(all_text))
    return {"vocab": [vocab], "all_text": [all_text]}

def cleanup_text(inputs):

    replacements = [
        ('à', 'a'),
        ('â', 'a'),
        ('è', 'e'),
        ('ü', 'u'),
    ]

    for src, dst in replacements:
        inputs["normalized_text"] = inputs["normalized_text"].replace(src, dst)
    return inputs

def create_speaker_embedding(waveform, speaker_model):
    with torch.no_grad():
        speaker_embeddings = speaker_model.encode_batch(torch.tensor(waveform))
        speaker_embeddings = torch.nn.functional.normalize(speaker_embeddings, dim=2)
        speaker_embeddings = speaker_embeddings.squeeze().cpu().numpy()
    return speaker_embeddings

def add_speaker_embeddings(example, speaker_model):
    speaker_embeddings = create_speaker_embedding(example["audio"]["array"], speaker_model)
    example["speaker_embeddings"] = speaker_embeddings
    return example

def process_example(example, processor, speaker_embeddings_dict):
    example_p = processor(
        text=example["normalized_text"],
        audio_target = example["audio"]["array"],
        sampling_rate = example["audio"]["sampling_rate"],
    )
    example_p["labels"] = example_p["labels"][0]
    example_p["speaker_embeddings"] = speaker_embeddings_dict[example["speaker_id"]]
    return example_p

def is_not_too_long(input_ids):
    input_length = len(input_ids)
    return input_length < 500

@dataclass
class TTSDataCollatorWithPadding:
    processor: Any

    def __init__(self, processor, model):
        self.processor = processor
        self.model = model

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:

        input_ids = [{"input_ids": feature["input_ids"]} for feature in features]
        label_features = [{"input_values": feature["labels"]} for feature in features]
        speaker_features = [feature["speaker_embeddings"] for feature in features]

        # collate the inputs and targets into a batch
        batch = self.processor.pad(
            input_ids=input_ids,
            labels=label_features,
            return_tensors="pt",
        )        

        # replace padding with -100 to ignore loss correctly
        batch["labels"] = batch["labels"].masked_fill(
            batch.decoder_attention_mask.unsqueeze(-1).ne(1), -100
        )

        # not used during fine-tuning
        del batch["decoder_attention_mask"]

        # round down target lengths to multiple of reduction factor
        if self.model.config.reduction_factor > 1:
            target_lengths = torch.tensor([
                len(feature["input_values"]) for feature in label_features
            ])
            target_lengths = target_lengths.new([
                length - length % self.model.config.reduction_factor for length in target_lengths
            ])
            max_length = max(target_lengths)
            batch["labels"] = batch["labels"][:, :max_length]

        # also add in the speaker embeddings
        batch["speaker_embeddings"] = torch.tensor(speaker_features)

        return batch

def main():
    data = get_data()
    data = data.map(extract_speaker_id)
    data = data.cast_column("audio", Audio(sampling_rate=16000))

    processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
    model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
    tokenizer = processor.tokenizer

    vocabs = data.map(
        extract_all_chars, 
        batched=True, 
        batch_size=-1, 
        keep_in_memory=True, 
        remove_columns=data.column_names,
    )

    dataset_vocab = set(vocabs["vocab"][0])
    tokenizer_vocab = {k for k,_ in tokenizer.get_vocab().items()}
    dataset_vocab - tokenizer_vocab

    data = data.map(cleanup_text)

    spk_model_name = "speechbrain/spkrec-xvect-voxceleb"
    device = "cuda" if torch.cuda.is_available() else "cpu"
    speaker_model = EncoderClassifier.from_hparams(
        source=spk_model_name, 
        run_opts={"device": device}, 
        savedir=os.path.join("/tmp", spk_model_name)
    )

    ids_with_audio = data.select_columns(["speaker_id", "audio"])
    df = ids_with_audio.map(lambda example: add_speaker_embeddings(example, speaker_model)).to_pandas()

    speaker_embeddings_dict = {
        speaker_id: np.empty((0, 512)) for speaker_id in df["speaker_id"].unique()
    }

    for speaker_id, speaker_embedding in zip(df["speaker_id"], df["speaker_embeddings"]):
        speaker_embeddings_dict[speaker_id] = np.concatenate(
            [speaker_embeddings_dict[speaker_id], np.expand_dims(speaker_embedding,axis=0)], axis=0
        )

    for speaker_id, speaker_embedding in speaker_embeddings_dict.items():
        speaker_embeddings_dict[speaker_id] = np.mean(speaker_embedding, axis=0)

    data = data.map(
        lambda example: process_example(example, processor, speaker_embeddings_dict), remove_columns=data.column_names,
    )

    data = data.filter(is_not_too_long, input_columns=["input_ids"])
    data = data.train_test_split(test_size=0.01)

    data_collator = TTSDataCollatorWithPadding(processor=processor, model=model)

    training_args = Seq2SeqTrainingArguments(
        output_dir="./speecht5_lj_speech_most_common",  # change to a repo name of your choice
        per_device_train_batch_size=2,
        gradient_accumulation_steps=2,
        learning_rate=1e-5,
        warmup_steps=100,
        max_steps=15000,
        gradient_checkpointing=False,
        fp16=False,
        evaluation_strategy="steps",
        per_device_eval_batch_size=8,
        save_steps=500,
        eval_steps=500,
        load_best_model_at_end=True,
        greater_is_better=False,
        label_names=["labels"],
        push_to_hub=False,
    )

    trainer = Seq2SeqTrainer(
        args=training_args,
        model=model,
        train_dataset=data["train"],
        eval_dataset=data["test"],
        data_collator=data_collator,
        tokenizer=processor.tokenizer,
    )

    trainer.evaluate()

    trainer.train()

    return model

if __name__=="__main__":
    main()

Expected behavior

Memory consumption to be approximately constant during the training process.

amyeroberts commented 7 months ago

Hi @JamesBowerXanda, thanks for raising this issue and providing a script and the environment info.

Could you provide some more information about the memory consumption? Ideally we'd see some kind of graph with it changing overtime.

cc @muellerzr @pacman100 @ylacombe @sanchit-gandhi

JamesBowerXanda commented 7 months ago

Hi, it may be that I was a bit hasty raising this. I was using the Activity Monitor on the mac to check the memory usage and whilst it has gone up to 73GB for the process the script does seem to be still running and their is only 32 GB of Physical memory on the machine so it might just be that I am misunderstanding something in the activity monitor or there is something strange going on in the process memory consumption calculation.

github-actions[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

xiyang-aads-lilly commented 3 weeks ago

Screenshot 2024-10-14 at 9 19 12 AM

@JamesBowerXanda Do you still have the problem? Did you figure out why? cc: @amyeroberts

Please check the screenshot above, I am using https://github.com/huggingface/alignment-handbook to train LLM, what I observed from wandb log is that it keeps increasing system memory usage and when it reaches 96% ish, the training will crash.

JamesBowerXanda commented 3 weeks ago

@xiyang-aads-lilly I thought it was something bugging out with activity monitor but it turned out it wasn't. I actually opened another issue here but still haven't gotten to the bottom of the issue.

xiyang-aads-lilly commented 3 weeks ago

@xiyang-aads-lilly I thought it was something bugging out with activity monitor but it turned out it wasn't. I actually opened another issue here but still haven't gotten to the bottom of the issue.

Thanks for the reply and point me to the latest issue!

I saw the suggestions on torch_empty_cache_steps, I will give a try on that first.

JamesBowerXanda commented 3 weeks ago

@xiyang-aads-lilly I don't see how it could hurt to add it in as well. I am completely stumped on what to do about it. I get the impression that it is going to be attributed to a lower level problem with Pytorch on mps though so won't be fixed through this forum. The issue is we are not sure which part of the trainer is causing the issue so it is hard to raise an issue on torch.

huggingface / transformers