Describe the bug
I'm trying to train an Austrian TTS model with VITS, but despite trying various configurations I haven't been able to get training to start properly. It ran once with 5% of my data (around 7.5 hours) at a batch size of 2, but I suspect that isn't enough for good output quality.
import os

from TTS.tts.configs.vits_config import VitsConfig

# define model config (output_path and dataset_config are defined earlier in train.py)
config = VitsConfig(
    batch_size=16,
    eval_batch_size=8,
    batch_group_size=1,
    num_loader_workers=0,
    num_eval_loader_workers=32,
    run_eval=True,
    test_delay_epochs=-1,
    epochs=1000,
    text_cleaner="basic_german_cleaners",
    use_phonemes=True,
    phoneme_language="de",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache_tts"),
    compute_input_seq_cache=True,
    precompute_num_workers=12,
    print_step=20,
    print_eval=True,
    mixed_precision=True,
    output_path=output_path,
    datasets=[dataset_config],
    use_speaker_embedding=True,
    test_sentences=[
        "Hallo, wie geht es dir? Ich hoffe, du hast einen schönen Tag.",
        "Das ist ein Test. Wir überprüfen, ob alles wie erwartet funktioniert.",
        "Ich lerne gerade Programmierung. Es ist eine sehr nützliche Fähigkeit, die viele Türen öffnen kann.",
        "Die Sonne scheint heute. Es ist ein perfekter Tag, um draußen spazieren zu gehen und die Natur zu genießen.",
        "Ich mag Schokolade. Besonders dunkle Schokolade mit einem hohen Kakaoanteil ist mein Favorit."
    ],
    cudnn_enable=True,
    cudnn_benchmark=True,
    cudnn_deterministic=True,
)
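For reference, this config is wired into the Trainer the same way as in the standard Coqui TTS VITS recipe. The sketch below approximates that wiring rather than reproducing my exact train.py; it assumes dataset_config and output_path from above and that use_speaker_embedding=True is backed by a SpeakerManager built from the loaded samples, as in the multi-speaker recipe.

```python
from trainer import Trainer, TrainerArgs

from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits
from TTS.tts.utils.speakers import SpeakerManager
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor

# audio processor and tokenizer are derived from the config above
ap = AudioProcessor.init_from_config(config)
tokenizer, config = TTSTokenizer.init_from_config(config)

# split the dataset into train/eval samples
train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)

# use_speaker_embedding=True expects speaker IDs, collected here from the samples
speaker_manager = SpeakerManager()
speaker_manager.set_ids_from_data(train_samples + eval_samples, parse_key="speaker_name")

model = Vits(config, ap, tokenizer, speaker_manager=speaker_manager)

trainer = Trainer(
    TrainerArgs(),
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()
```

As far as I understand, trainer.distribute starts one process per visible GPU, so batch_size in the config is effectively a per-GPU value. With batch_size=16 the run dies with the error below: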
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 23.64 GiB of which 16.50 MiB is free. Including non-PyTorch memory, this process has 23.61 GiB memory in use. Of the allocated memory 22.53 GiB is allocated by PyTorch, and 482.79 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
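The message itself suggests setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. A way to try that together with my launch command would be the following; note it only addresses fragmentation and probably won't help if batch_size=16 simply doesn't fit on a 24 GiB card:

```bash
# allocator hint taken from the OOM message above; mitigates fragmentation only
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
CUDA_VISIBLE_DEVICES="0,1,2" python -m trainer.distribute --script train.py
```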
Any suggestions for improvement?
To Reproduce
CUDA_VISIBLE_DEVICES="0,1,2" python -m trainer.distribute --script train.py
Expected behavior
No response
Logs
No response
Environment
Additional context
No response