idiap / coqui-ai-TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
https://coqui-tts.readthedocs.io
Mozilla Public License 2.0

[Bug] Text duplication while audio generation #72

Closed: olehsamoilenko closed this issue 2 months ago

olehsamoilenko commented 2 months ago

Describe the bug

Sometimes redundant, duplicated text is generated. I use the default model and config (no fine-tuning). It doesn't happen on every run, only occasionally (which is why I use a loop in the code example below). In my example, the words "is inspired by the dishes" are generated several times; check the audio: https://drive.google.com/file/d/1geLlH2im1bCLMpQcQV7QgRWU0c57eG4y/view

Could it be related to the fact that the word "menu" occurs twice in my text? The text is fairly long, but under 250 characters, so it should be acceptable. It may also be related to the issue discussed here: https://github.com/coqui-ai/TTS/issues/3516 and the potential fix here: https://github.com/coqui-ai/TTS/issues/3516#issuecomment-2050867261. Is this a bug, or am I using the library incorrectly?

CC: @eginhard @bensonbs

text = "on the menu that Sam our chef here has put together, Okay this is one of our best sellers isn't it Sam, Yes it is, So this is our scampi, So I grew up in a pub and a lot of the things on the menu is inspired by the dishes from"
print(len(text))
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

for i in range(10):
    tts.tts_to_file(text=text,
                    file_path=f"test_{i}.wav",
                    speaker_wav="./tests/data/ljspeech/wavs/LJ001-0001.wav",
                    language='en',
                    split_sentences=False)

To Reproduce

Run the code from the description. Some of the generated files may contain duplicated text.

Expected behavior

No redundant text is generated.

Logs

226
/Users/olehsamoilenko/coqui-ai-TTS/TTS/tts/layers/xtts/xtts_manager.py:6: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  self.speakers = torch.load(speaker_file_path)
/opt/anaconda3/envs/coqui/lib/python3.9/site-packages/trainer/io.py:83: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  return torch.load(f, map_location=map_location, **kwargs)

Environment

{
    "CUDA": {
        "GPU": [],
        "available": false,
        "version": null
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.4.0",
        "TTS": "0.24.1",
        "numpy": "1.26.4"
    },
    "System": {
        "OS": "Darwin",
        "architecture": [
            "64bit",
            ""
        ],
        "processor": "arm",
        "python": "3.9.19",
        "version": "Darwin Kernel Version 23.6.0: Mon Jul 29 21:14:30 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T6030"
    }
}

Additional context

No response

eginhard commented 2 months ago

This is not a bug; it's just how the XTTS model works, and it can't be avoided completely. You could try shortening the input by splitting it into sentences.
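
A minimal sketch of that suggestion, using the same input text as above. The `split_sentences=True` call relies on the library's built-in sentence splitting (which the original code disabled); the manual chunking on commas and the output filenames are purely illustrative assumptions, not a recommended recipe.

from TTS.api import TTS

text = "on the menu that Sam our chef here has put together, Okay this is one of our best sellers isn't it Sam, Yes it is, So this is our scampi, So I grew up in a pub and a lot of the things on the menu is inspired by the dishes from"

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Option 1: let the library split the text into sentences before synthesis.
# Note: this sample text has no sentence-final punctuation, so the splitter
# may not find much to split here.
tts.tts_to_file(text=text,
                file_path="test_split.wav",
                speaker_wav="./tests/data/ljspeech/wavs/LJ001-0001.wav",
                language="en",
                split_sentences=True)

# Option 2: split the input into shorter chunks yourself (here on commas,
# purely for illustration) and synthesize each chunk separately, so the
# model never sees a long run of text in one pass.
chunks = [c.strip() for c in text.split(",") if c.strip()]
for i, chunk in enumerate(chunks):
    tts.tts_to_file(text=chunk,
                    file_path=f"chunk_{i}.wav",
                    speaker_wav="./tests/data/ljspeech/wavs/LJ001-0001.wav",
                    language="en",
                    split_sentences=False)

Shorter inputs per generation reduce the chance of the model repeating itself, at the cost of having to stitch the resulting audio files back together.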