idiap / coqui-ai-TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
https://coqui-tts.readthedocs.io
Mozilla Public License 2.0

[Bug] Bad voice cloning results when loading with model_path and config_path #134

Open OlegRuban-ai opened 16 hours ago

OlegRuban-ai commented 16 hours ago

Describe the bug

I used two options to load the model:

  1. tts = TTS(model_path="/XTTS", config_path="/XTTS/config.json", progress_bar=True).to("cuda")

  2. tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", progress_bar=True, gpu=True)

With the first option, the generated audio is much worse than with the second, standard loading path. Why? How do I load all the configuration files correctly, and from where?

I took the model for model_path and the config from here: https://huggingface.co/coqui/XTTS-v2/tree/main
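For reference, here is a quick sanity check that the local directory contains everything XTTS reads at load time (a sketch; the filenames are assumed from the linked Hugging Face repo, which ships model.pth, config.json, vocab.json, and speakers_xtts.pth among others):

import os

xtts_dir = "/XTTS"  # local checkpoint directory from option 1

# vocab.json affects tokenization and speakers_xtts.pth provides the named
# built-in speakers; a stale or missing file here can silently change output.
for name in ("model.pth", "config.json", "vocab.json", "speakers_xtts.pth"):
    path = os.path.join(xtts_dir, name)
    print(name, "OK" if os.path.isfile(path) else "MISSING")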

To Reproduce

  1. tts = TTS(model_path="/XTTS", config_path="/XTTS/config.json", progress_bar=True).to("cuda")

  2. tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", progress_bar=True, gpu=True)

Expected behavior

No response

Logs

No response

Environment

TTS 0.24.3
PyTorch 1.8

Additional context

No response

eginhard commented 15 hours ago

Can you fix the seed as below and compare both? For me, both methods of loading produce exactly the same output:

import os

import torch
from trainer.io import get_user_data_dir

from TTS.api import TTS

model_name = "tts_models/multilingual/multi-dataset/xtts_v2"

# Load by model name (uses the files downloaded to the user data dir).
xtts1 = TTS(model_name).to("cuda")

# Load the same downloaded files explicitly via model_path/config_path.
xtts_dir = os.path.join(get_user_data_dir("tts"), "--".join(model_name.split("/")))
xtts2 = TTS(model_path=xtts_dir, config_path=os.path.join(xtts_dir, "config.json")).to("cuda")

torch.manual_seed(123)
out1 = xtts1.tts("This is a test", speaker="Ana Florence", language="en")

torch.manual_seed(123)
out2 = xtts2.tts("This is a test", speaker="Ana Florence", language="en")

assert out1 == out2

OlegRuban-ai commented 6 hours ago

@eginhard

Thank you, but the problem is not fixed.

When we use text-to-speech with a built-in speaker, the results are identical. But when we use speaker_wav, we do this:

model_tts.tts_to_file(
    text=prompt,
    file_path=audio_path_result,
    speaker_wav=processed_file,
    emotion="neutral",
    language=language,
    split_sentences=split_sentences,
    # speaker="Ana Florence",
    # preset="high_quality",
)

then the results are different.

from trainer.io import get_user_data_dir

model_name = "tts_models/multilingual/multi-dataset/xtts_v2"
tts_1 = TTS(model_name, gpu=True)

xtts_dir = os.path.join(get_user_data_dir("tts"), "--".join(model_name.split("/")))
tts_2 = TTS(
    model_path="/models_and_tokenizers/text2audio",
    # model_path=xtts_dir,
    config_path="/models_and_tokenizers/text2audio/config.json",
    progress_bar=False,
    gpu=True,
)

When using tts_1, the voice is similar to the original, but when using tts_2, it is not at all similar.
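One way to narrow this down (a sketch, assuming the loaded model is reachable as tts.synthesizer.tts_model): compare the speaker conditioning that each model computes from the same reference clip. get_conditioning_latents is what XTTS uses internally when you pass speaker_wav, so if these differ, the two directories do not hold the same checkpoint or config:

import torch

ref_wav = "/path/to/reference.wav"  # hypothetical reference clip

lat1, emb1 = tts_1.synthesizer.tts_model.get_conditioning_latents(audio_path=ref_wav)
lat2, emb2 = tts_2.synthesizer.tts_model.get_conditioning_latents(audio_path=ref_wav)

# Identical checkpoints should yield identical conditioning.
print(torch.allclose(lat1, lat2), torch.allclose(emb1, emb2))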

But there is still a problem with the 182-token limit for the Russian language. To work around it when loading from HF, I replaced code in tokenizer.py, but when loading from a local directory, that file (TTS/tts/layers/xtts/tokenizer.py) is apparently no longer used. How can I bypass the restriction?
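A minimal runtime workaround sketch, assuming the limit you mean is the per-language character limit stored on the XTTS tokenizer (char_limits in TTS/tts/layers/xtts/tokenizer.py, 182 for "ru") and that your loaded instance is model_tts; note this limit only triggers a warning about possible truncation, and very long chunks may still degrade quality, so keeping split_sentences=True is safer:

# Patch the tokenizer of the already-loaded model instead of editing
# tokenizer.py, so it works the same for HF and local loading.
tokenizer = model_tts.synthesizer.tts_model.tokenizer  # assumed attribute path
tokenizer.char_limits["ru"] = 400  # assumed attribute; default is 182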

Also, is it possible to add emphasis, control the speaker's speaking speed, or add emotions? I couldn't find anything like that in the code.
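For speaking speed specifically, the XTTS inference call accepts a speed argument, and the high-level API forwards extra keyword arguments down to the model, so something like the following should work (a sketch reusing the variables from the snippet above; as far as I can tell, emphasis and emotion controls are not exposed by XTTS, and the emotion="neutral" argument is ignored by this model):

model_tts.tts_to_file(
    text=prompt,
    file_path=audio_path_result,
    speaker_wav=processed_file,
    language=language,
    speed=1.2,  # forwarded to XTTS inference; 1.0 is normal speed
)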

And can you help me: how do I use split_sentences with this code?

from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

config = XttsConfig()
config.load_json("/path/to/xtts/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/", eval=True)
model.cuda()

outputs = model.synthesize(
    "It took me quite a long time to develop a voice and now that I have it I am not going to be silent.",
    config,
    speaker_wav="/data/TTS-public/_refclips/3.wav",
    gpt_cond_len=3,
    language="en",
)
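split_sentences is an argument of the high-level TTS.api pipeline, not of Xtts.synthesize. With the low-level model you can split the text yourself and concatenate the chunks; here is a sketch using the split_sentence helper from the XTTS tokenizer module (the import path and the "wav" output key are assumptions on my side):

import numpy as np

from TTS.tts.layers.xtts.tokenizer import split_sentence  # assumed helper

text = (
    "It took me quite a long time to develop a voice and now that I have it "
    "I am not going to be silent."
)
chunks = split_sentence(text, "en", text_split_length=250)

wavs = []
for chunk in chunks:
    out = model.synthesize(
        chunk,
        config,
        speaker_wav="/data/TTS-public/_refclips/3.wav",
        gpt_cond_len=3,
        language="en",
    )
    wavs.append(np.asarray(out["wav"]))  # "wav" holds the waveform chunk

wav = np.concatenate(wavs)

Recomputing the speaker conditioning for every chunk is wasteful; computing it once with get_conditioning_latents and calling model.inference per chunk would avoid that.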