k2-fsa / sherpa-onnx

Speech-to-text, text-to-speech, speaker recognition, and VAD using next-gen Kaldi with onnxruntime, without an Internet connection. Supports embedded systems, Android, iOS, Raspberry Pi, RISC-V, x86_64 servers, websocket server/client, C/C++, Python, Kotlin, C#, Go, NodeJS, Java, Swift, Dart, JavaScript, Flutter, Object Pascal, Lazarus, Rust
https://k2-fsa.github.io/sherpa/onnx/index.html
Apache License 2.0

Questions related to MeloTTS #1193

Open · eehoeskrap opened this issue 1 month ago

eehoeskrap commented 1 month ago

Thank you for creating a great repository. I wonder why there is no bert input when converting a PyTorch MeloTTS model to an ONNX model. https://github.com/k2-fsa/sherpa-onnx/blob/963aaba82b01a425ae8dcf0fdcff6b073a45686f/scripts/melo-tts/export-onnx.py#L206C1-L235C6

    torch.onnx.export(
        torch_model,
        (
            x,
            x_lengths,
            tones,
            sid,
            noise_scale,
            length_scale,
            noise_scale_w,
        ),
        filename,
        opset_version=opset_version,
        input_names=[
            "x",
            "x_lengths",
            "tones",
            "sid",
            "noise_scale",
            "length_scale",
            "noise_scale_w",
        ],
        output_names=["y"],
        dynamic_axes={
            "x": {0: "N", 1: "L"},
            "x_lengths": {0: "N"},
            "tones": {0: "N", 1: "L"},
            "y": {0: "N", 1: "S", 2: "T"},
        },
    )
csukuangfj commented 1 month ago

Could you tell us how to get the input for bert from text?

Is there any C++ implementation for that?

eehoeskrap commented 1 month ago

In this code, you can get the bert value through the get_bert function. get_bert calls a different torch model for each language, and there is only a Python implementation. https://github.com/myshell-ai/MeloTTS/blob/144a0980fac43411153209cf08a1998e3c161e10/melo/utils.py#L22
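
Roughly, the flow looks like the sketch below (this is not the exact get_bert implementation; the checkpoint name and the word-to-phone mapping word2ph are placeholders):

    import torch
    from transformers import AutoModel, AutoTokenizer

    # Placeholder checkpoint; each language uses its own BERT model in MeloTTS
    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    bert_model = AutoModel.from_pretrained("bert-base-multilingual-cased")

    def get_bert_features(text: str, word2ph: list) -> torch.Tensor:
        # Tokenize the text and run the BERT encoder; this is the step that
        # currently has no C++ counterpart
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            hidden = bert_model(**inputs).last_hidden_state[0]  # (num_tokens, 768)
        # Repeat each token's feature vector for the number of phones it maps
        # to, so the result aligns with the phone sequence: (768, num_phones)
        phone_level = torch.cat(
            [hidden[i].unsqueeze(0).repeat(n, 1) for i, n in enumerate(word2ph)]
        )
        return phone_level.t()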

eehoeskrap commented 1 month ago

In your code, there is a part in ModelWrapper where bert and ja_bert are passed as model inputs. https://github.com/k2-fsa/sherpa-onnx/blob/963aaba82b01a425ae8dcf0fdcff6b073a45686f/scripts/melo-tts/export-onnx.py#L172

So even though I specified input_names as below when exporting to ONNX, the resulting onnx file has no bert among its inputs.

    torch.onnx.export(
        torch_model,
        (
            x,
            x_lengths,
            sid,
            tones,
            lang_id,
            bert,
            ja_bert,
            sdp_ratio,
            noise_scale,
            noise_scale_w,
            length_scale,
        ),
        filename,
        opset_version=opset_version,
        input_names=[
            "x",
            "x_lengths",
            "sid",
            "tones",
            "lang_id",
            "bert",
            "ja_bert",
            "sdp_ratio",
            "noise_scale",
            "noise_scale_w",
            "length_scale",
        ],
        output_names=["y"],
        dynamic_axes={
            "x": {0: "N", 1: "L"},
            "x_lengths": {0: "N"},
            "tones": {0: "N", 1: "L"},
            "lang_id": {0: "N", 1: "L"},
            "bert": {0: "N", 1: "L", 2: "D"},
            "ja_bert": {0: "N", 1: "L", 2: "D"},
            "y": {0: "N", 1: "S", 2: "T"},
        },
    )
csukuangfj commented 1 month ago

Could you tell us how to get the input for bert from text?

Is there any C++ implementation for that?

Please have a look at this comment. That is the main obstacle. If you can fix it, then we can support bert.
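
One possible reason bert does not show up among the ONNX inputs is that torch.onnx.export keeps only the inputs that actually reach an op in the traced graph; if the wrapper still builds bert internally, the tensor you pass in is never used and gets dropped. A toy sketch of that behaviour (unrelated to MeloTTS):

    import onnx
    import torch

    class Toy(torch.nn.Module):
        def forward(self, x, unused):
            # "unused" never reaches any op, so the tracer drops it
            return x * 2

    torch.onnx.export(
        Toy(),
        (torch.zeros(2, 3), torch.zeros(2, 3)),
        "toy.onnx",
        input_names=["x", "unused"],
        output_names=["y"],
    )
    # Typically prints ['x'] only; "unused" is not a graph input
    print([i.name for i in onnx.load("toy.onnx").graph.input])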

csukuangfj commented 1 month ago

In this code, you can get the bert value through the get_bert function.

Yes, I know that. I am asking whether you know of a C++ implementation for that, or whether it is possible to implement it in C++.

eehoeskrap commented 1 month ago

In this code, you can get the bert value through the get_bert function.

Yes, I know that. I am asking whether you know of a C++ implementation for that, or whether it is possible to implement it in C++.

As far as I know, there is currently no C++ implementation of the Korean BERT model. I will try it and let you know.

csukuangfj commented 1 month ago

By the way, the main issue is about the tokenizer.

eehoeskrap commented 1 month ago

By the way, the main issue is about the tokenizer.

Yes, I know that. If you run the ONNX model with the bert value set to 0, as in this code, the generated Korean speech sounds unnatural.

https://github.com/k2-fsa/sherpa-onnx/blob/963aaba82b01a425ae8dcf0fdcff6b073a45686f/scripts/melo-tts/export-onnx.py#L162

csukuangfj commented 1 month ago

If you run the ONNX model with the bert value set to 0, as in this code, the generated Korean speech sounds unnatural.

In that case, supporting Korean models from MeloTTS in sherpa-onnx may be hard.

Could you try https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-mimic3-ko_KO-kss_low.tar.bz2

We already have a Korean TTS model in sherpa-onnx.

eehoeskrap commented 1 month ago

If you run the ONNX model with the bert value set to 0, as in this code, the generated Korean speech sounds unnatural.

In that case, supporting Korean models from MeloTTS in sherpa-onnx may be hard.

Could you try https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-mimic3-ko_KO-kss_low.tar.bz2

We already have a Korean TTS model in sherpa-onnx.

I found this repo while trying to export MeloTTS models to ONNX. When exporting to ONNX with this code, I was wondering why bert was not included. Thanks to your answer, I now understand that it is because there is no C++ implementation.

I already have a Korean TTS model trained on custom data, and I just succeeded in exporting it to ONNX with the bert values included. However, the preprocessing (tokenizer, etc.) still runs in Python.

The Korean MeloTTS torch model is exported to ONNX for inference, so it is quite fast. However, I still need to try a C++ implementation of the preprocessing, as you did. I will try this, although Korean phoneme processing is quite difficult.

As you mentioned earlier, the biggest question is indeed "How do we implement the bert torch model in C++?" First, I will try exporting the bert model to ONNX.
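
A first attempt at that export might look like the sketch below (the checkpoint name is a placeholder, and the tokenizer itself would still need a C++ port):

    import torch
    from transformers import AutoModel, AutoTokenizer

    name = "kykim/bert-kor-base"  # placeholder Korean BERT checkpoint
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name, torchscript=True)
    model.eval()

    inputs = tokenizer("안녕하세요", return_tensors="pt")
    torch.onnx.export(
        model,
        (inputs["input_ids"], inputs["attention_mask"], inputs["token_type_ids"]),
        "bert.onnx",
        opset_version=13,
        input_names=["input_ids", "attention_mask", "token_type_ids"],
        output_names=["last_hidden_state"],
        dynamic_axes={
            "input_ids": {0: "N", 1: "L"},
            "attention_mask": {0: "N", 1: "L"},
            "token_type_ids": {0: "N", 1: "L"},
            "last_hidden_state": {0: "N", 1: "L"},
        },
    )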

Thank you for the reply.

nanaghartey commented 1 month ago

@csukuangfj Unrelated question: the Android TTS works well, playing audio while it is still generating. Can you add this to the iOS TTS too? Currently the iOS version has to process the entire text before synthesizing the audio. Finally, I also noticed the iOS version can't be published to the App Store due to a framework issue.

csukuangfj commented 1 month ago

Currently the iOS version has to process the entire text before synthesizing the audio.

I just added support for passing a callback from Swift to C. Please see #1218

Please play the samples received in the callback by yourself, possibly in a separate thread. We don't have time to add that.


Finally, I also noticed the iOS version can't be published to the App Store due to a framework issue.

Please have a look at https://github.com/k2-fsa/sherpa-onnx/issues/1172


By the way, contributions to sherpa-onnx are highly appreciated.

Hope that you can fix the issues by yourself.

@nanaghartey

nanaghartey commented 1 month ago

@csukuangfj No problem. I actually made some contributions but noticed the latest version fixes most of the issues I found. For example, in sherpa-onnx/jni/jni.cc some reserved words in Java were used, preventing porting of the sample TTS Kotlin code to Java, e.g. Java_com_k2fsa_sherpa_onnx_SpeakerEmbeddingExtractor_new. Now all is good!

By the way, I just checked out MeloTTS, fine-tuned a model, and exported it to sherpa-onnx for Android. It's great. How can I help bring this to iOS? I'm not sure the SwiftUI TTS example accepts MeloTTS models.

csukuangfj commented 1 month ago

How can I help bring this to iOS? I'm not sure the SwiftUI TTS example accepts MeloTTS models.

Yes, it is already supported. In case you don't know how to do it, I just added an example for you. Please see https://github.com/k2-fsa/sherpa-onnx/pull/1223

@nanaghartey

nanaghartey commented 1 month ago

@csukuangfj I have a single-speaker fine-tuned model (MeloTTS). It works great, but when I convert it to sherpa-onnx and then use the provided zh_en .fst and .dict files on Android, I get wrong synthesis. I assumed it would work since my model is English. How can I generate the .fst and .dict files for my custom model, or can we make it work by changing the configuration?

csukuangfj commented 1 month ago

You don't need *.fst files for English-only models.

Could you post the code showing how you add the metadata?


, I get wrong synthesis.

Could you be more specific? What does "wrong" mean?

nanaghartey commented 1 month ago

@csukuangfj Thanks for the prompt response.

"wrong" here means unexpected output. wrong pronunciations.

Sorry, but this is how I export (the default export script only exports Chinese+English):

import torch
from melo.api import TTS
from melo.text import language_id_map, language_tone_start_map
from melo.text.chinese import pinyin_to_symbol_map
from melo.text.english import eng_dict, refine_syllables
from pypinyin import Style, lazy_pinyin, phrases_dict, pinyin_dict
from typing import Any, Dict
import json

# Prepare the pinyin to symbol map
for k, v in pinyin_to_symbol_map.items():
    if isinstance(v, list):
        break
    pinyin_to_symbol_map[k] = v.split()

# Function to get initial, final, and tone from pinyin
def get_initial_final_tone(word: str):
    initials = lazy_pinyin(word, neutral_tone_with_five=True, style=Style.INITIALS)
    finals = lazy_pinyin(word, neutral_tone_with_five=True, style=Style.FINALS_TONE3)

    ans_phone = []
    ans_tone = []

    for c, v in zip(initials, finals):
        raw_pinyin = c + v
        v_without_tone = v[:-1]
        try:
            tone = v[-1]
        except:
            return [], []

        pinyin = c + v_without_tone
        if c:
            v_rep_map = {
                "uei": "ui",
                "iou": "iu",
                "uen": "un",
            }
            if v_without_tone in v_rep_map.keys():
                pinyin = c + v_rep_map[v_without_tone]
        else:
            pinyin_rep_map = {
                "ing": "ying",
                "i": "yi",
                "in": "yin",
                "u": "wu",
            }
            if pinyin in pinyin_rep_map.keys():
                pinyin = pinyin_rep_map[pinyin]
            else:
                single_rep_map = {
                    "v": "yu",
                    "e": "e",
                    "i": "y",
                    "u": "w",
                }
                if pinyin[0] in single_rep_map.keys():
                    pinyin = single_rep_map[pinyin[0]] + pinyin[1:]

        if pinyin not in pinyin_to_symbol_map:
            continue
        phone = pinyin_to_symbol_map[pinyin]
        ans_phone += phone
        ans_tone += [tone] * len(phone)

    return ans_phone, ans_tone

# Function to generate tokens file
def generate_tokens(symbol_list):
    with open("tokens.txt", "w", encoding="utf-8") as f:
        for i, s in enumerate(symbol_list):
            f.write(f"{s} {i}\n")

# Function to add new English words to the lexicon
def add_new_english_words(lexicon):
    lexicon["kaldi"] = [["K", "AH0"], ["L", "D", "IH0"]]
    lexicon["SF"] = [["EH1", "S"], ["EH1", "F"]]

# Function to generate lexicon file
def generate_lexicon():
    word_dict = pinyin_dict.pinyin_dict
    phrases = phrases_dict.phrases_dict
    add_new_english_words(eng_dict)
    with open("lexicon.txt", "w", encoding="utf-8") as f:
        for word in eng_dict:
            phones, tones = refine_syllables(eng_dict[word])
            tones = [t + language_tone_start_map["EN"] for t in tones]
            tones = [str(t) for t in tones]

            phones = " ".join(phones)
            tones = " ".join(tones)

            f.write(f"{word.lower()} {phones} {tones}\n")

        for key in word_dict:
            if not (0x4E00 <= key <= 0x9FA5):
                continue
            w = chr(key)
            phone, tone = get_initial_final_tone(w)
            if not phone:
                continue
            phone = " ".join(phone)
            tone = " ".join(tone)
            f.write(f"{w} {phone} {tone}\n")

        for w in phrases:
            phone, tone = get_initial_final_tone(w)
            if not phone:
                continue
            phone = " ".join(phone)
            tone = " ".join(tone)
            f.write(f"{w} {phone} {tone}\n")

# Function to add metadata to ONNX model
def add_meta_data(filename: str, meta_data: Dict[str, Any]):
    import onnx
    model = onnx.load(filename)
    while len(model.metadata_props):
        model.metadata_props.pop()

    for key, value in meta_data.items():
        meta = model.metadata_props.add()
        meta.key = key
        meta.value = str(value)

    onnx.save(model, filename)

# ModelWrapper class definition
class ModelWrapper(torch.nn.Module):
    def __init__(self, model: "SynthesizerTrn"):
        super().__init__()
        self.model = model
        self.lang_id = language_id_map[model.language]

    def forward(
        self,
        x,
        x_lengths,
        tones,
        sid,
        noise_scale,
        length_scale,
        noise_scale_w,
        max_len=None,
    ):
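        # The exported graph has no bert/ja_bert inputs; all-zero placeholders are created here instead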
        bert = torch.zeros(x.shape[0], 1024, x.shape[1], dtype=torch.float32)
        ja_bert = torch.zeros(x.shape[0], 768, x.shape[1], dtype=torch.float32)
        lang_id = torch.zeros_like(x)
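        # put the language id on every other position (tokens are interleaved with blanks when add_blank is used)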
        lang_id[:, 1::2] = self.lang_id
        return self.model.model.infer(
            x=x,
            x_lengths=x_lengths,
            sid=sid,
            tone=tones,
            language=lang_id,
            bert=bert,
            ja_bert=ja_bert,
            noise_scale=noise_scale,
            noise_scale_w=noise_scale_w,
            length_scale=length_scale,
        )[0]

# Main function to handle model loading and ONNX export
def main():
    generate_lexicon()  # Generate the lexicon.txt file

    model_path = "model.pth"  # Path to your custom model
    config_path = "config.json"  # Path to your config.json file
    with open(config_path, 'r') as f:
        config = json.load(f)

    model = TTS(language="EN", device="cpu", config_path=config_path, ckpt_path=model_path)
    model.load_state_dict(torch.load(model_path, map_location="cpu"), strict=False)

    generate_tokens(config["symbols"])  # Generate tokens.txt file

    torch_model = ModelWrapper(model)

    x = torch.randint(low=0, high=10, size=(60,), dtype=torch.int64)
    x_lengths = torch.tensor([x.size(0)], dtype=torch.int64)
    sid = torch.tensor([0], dtype=torch.int64)
    tones = torch.zeros_like(x)

    noise_scale = torch.tensor([1.0], dtype=torch.float32)
    length_scale = torch.tensor([1.0], dtype=torch.float32)
    noise_scale_w = torch.tensor([1.0], dtype=torch.float32)

    x = x.unsqueeze(0)
    tones = tones.unsqueeze(0)

    filename = "model.onnx"
    torch.onnx.export(
        torch_model,
        (x, x_lengths, tones, sid, noise_scale, length_scale, noise_scale_w),
        filename,
        opset_version=13,
        input_names=["x", "x_lengths", "tones", "sid", "noise_scale", "length_scale", "noise_scale_w"],
        output_names=["y"],
        dynamic_axes={
            "x": {0: "N", 1: "L"},
            "x_lengths": {0: "N"},
            "tones": {0: "N", 1: "L"},
            "y": {0: "N", 1: "S", 2: "T"},
        },
    )

    meta_data = {
        "model_type": "melo-vits",
        "comment": "melo",
        "version": 2,
        "language": "English",
        "add_blank": int(config["data"]["add_blank"]),
        "n_speakers": config["data"]["n_speakers"],
        "jieba": 1,
        "sample_rate": config["data"]["sampling_rate"],
        "bert_dim": 1024,
        "ja_bert_dim": 768,
        "speaker_id": list(config["data"]["spk2id"].values())[0],
        "lang_id": language_id_map["EN"],
        "tone_start": language_tone_start_map["EN"],
        "url": "https://github.com/myshell-ai/MeloTTS",
        "license": "MIT license",
        "description": "MeloTTS is a high-quality multi-lingual text-to-speech library by MyShell.ai",
    }
    add_meta_data(filename, meta_data)

if __name__ == "__main__":
    main()
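
After export, the metadata written by add_meta_data can be double-checked with a short snippet like this (it only prints what is stored in the file):

    import onnx

    m = onnx.load("model.onnx")
    print({p.key: p.value for p in m.metadata_props})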

Then in api.py I do:

class TTS(nn.Module):
    def __init__(self, 
                 language,
                 device='auto',
                 use_hf=True,
                 config_path=None,
                 ckpt_path=None):
        super().__init__()
        if device == 'auto':
            device = 'cpu'
            if torch.cuda.is_available():
                device = 'cuda'
            if torch.backends.mps.is_available():
                device = 'mps'
        if 'cuda' in device:
            assert torch.cuda.is_available()

        # Load configuration from your custom config_path
        if config_path:
            hps = utils.get_hparams_from_file(config_path)
        else:
            hps = load_or_download_config(language, use_hf=use_hf)

        num_languages = hps.num_languages
        num_tones = hps.num_tones
        symbols = hps.symbols

        model = SynthesizerTrn(
            len(symbols),
            hps.data.filter_length // 2 + 1,
            hps.train.segment_size // hps.data.hop_length,
            n_speakers=hps.data.n_speakers,
            num_tones=num_tones,
            num_languages=num_languages,
            **hps.model,
        ).to(device)

        model.eval()
        self.model = model
        self.symbol_to_id = {s: i for i, s in enumerate(symbols)}
        self.hps = hps
        self.device = device

        # load state_dict
        checkpoint_dict = load_or_download_model(language, device, use_hf=use_hf, ckpt_path=ckpt_path)
        self.model.load_state_dict(checkpoint_dict['model'], strict=True)

        language = language.split('_')[0]
        self.language = 'ZH_MIX_EN' if language == 'ZH' else language
csukuangfj commented 1 month ago

"wrong" here means unexpected output. wrong pronunciations.

Could you post some text and the corresponding generated wav?


Please also post the logs if you use sherpa-onnx to generate the wav with your model.

csukuangfj commented 1 month ago

https://github.com/csukuangfj/onnxruntime-build/actions/runs/9184634501

You can see from the above link that we can successfully build a debug version of the static lib.

nanaghartey commented 1 month ago

"wrong" here means unexpected output. wrong pronunciations.

Could you post some text and the corresponding generated wav?

Please also post the logs if you use sherpa-onnx to generate the wav with your model.

Custom model 1: English, news (African accent)

text - "things to look out for in the year 2020"

.pth generated wav -

https://github.com/user-attachments/assets/b6ca93ad-c38c-412c-8c6e-45e8b6e28a84

onnx generated wav -

https://github.com/user-attachments/assets/6dead35d-4ced-4883-827c-2b7cda9941fc

Custom model 2: English, singing (US accent)

text - "next time won't you sing with me"

.pth generated wav -

https://github.com/user-attachments/assets/4ce3f2be-a7ea-404d-be90-f4e80d712ab3

onnx generated wav -

https://github.com/user-attachments/assets/d7b0ce43-dca6-48ad-b2b6-33dc28a5ef31

I use sherpa-onnx but don't get logs. I was only trying out MeloTTS on sherpa-onnx, so the models were not trained for long (training is not the issue though).

I hope you're able to spot the issue. Thanks

nanaghartey commented 1 month ago

@csukuangfj I can also share my model.pth and config.json files if that'd help.

csukuangfj commented 1 month ago

When you use .pth to test your model, can you zero out the bert part and try again?

nanaghartey commented 1 month ago

When you use .pth to test your model, can you zero out the bert part and try again?

The result is still better than the ONNX output when I zero out the bert part.

csukuangfj commented 1 month ago

Could you show the code about how you did that?

nanaghartey commented 1 month ago

In api.py, in def tts_to_file(), I did:

    bert = torch.zeros_like(bert).to(device)

Please share your solution if that is wrong.

csukuangfj commented 1 month ago

Could you please post the complete code?

nanaghartey commented 1 month ago

Could you please post the complete code?

    def tts_to_file(self, text, speaker_id, output_path=None, sdp_ratio=0.2, noise_scale=0.6, noise_scale_w=0.8, speed=1.0, pbar=None, format=None, position=None, quiet=False,):
        language = self.language
        texts = self.split_sentences_into_pieces(text, language, quiet)
        audio_list = []
        if pbar:
            tx = pbar(texts)
        else:
            if position:
                tx = tqdm(texts, position=position)
            elif quiet:
                tx = texts
            else:
                tx = tqdm(texts)
        for t in tx:
            if language in ['EN', 'ZH_MIX_EN']:
                t = re.sub(r'([a-z])([A-Z])', r'\1 \2', t)
            device = self.device
            bert, ja_bert, phones, tones, lang_ids = utils.get_text_for_tts_infer(t, language, self.hps, device, self.symbol_to_id)
            #bert = torch.zeros_like(bert).to(device)
            #ja_bert = torch.zeros_like(ja_bert).to(device)
            with torch.no_grad():
                x_tst = phones.to(device).unsqueeze(0)
                tones = tones.to(device).unsqueeze(0)
                lang_ids = lang_ids.to(device).unsqueeze(0)
                bert = bert.to(device).unsqueeze(0)
                ja_bert = ja_bert.to(device).unsqueeze(0)
                x_tst_lengths = torch.LongTensor([phones.size(0)]).to(device)
                del phones
                speakers = torch.LongTensor([speaker_id]).to(device)
                audio = self.model.infer(
                        x_tst,
                        x_tst_lengths,
                        speakers,
                        tones,
                        lang_ids,
                        bert,
                        ja_bert,
                        sdp_ratio=sdp_ratio,
                        noise_scale=noise_scale,
                        noise_scale_w=noise_scale_w,
                        length_scale=1. / speed,
                    )[0][0, 0].data.cpu().float().numpy()
                del x_tst, tones, lang_ids, bert, ja_bert, x_tst_lengths, speakers
                # 
            audio_list.append(audio)
        torch.cuda.empty_cache()
        audio = self.audio_numpy_concat(audio_list, sr=self.hps.data.sampling_rate, speed=speed)

        if output_path is None:
            return audio
        else:
            if format:
                soundfile.write(output_path, audio, self.hps.data.sampling_rate, format=format)
            else:
                soundfile.write(output_path, audio, self.hps.data.sampling_rate)
csukuangfj commented 1 month ago

In api.py, in def tts_to_file(), I did:

    bert = torch.zeros_like(bert).to(device)

Please share your solution if that is wrong.

Could you change

           bert, ja_bert, phones, tones, lang_ids = utils.get_text_for_tts_infer(t, language, self.hps, device, self.symbol_to_id)
           #bert = torch.zeros_like(bert).to(device)
           #ja_bert = torch.zeros_like(ja_bert).to(device)

to

           bert, ja_bert, phones, tones, lang_ids = utils.get_text_for_tts_infer(t, language, self.hps, device, self.symbol_to_id)
           bert.zero_()
           ja_bert.zero_()
nanaghartey commented 1 month ago

@csukuangfj

      bert, ja_bert, phones, tones, lang_ids = utils.get_text_for_tts_infer(t, language, self.hps, device, self.symbol_to_id)
           bert.zero_()
           ja_bert.zero_()

The result is a generated wav that sounds almost the same as the original .pth inference (without zeroing out), except for a few pronunciations that sound off. However, it's way better than the wavs from ONNX above. Here is the output with bert zeroed out:

https://github.com/user-attachments/assets/79b12910-318a-432d-8d08-0687d31e566b

https://github.com/user-attachments/assets/237d7511-a562-4674-b893-ddb2e5de54ea

I then tried:

    bert = torch.zeros(x.shape[0], 1024, x.shape[1], dtype=torch.float32)
    ja_bert = torch.zeros(x.shape[0], 768, x.shape[1], dtype=torch.float32)
    bert.zero_()
    ja_bert.zero_()

in export-onnx.py for the ONNX conversion, but I got the same "wrong" results shared earlier.
csukuangfj commented 1 month ago

Please compare the inputs to the model manually and see if they are the same.

nanaghartey commented 1 month ago

Please compare the inputs to the model manually and see if they are the same.

My .pth has:

BERT input shape: torch.Size([1024, 71])
JA_BERT input shape: torch.Size([768, 71])
Phones input shape: torch.Size([71])
Tones input shape: torch.Size([71])
Language IDs shape: torch.Size([71])

What changes can I make to the ONNX export script, or is there any other way to get this EN model to work with sherpa-onnx TTS? :(

csukuangfj commented 1 month ago

https://github.com/k2-fsa/sherpa-onnx/blob/master/scripts/melo-tts/test.py

Please use this script to test the ONNX model.

By comparing the model inputs, I mean comparing the values of the inputs, including their shapes.
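
For example, you could dump the tensors from the .pth path and from the ONNX test script to files and compare them roughly like this (a sketch; the file names are placeholders):

    import numpy as np
    import torch

    # Inside tts_to_file(), e.g.:
    #   torch.save({"x": x_tst, "tones": tones, "lang_ids": lang_ids}, "pth_inputs.pt")
    # and save the corresponding tensors from the ONNX test script as onnx_inputs.pt
    a = torch.load("pth_inputs.pt")
    b = torch.load("onnx_inputs.pt")
    for k in a:
        print(k, tuple(a[k].shape), tuple(b[k].shape),
              np.allclose(a[k].numpy(), b[k].numpy()))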

nanaghartey commented 1 month ago

https://github.com/k2-fsa/sherpa-onnx/blob/master/scripts/melo-tts/test.py

Please use this script to test the ONNX model.

By comparing the model inputs, I mean comparing the values of the inputs, including their shapes.

Below is the output. What next step should I take, please?

Dumping model to file cache /var/folders/vf/13g26rdn3673b6cxhlhy03yh0000gn/T/jieba.cache
Loading model cost 1.013 seconds.
Prefix dict has been built successfully.
这是
t 这是
w 这
这
w 是
是
一个
使用
t 使用
w 使
使
w 用
用

next

generation

kaldi

的

text

to

speech

中英文
t 中英文
w 中
中
w 英
英
w 文
文
例子
.

Thank
t Thank
w T
T
t T
w h
h
w a
a
w n
n
w k
k

you
!

你
觉得
如何
呢
?

are

you

ok
?

Fantastic
t Fantastic
w F
F
t F
w a
a
w n
n
w t
t
w a
a
w s
s
w t
t
w i
i
w c
c
!

How
t How
w H
H
t H
w o
o
w w
w

about

you
?
torch.Size([265]) torch.Size([265])
torch.Size([1, 265]) torch.Size([1, 265])
csukuangfj commented 1 month ago

Please compare the input tensor values.

nanaghartey commented 1 month ago

@csukuangfj I tried that but still didn't get good results. I even tried exporting the official MeloTTS models on Hugging Face: https://huggingface.co/myshell-ai

At this point I think I may just have to continue using piper/coqui, though it doesn't sound as good as MeloTTS. Thanks for all the support :)

csukuangfj commented 1 month ago

By the way, the comparison is to help with debugging.

dhc45010 commented 1 month ago

When calling vits-melo-tts-zh_en through sherpa-onnx GPU (1.10.17+cuda), I get an error; what could be the reason? (The CPU version works.) python3 ./python-api-examples/offline-tts-play.py --vits-model=./vits-melo-tts-zh_en/model.onnx

csukuangfj commented 1 month ago

When calling vits-melo-tts-zh_en through sherpa-onnx GPU (1.10.17+cuda), I get an error; what could be the reason? (The CPU version works.) python3 ./python-api-examples/offline-tts-play.py --vits-model=./vits-melo-tts-zh_en/model.onnx

We cannot solve this problem at the moment.

dhc45010 commented 1 month ago

When calling vits-melo-tts-zh_en through sherpa-onnx GPU (1.10.17+cuda), I get an error; what could be the reason? (The CPU version works.) python3 ./python-api-examples/offline-tts-play.py --vits-model=./vits-melo-tts-zh_en/model.onnx

We cannot solve this problem at the moment.

OK, thanks.

studionexus-lk commented 3 weeks ago

Does anyone have a Google Colab notebook for this, to convert models? I need Japanese TTS voices.

csukuangfj commented 3 weeks ago

Does anyone have a Google Colab notebook for this, to convert models? I need Japanese TTS voices.

Please see https://colab.research.google.com/drive/1XsKyAXti1e6_qYiJ3Fiyt8E7d1lPch75?usp=sharing

It is for the Chinese+English MeloTTS model.

nanaghartey commented 2 weeks ago

Does anyone have a Google Colab notebook for this, to convert models? I need Japanese TTS voices.

Please see https://colab.research.google.com/drive/1XsKyAXti1e6_qYiJ3Fiyt8E7d1lPch75?usp=sharing

It is for the Chinese+English MeloTTS model.

Is there one for English only? In the future, if there is a way to convert a standard English model from the official training script, can you share it here? Thanks

csukuangfj commented 2 weeks ago

Sorry, I only have this one.