k2-fsa / sherpa-onnx

Speech-to-text, text-to-speech, speaker recognition, and VAD using next-gen Kaldi with onnxruntime, without an Internet connection. Supports embedded systems, Android, iOS, Raspberry Pi, RISC-V, x86_64 servers, websocket server/client, C/C++, Python, Kotlin, C#, Go, NodeJS, Java, Swift, Dart, JavaScript, Flutter, Object Pascal, Lazarus, Rust
https://k2-fsa.github.io/sherpa/onnx/index.html
Apache License 2.0

Questions related to MeloTTS #1193

Open · eehoeskrap opened this issue 1 month ago

eehoeskrap commented 1 month ago

Thank you for creating a great repository. I wonder why there is no bert input when converting a PyTorch MeloTTS model to an ONNX model. https://github.com/k2-fsa/sherpa-onnx/blob/963aaba82b01a425ae8dcf0fdcff6b073a45686f/scripts/melo-tts/export-onnx.py#L206C1-L235C6

    torch.onnx.export(
        torch_model,
        (
            x,
            x_lengths,
            tones,
            sid,
            noise_scale,
            length_scale,
            noise_scale_w,
        ),
        filename,
        opset_version=opset_version,
        input_names=[
            "x",
            "x_lengths",
            "tones",
            "sid",
            "noise_scale",
            "length_scale",
            "noise_scale_w",
        ],
        output_names=["y"],
        dynamic_axes={
            "x": {0: "N", 1: "L"},
            "x_lengths": {0: "N"},
            "tones": {0: "N", 1: "L"},
            "y": {0: "N", 1: "S", 2: "T"},
        },
    )
csukuangfj commented 1 month ago

Could you tell us how to get the input for bert from text?

Is there any C++ implementation for that?

eehoeskrap commented 1 month ago

In this code, you can get the bert value through the get_bert function. get_bert calls a different torch model for each language, and there is only a Python implementation. https://github.com/myshell-ai/MeloTTS/blob/144a0980fac43411153209cf08a1998e3c161e10/melo/utils.py#L22
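
Roughly, the flow looks like the sketch below (this is not the exact get_bert implementation; the checkpoint name and the word-to-phone mapping word2ph are placeholders):

    import torch
    from transformers import AutoModel, AutoTokenizer

    # Placeholder checkpoint; each language uses its own BERT model in MeloTTS
    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    bert_model = AutoModel.from_pretrained("bert-base-multilingual-cased")

    def get_bert_features(text: str, word2ph: list) -> torch.Tensor:
        # Tokenize the text and run the BERT encoder; this is the step that
        # currently has no C++ counterpart
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            hidden = bert_model(**inputs).last_hidden_state[0]  # (num_tokens, 768)
        # Repeat each token's feature vector for the number of phones it maps
        # to, so the result aligns with the phone sequence: (768, num_phones)
        phone_level = torch.cat(
            [hidden[i].unsqueeze(0).repeat(n, 1) for i, n in enumerate(word2ph)]
        )
        return phone_level.t()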

eehoeskrap commented 1 month ago

In your code, there is a part in ModelWrapper where bert and ja_bert are passed as model inputs. https://github.com/k2-fsa/sherpa-onnx/blob/963aaba82b01a425ae8dcf0fdcff6b073a45686f/scripts/melo-tts/export-onnx.py#L172

So even though I specified input_names as below when exporting to ONNX, the resulting onnx file has no bert among its inputs.

    torch.onnx.export(
        torch_model,
        (
            x,
            x_lengths,
            sid,
            tones,
            lang_id,
            bert,
            ja_bert,
            sdp_ratio,
            noise_scale,
            noise_scale_w,
            length_scale,
        ),
        filename,
        opset_version=opset_version,
        input_names=[
            "x",
            "x_lengths",
            "sid",
            "tones",
            "lang_id",
            "bert",
            "ja_bert",
            "sdp_ratio",
            "noise_scale",
            "noise_scale_w",
            "length_scale",
        ],
        output_names=["y"],
        dynamic_axes={
            "x": {0: "N", 1: "L"},
            "x_lengths": {0: "N"},
            "tones": {0: "N", 1: "L"},
            "lang_id": {0: "N", 1: "L"},
            "bert": {0: "N", 1: "L", 2: "D"},
            "ja_bert": {0: "N", 1: "L", 2: "D"},
            "y": {0: "N", 1: "S", 2: "T"},
        },
    )
csukuangfj commented 1 month ago

Could you tell us how to get the input for bert from text?

Is there any C++ implementation for that?

Please have a look at this comment. That is the main obstacle. If you can fix it, then we can support bert.
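
One possible reason bert does not show up among the ONNX inputs is that torch.onnx.export keeps only the inputs that actually reach an op in the traced graph; if the wrapper still builds bert internally, the tensor you pass in is never used and gets dropped. A toy sketch of that behaviour (unrelated to MeloTTS):

    import onnx
    import torch

    class Toy(torch.nn.Module):
        def forward(self, x, unused):
            # "unused" never reaches any op, so the tracer drops it
            return x * 2

    torch.onnx.export(
        Toy(),
        (torch.zeros(2, 3), torch.zeros(2, 3)),
        "toy.onnx",
        input_names=["x", "unused"],
        output_names=["y"],
    )
    # Typically prints ['x'] only; "unused" is not a graph input
    print([i.name for i in onnx.load("toy.onnx").graph.input])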

csukuangfj commented 1 month ago

In this code, you can get the bert value through the get_bert function.

Yes, I know that. I am asking whether you know of a C++ implementation for that, or whether it is possible to implement it in C++.

eehoeskrap commented 1 month ago

In this code, you can get the bert value through the get_bert function.

Yes, I know that. I am asking whether you know of a C++ implementation for that, or whether it is possible to implement it in C++.

As far as I know, there is currently no C++ implementation of the Korean BERT model. I will try it and let you know.

csukuangfj commented 1 month ago

By the way, the main issue is about the tokenizer.

eehoeskrap commented 1 month ago

By the way, the main issue is about the tokenizer.

Yes, I know that. If you run the ONNX model with the bert value set to 0, as in this code, the generated Korean speech sounds unnatural.

https://github.com/k2-fsa/sherpa-onnx/blob/963aaba82b01a425ae8dcf0fdcff6b073a45686f/scripts/melo-tts/export-onnx.py#L162

csukuangfj commented 1 month ago

If you run the ONNX model with the bert value set to 0, as in this code, the generated Korean speech sounds unnatural.

In that case, supporting Korean models from MeloTTS in sherpa-onnx may be hard.

Could you try https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-mimic3-ko_KO-kss_low.tar.bz2

We already have a Korean TTS model in sherpa-onnx.

eehoeskrap commented 1 month ago

If you run the ONNX model with the bert value set to 0, as in this code, the generated Korean speech sounds unnatural.

In that case, supporting Korean models from MeloTTS in sherpa-onnx may be hard.

Could you try https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-mimic3-ko_KO-kss_low.tar.bz2

We already have a Korean TTS model in sherpa-onnx.

I found this repo while trying to export MeloTTS models to ONNX. When exporting to ONNX with this code, I was wondering why bert was not included. Thanks to your answer, I now understand that it is because there is no C++ implementation.

I already have a Korean TTS model trained on custom data, and I just succeeded in exporting it to ONNX with the bert values included. However, the preprocessing (tokenizer, etc.) still runs in Python.

The Korean MeloTTS torch model is exported to ONNX for inference, so it is quite fast. However, I still need to try a C++ implementation of the preprocessing, as you did. I will try this, although Korean phoneme processing is quite difficult.

As you mentioned earlier, the biggest question is indeed "How do we implement the bert torch model in C++?" First, I will try exporting the bert model to ONNX.
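
A first attempt at that export might look like the sketch below (the checkpoint name is a placeholder, and the tokenizer itself would still need a C++ port):

    import torch
    from transformers import AutoModel, AutoTokenizer

    name = "kykim/bert-kor-base"  # placeholder Korean BERT checkpoint
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name, torchscript=True)
    model.eval()

    inputs = tokenizer("안녕하세요", return_tensors="pt")
    torch.onnx.export(
        model,
        (inputs["input_ids"], inputs["attention_mask"], inputs["token_type_ids"]),
        "bert.onnx",
        opset_version=13,
        input_names=["input_ids", "attention_mask", "token_type_ids"],
        output_names=["last_hidden_state"],
        dynamic_axes={
            "input_ids": {0: "N", 1: "L"},
            "attention_mask": {0: "N", 1: "L"},
            "token_type_ids": {0: "N", 1: "L"},
            "last_hidden_state": {0: "N", 1: "L"},
        },
    )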

Thank you for the reply.

nanaghartey commented 1 month ago

@csukuangfj Unrelated question: the Android TTS works well, playing audio while it is still generating. Can you add this to the iOS TTS too? Currently the iOS version has to process the entire text before synthesizing the audio. Finally, I also noticed the iOS version can't be published to the App Store due to a framework issue.

csukuangfj commented 1 month ago

Currently the iOS version has to process the entire text before synthesizing the audio.

I just added support for passing a callback from Swift to C. Please see #1218

Please play the samples received in the callback by yourself, possibly in a separate thread. We don't have time to add that.


Finally, I also noticed the iOS version can't be published to the App Store due to a framework issue.

Please have a look at https://github.com/k2-fsa/sherpa-onnx/issues/1172


By the way, contributions to sherpa-onnx are highly appreciated.

Hope that you can fix the issues by yourself.

@nanaghartey

nanaghartey commented 1 month ago

@csukuangfj No problem. I actually made some contributions but noticed the latest version fixes most of the issues I found. For example, in sherpa-onnx/jni/jni.cc some reserved words in Java were used, preventing porting of the sample TTS Kotlin code to Java, e.g. Java_com_k2fsa_sherpa_onnx_SpeakerEmbeddingExtractor_new. Now all is good!

By the way, I just checked out MeloTTS, fine-tuned a model, and exported it to sherpa-onnx for Android. It's great. How can I help bring this to iOS? I'm not sure the SwiftUI TTS example accepts MeloTTS models.

csukuangfj commented 1 month ago

How can I help bring this to iOS? I'm not sure the SwiftUI TTS example accepts MeloTTS models.

Yes, it is already supported. In case you don't know how to do it, I just added an example for you. Please see https://github.com/k2-fsa/sherpa-onnx/pull/1223

@nanaghartey

nanaghartey commented 1 month ago

@csukuangfj I have a single-speaker fine-tuned model (MeloTTS). It works great, but when I convert it to sherpa-onnx and then use the provided zh_en .fst and .dict files on Android, I get wrong synthesis. I assumed it would work since my model is English. How can I generate the .fst and .dict files for my custom model, or can we make it work by changing the configuration?

csukuangfj commented 1 month ago

You don't need *.fst files for English-only models.

Could you post the code showing how you add the metadata?


, I get wrong synthesis.

Could you be more specific? What does "wrong" mean?

nanaghartey commented 1 month ago

@csukuangfj Thanks for the prompt response.

"wrong" here means unexpected output. wrong pronunciations.

Sorry, but this is how I export (the default export script only exports Chinese+English):

import torch
from melo.api import TTS
from melo.text import language_id_map, language_tone_start_map
from melo.text.chinese import pinyin_to_symbol_map
from melo.text.english import eng_dict, refine_syllables
from pypinyin import Style, lazy_pinyin, phrases_dict, pinyin_dict
from typing import Any, Dict
import json

# Prepare the pinyin to symbol map
for k, v in pinyin_to_symbol_map.items():
    if isinstance(v, list):
        break
    pinyin_to_symbol_map[k] = v.split()

# Function to get initial, final, and tone from pinyin
def get_initial_final_tone(word: str):
    initials = lazy_pinyin(word, neutral_tone_with_five=True, style=Style.INITIALS)
    finals = lazy_pinyin(word, neutral_tone_with_five=True, style=Style.FINALS_TONE3)

    ans_phone = []
    ans_tone = []

    for c, v in zip(initials, finals):
        raw_pinyin = c + v
        v_without_tone = v[:-1]
        try:
            tone = v[-1]
        except:
            return [], []

        pinyin = c + v_without_tone
        if c:
            v_rep_map = {
                "uei": "ui",
                "iou": "iu",
                "uen": "un",
            }
            if v_without_tone in v_rep_map.keys():
                pinyin = c + v_rep_map[v_without_tone]
        else:
            pinyin_rep_map = {
                "ing": "ying",
                "i": "yi",
                "in": "yin",
                "u": "wu",
            }
            if pinyin in pinyin_rep_map.keys():
                pinyin = pinyin_rep_map[pinyin]
            else:
                single_rep_map = {
                    "v": "yu",
                    "e": "e",
                    "i": "y",
                    "u": "w",
                }
                if pinyin[0] in single_rep_map.keys():
                    pinyin = single_rep_map[pinyin[0]] + pinyin[1:]

        if pinyin not in pinyin_to_symbol_map:
            continue
        phone = pinyin_to_symbol_map[pinyin]
        ans_phone += phone
        ans_tone += [tone] * len(phone)

    return ans_phone, ans_tone

# Function to generate tokens file
def generate_tokens(symbol_list):
    with open("tokens.txt", "w", encoding="utf-8") as f:
        for i, s in enumerate(symbol_list):
            f.write(f"{s} {i}\n")

# Function to add new English words to the lexicon
def add_new_english_words(lexicon):
    lexicon["kaldi"] = [["K", "AH0"], ["L", "D", "IH0"]]
    lexicon["SF"] = [["EH1", "S"], ["EH1", "F"]]

# Function to generate lexicon file
def generate_lexicon():
    word_dict = pinyin_dict.pinyin_dict
    phrases = phrases_dict.phrases_dict
    add_new_english_words(eng_dict)
    with open("lexicon.txt", "w", encoding="utf-8") as f:
        for word in eng_dict:
            phones, tones = refine_syllables(eng_dict[word])
            tones = [t + language_tone_start_map["EN"] for t in tones]
            tones = [str(t) for t in tones]

            phones = " ".join(phones)
            tones = " ".join(tones)

            f.write(f"{word.lower()} {phones} {tones}\n")

        for key in word_dict:
            if not (0x4E00 <= key <= 0x9FA5):
                continue
            w = chr(key)
            phone, tone = get_initial_final_tone(w)
            if not phone:
                continue
            phone = " ".join(phone)
            tone = " ".join(tone)
            f.write(f"{w} {phone} {tone}\n")

        for w in phrases:
            phone, tone = get_initial_final_tone(w)
            if not phone:
                continue
            phone = " ".join(phone)
            tone = " ".join(tone)
            f.write(f"{w} {phone} {tone}\n")

# Function to add metadata to ONNX model
def add_meta_data(filename: str, meta_data: Dict[str, Any]):
    import onnx
    model = onnx.load(filename)
    while len(model.metadata_props):
        model.metadata_props.pop()

    for key, value in meta_data.items():
        meta = model.metadata_props.add()
        meta.key = key
        meta.value = str(value)

    onnx.save(model, filename)

# ModelWrapper class definition
class ModelWrapper(torch.nn.Module):
    def __init__(self, model: "SynthesizerTrn"):
        super().__init__()
        self.model = model
        self.lang_id = language_id_map[model.language]

    def forward(
        self,
        x,
        x_lengths,
        tones,
        sid,
        noise_scale,
        length_scale,
        noise_scale_w,
        max_len=None,
    ):
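        # The exported graph has no bert/ja_bert inputs; all-zero placeholders are created here instead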
        bert = torch.zeros(x.shape[0], 1024, x.shape[1], dtype=torch.float32)
        ja_bert = torch.zeros(x.shape[0], 768, x.shape[1], dtype=torch.float32)
        lang_id = torch.zeros_like(x)
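        # put the language id on every other position (tokens are interleaved with blanks when add_blank is used)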
        lang_id[:, 1::2] = self.lang_id
        return self.model.model.infer(
            x=x,
            x_lengths=x_lengths,
            sid=sid,
            tone=tones,
            language=lang_id,
            bert=bert,
            ja_bert=ja_bert,
            noise_scale=noise_scale,
            noise_scale_w=noise_scale_w,
            length_scale=length_scale,
        )[0]

# Main function to handle model loading and ONNX export
def main():
    generate_lexicon()  # Generate the lexicon.txt file

    model_path = "model.pth"  # Path to your custom model
    config_path = "config.json"  # Path to your config.json file
    with open(config_path, 'r') as f:
        config = json.load(f)

    model = TTS(language="EN", device="cpu", config_path=config_path, ckpt_path=model_path)
    model.load_state_dict(torch.load(model_path, map_location="cpu"), strict=False)

    generate_tokens(config["symbols"])  # Generate tokens.txt file

    torch_model = ModelWrapper(model)

    x = torch.randint(low=0, high=10, size=(60,), dtype=torch.int64)
    x_lengths = torch.tensor([x.size(0)], dtype=torch.int64)
    sid = torch.tensor([0], dtype=torch.int64)
    tones = torch.zeros_like(x)

    noise_scale = torch.tensor([1.0], dtype=torch.float32)
    length_scale = torch.tensor([1.0], dtype=torch.float32)
    noise_scale_w = torch.tensor([1.0], dtype=torch.float32)

    x = x.unsqueeze(0)
    tones = tones.unsqueeze(0)

    filename = "model.onnx"
    torch.onnx.export(
        torch_model,
        (x, x_lengths, tones, sid, noise_scale, length_scale, noise_scale_w),
        filename,
        opset_version=13,
        input_names=["x", "x_lengths", "tones", "sid", "noise_scale", "length_scale", "noise_scale_w"],
        output_names=["y"],
        dynamic_axes={
            "x": {0: "N", 1: "L"},
            "x_lengths": {0: "N"},
            "tones": {0: "N", 1: "L"},
            "y": {0: "N", 1: "S", 2: "T"},
        },
    )

    meta_data = {
        "model_type": "melo-vits",
        "comment": "melo",
        "version": 2,
        "language": "English",
        "add_blank": int(config["data"]["add_blank"]),
        "n_speakers": config["data"]["n_speakers"],
        "jieba": 1,
        "sample_rate": config["data"]["sampling_rate"],
        "bert_dim": 1024,
        "ja_bert_dim": 768,
        "speaker_id": list(config["data"]["spk2id"].values())[0],
        "lang_id": language_id_map["EN"],
        "tone_start": language_tone_start_map["EN"],
        "url": "https://github.com/myshell-ai/MeloTTS",
        "license": "MIT license",
        "description": "MeloTTS is a high-quality multi-lingual text-to-speech library by MyShell.ai",
    }
    add_meta_data(filename, meta_data)

if __name__ == "__main__":
    main()
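
After export, the metadata written by add_meta_data can be double-checked with a short snippet like this (it only prints what is stored in the file):

    import onnx

    m = onnx.load("model.onnx")
    print({p.key: p.value for p in m.metadata_props})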

Then in api.py I do:

class TTS(nn.Module):
    def __init__(self, 
                 language,
                 device='auto',
                 use_hf=True,
                 config_path=None,
                 ckpt_path=None):
        super().__init__()
        if device == 'auto':
            device = 'cpu'
            if torch.cuda.is_available():
                device = 'cuda'
            if torch.backends.mps.is_available():
                device = 'mps'
        if 'cuda' in device:
            assert torch.cuda.is_available()

        # Load configuration from your custom config_path
        if config_path:
            hps = utils.get_hparams_from_file(config_path)
        else:
            hps = load_or_download_config(language, use_hf=use_hf)

        num_languages = hps.num_languages
        num_tones = hps.num_tones
        symbols = hps.symbols

        model = SynthesizerTrn(
            len(symbols),
            hps.data.filter_length // 2 + 1,
            hps.train.segment_size // hps.data.hop_length,
            n_speakers=hps.data.n_speakers,
            num_tones=num_tones,
            num_languages=num_languages,
            **hps.model,
        ).to(device)

        model.eval()
        self.model = model
        self.symbol_to_id = {s: i for i, s in enumerate(symbols)}
        self.hps = hps
        self.device = device

        # load state_dict
        checkpoint_dict = load_or_download_model(language, device, use_hf=use_hf, ckpt_path=ckpt_path)
        self.model.load_state_dict(checkpoint_dict['model'], strict=True)

        language = language.split('_')[0]
        self.language = 'ZH_MIX_EN' if language == 'ZH' else language
csukuangfj commented 1 month ago

"wrong" here means unexpected output. wrong pronunciations.

Could you post some text and the corresponding generated wav?


Please also post the logs if you use sherpa-onnx to generate the wav with your model.

csukuangfj commented 1 month ago

https://github.com/csukuangfj/onnxruntime-build/actions/runs/9184634501

You can see from the above link that we can successfully build a debug version of the static lib.

nanaghartey commented 1 month ago

"wrong" here means unexpected output. wrong pronunciations.

Could you post some text and the corresponding generated wav?

Please also post the logs if you use sherpa-onnx to generate the wav with your model.

Custom model 1: English, news (African accent)

text - "things to look out for in the year 2020"

.pth generated wav -

https://github.com/user-attachments/assets/b6ca93ad-c38c-412c-8c6e-45e8b6e28a84

onnx generated wav -

https://github.com/user-attachments/assets/6dead35d-4ced-4883-827c-2b7cda9941fc

Custom model 2: English, singing (US accent)

text - "next time won't you sing with me"

.pth generated wav -

https://github.com/user-attachments/assets/4ce3f2be-a7ea-404d-be90-f4e80d712ab3

onnx generated wav -

https://github.com/user-attachments/assets/d7b0ce43-dca6-48ad-b2b6-33dc28a5ef31

I use sherpa-onnx but don't get logs. I was only trying out MeloTTS on sherpa-onnx, so the models were not trained for long (training is not the issue though).

I hope you're able to spot the issue. Thanks

nanaghartey commented 1 month ago

@csukuangfj I can also share my model.pth and config.json files if that'd help.

csukuangfj commented 1 month ago

When you use .pth to test your model, can you zero out the bert part and try again?

nanaghartey commented 1 month ago

When you use .pth to test your model, can you zero out the bert part and try again?

The result is still better than the ONNX output when I zero out the bert part.

csukuangfj commented 1 month ago

Could you show the code about how you did that?

nanaghartey commented 1 month ago

In api.py, in def tts_to_file(), I did:

    bert = torch.zeros_like(bert).to(device)

Please share your solution if that is wrong.

csukuangfj commented 1 month ago

Could you please post the complete code?

nanaghartey commented 1 month ago

Could you please post the complete code?

    def tts_to_file(self, text, speaker_id, output_path=None, sdp_ratio=0.2, noise_scale=0.6, noise_scale_w=0.8, speed=1.0, pbar=None, format=None, position=None, quiet=False,):
        language = self.language
        texts = self.split_sentences_into_pieces(text, language, quiet)
        audio_list = []
        if pbar:
            tx = pbar(texts)
        else:
            if position:
                tx = tqdm(texts, position=position)
            elif quiet:
                tx = texts
            else:
                tx = tqdm(texts)
        for t in tx:
            if language in ['EN', 'ZH_MIX_EN']:
                t = re.sub(r'([a-z])([A-Z])', r'\1 \2', t)
            device = self.device
            bert, ja_bert, phones, tones, lang_ids = utils.get_text_for_tts_infer(t, language, self.hps, device, self.symbol_to_id)
            #bert = torch.zeros_like(bert).to(device)
            #ja_bert = torch.zeros_like(ja_bert).to(device)
            with torch.no_grad():
                x_tst = phones.to(device).unsqueeze(0)
                tones = tones.to(device).unsqueeze(0)
                lang_ids = lang_ids.to(device).unsqueeze(0)
                bert = bert.to(device).unsqueeze(0)
                ja_bert = ja_bert.to(device).unsqueeze(0)
                x_tst_lengths = torch.LongTensor([phones.size(0)]).to(device)
                del phones
                speakers = torch.LongTensor([speaker_id]).to(device)
                audio = self.model.infer(
                        x_tst,
                        x_tst_lengths,
                        speakers,
                        tones,
                        lang_ids,
                        bert,
                        ja_bert,
                        sdp_ratio=sdp_ratio,
                        noise_scale=noise_scale,
                        noise_scale_w=noise_scale_w,
                        length_scale=1. / speed,
                    )[0][0, 0].data.cpu().float().numpy()
                del x_tst, tones, lang_ids, bert, ja_bert, x_tst_lengths, speakers
                # 
            audio_list.append(audio)
        torch.cuda.empty_cache()
        audio = self.audio_numpy_concat(audio_list, sr=self.hps.data.sampling_rate, speed=speed)

        if output_path is None:
            return audio
        else:
            if format:
                soundfile.write(output_path, audio, self.hps.data.sampling_rate, format=format)
            else:
                soundfile.write(output_path, audio, self.hps.data.sampling_rate)
csukuangfj commented 1 month ago

In api.py, in def tts_to_file(), I did:

    bert = torch.zeros_like(bert).to(device)

Please share your solution if that is wrong.

Could you change

           bert, ja_bert, phones, tones, lang_ids = utils.get_text_for_tts_infer(t, language, self.hps, device, self.symbol_to_id)
           #bert = torch.zeros_like(bert).to(device)
           #ja_bert = torch.zeros_like(ja_bert).to(device)

to

           bert, ja_bert, phones, tones, lang_ids = utils.get_text_for_tts_infer(t, language, self.hps, device, self.symbol_to_id)
           bert.zero_()
           ja_bert.zero_()
nanaghartey commented 1 month ago

@csukuangfj

      bert, ja_bert, phones, tones, lang_ids = utils.get_text_for_tts_infer(t, language, self.hps, device, self.symbol_to_id)
           bert.zero_()
           ja_bert.zero_()

The result is a generated wav that sounds almost the same as the original .pth inference (without zeroing out), except for a few pronunciations that sound off. However, it's way better than the wavs from ONNX above. Here is the output with bert zeroed out:

https://github.com/user-attachments/assets/79b12910-318a-432d-8d08-0687d31e566b

https://github.com/user-attachments/assets/237d7511-a562-4674-b893-ddb2e5de54ea

I then tried:

    bert = torch.zeros(x.shape[0], 1024, x.shape[1], dtype=torch.float32)
    ja_bert = torch.zeros(x.shape[0], 768, x.shape[1], dtype=torch.float32)
    bert.zero_()
    ja_bert.zero_()

in export-onnx.py for the ONNX conversion, but I got the same "wrong" results shared earlier.
csukuangfj commented 1 month ago

Please compare the inputs to the model manually and see if they are the same.

nanaghartey commented 1 month ago

Please compare the inputs to the model manually and see if they are the same.

My .pth has:

BERT input shape: torch.Size([1024, 71])
JA_BERT input shape: torch.Size([768, 71])
Phones input shape: torch.Size([71])
Tones input shape: torch.Size([71])
Language IDs shape: torch.Size([71])

What changes can I make to the ONNX export script, or is there any other way to get this EN model to work with sherpa-onnx TTS? :(

csukuangfj commented 1 month ago

https://github.com/k2-fsa/sherpa-onnx/blob/master/scripts/melo-tts/test.py

Please use this script to test the ONNX model.

By comparing the model inputs, I mean comparing the values of the inputs, including their shapes.
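
For example, you could dump the tensors from the .pth path and from the ONNX test script to files and compare them roughly like this (a sketch; the file names are placeholders):

    import numpy as np
    import torch

    # Inside tts_to_file(), e.g.:
    #   torch.save({"x": x_tst, "tones": tones, "lang_ids": lang_ids}, "pth_inputs.pt")
    # and save the corresponding tensors from the ONNX test script as onnx_inputs.pt
    a = torch.load("pth_inputs.pt")
    b = torch.load("onnx_inputs.pt")
    for k in a:
        print(k, tuple(a[k].shape), tuple(b[k].shape),
              np.allclose(a[k].numpy(), b[k].numpy()))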

nanaghartey commented 1 month ago

https://github.com/k2-fsa/sherpa-onnx/blob/master/scripts/melo-tts/test.py

Please use this script to test the ONNX model.

By comparing the model inputs, I mean comparing the values of the inputs, including their shapes.

Below is the output. What next step should I take, please?

Dumping model to file cache /var/folders/vf/13g26rdn3673b6cxhlhy03yh0000gn/T/jieba.cache
Loading model cost 1.013 seconds.
Prefix dict has been built successfully.
这是
t 这是
w 这
这
w 是
是
一个
使用
t 使用
w 使
使
w 用
用

next

generation

kaldi

的

text

to

speech

中英文
t 中英文
w 中
中
w 英
英
w 文
文
例子
.

Thank
t Thank
w T
T
t T
w h
h
w a
a
w n
n
w k
k

you
!

你
觉得
如何
呢
?

are

you

ok
?

Fantastic
t Fantastic
w F
F
t F
w a
a
w n
n
w t
t
w a
a
w s
s
w t
t
w i
i
w c
c
!

How
t How
w H
H
t H
w o
o
w w
w

about

you
?
torch.Size([265]) torch.Size([265])
torch.Size([1, 265]) torch.Size([1, 265])
csukuangfj commented 1 month ago

Please compare the input tensor values.

nanaghartey commented 1 month ago

@csukuangfj I tried that but still didn't get good results. I even tried exporting the official MeloTTS models on Hugging Face: https://huggingface.co/myshell-ai

At this point I think I may just have to continue using piper/coqui, though it doesn't sound as good as MeloTTS. Thanks for all the support :)

csukuangfj commented 1 month ago

By the way, the comparison is to help with debugging.

dhc45010 commented 1 month ago

When calling vits-melo-tts-zh_en through sherpa-onnx GPU (1.10.17+cuda), I get an error; what could be the reason? (The CPU version works.) python3 ./python-api-examples/offline-tts-play.py --vits-model=./vits-melo-tts-zh_en/model.onnx

csukuangfj commented 1 month ago

When calling vits-melo-tts-zh_en through sherpa-onnx GPU (1.10.17+cuda), I get an error; what could be the reason? (The CPU version works.) python3 ./python-api-examples/offline-tts-play.py --vits-model=./vits-melo-tts-zh_en/model.onnx

We cannot solve this problem at the moment.

dhc45010 commented 1 month ago

When calling vits-melo-tts-zh_en through sherpa-onnx GPU (1.10.17+cuda), I get an error; what could be the reason? (The CPU version works.) python3 ./python-api-examples/offline-tts-play.py --vits-model=./vits-melo-tts-zh_en/model.onnx

We cannot solve this problem at the moment.

OK, thanks.

studionexus-lk commented 3 weeks ago

Does anyone have a Google Colab notebook for this, to convert models? I need Japanese TTS voices.

csukuangfj commented 3 weeks ago

Does anyone have a Google Colab notebook for this, to convert models? I need Japanese TTS voices.

Please see https://colab.research.google.com/drive/1XsKyAXti1e6_qYiJ3Fiyt8E7d1lPch75?usp=sharing

It is for the Chinese+English MeloTTS model.

nanaghartey commented 2 weeks ago

Does anyone have a Google Colab notebook for this, to convert models? I need Japanese TTS voices.

Please see https://colab.research.google.com/drive/1XsKyAXti1e6_qYiJ3Fiyt8E7d1lPch75?usp=sharing

It is for the Chinese+English MeloTTS model.

Is there one for English only? In the future, if there is a way to convert a standard English model from the official training script, can you share it here? Thanks

csukuangfj commented 2 weeks ago

Sorry, I only have this one.