rhasspy / piper

A fast, local neural text to speech system
https://rhasspy.github.io/piper-samples/
MIT License

FYI: Run models from piper with the Next-gen Kaldi subproject sherpa-onnx #251

Open csukuangfj opened 6 months ago

csukuangfj commented 6 months ago

FYI: We have added support for piper models in https://github.com/k2-fsa/sherpa-onnx

Note that it does not depend on https://github.com/rhasspy/piper-phonemize

sherpa-onnx supports a variety of platforms, such as Linux, macOS, Windows, Android, iOS, and embedded boards like the Raspberry Pi.

It also provides APIs for various programming languages, e.g., C/C++/Python/Kotlin/Swift/C#/Go. We also have Android APKs for TTS.

You can find the installation doc at https://k2-fsa.github.io/sherpa/onnx/install/index.html

You can find the usage of piper models with sherpa-onnx at https://k2-fsa.github.io/sherpa/onnx/tts/pretrained_models/vits.html#lessac-blizzard2013-medium-english-single-speaker
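
For Python users, here is a minimal sketch of running a converted piper voice through the sherpa-onnx Python API. File names are placeholders, soundfile is assumed to be installed, and the exact config fields may differ slightly between sherpa-onnx versions:

import sherpa_onnx
import soundfile as sf

# All paths are placeholders; use the files shipped with a converted piper voice.
config = sherpa_onnx.OfflineTtsConfig(
    model=sherpa_onnx.OfflineTtsModelConfig(
        vits=sherpa_onnx.OfflineTtsVitsModelConfig(
            model="./en_US-lessac-medium.onnx",
            lexicon="./lexicon.txt",
            tokens="./tokens.txt",
        ),
        num_threads=2,
        provider="cpu",
    ),
)
tts = sherpa_onnx.OfflineTts(config)
audio = tts.generate("This is a test of a piper voice running through sherpa-onnx.", sid=0)
sf.write("test.wav", audio.samples, samplerate=audio.sample_rate)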

We also have a huggingface space for you to try piper models with sherpa-onnx. Please visit https://huggingface.co/spaces/k2-fsa/text-to-speech



You can find the PR supporting piper in sherpa-onnx at https://github.com/k2-fsa/sherpa-onnx/pull/390

mush42 commented 6 months ago

@csukuangfj Where can we find the Android APKs?

beqabeqa473 commented 6 months ago

@csukuangfj Yes, it would be good to know about Android TTS as well. Could you please tell us where to get it?

csukuangfj commented 6 months ago

I'm sorry for not getting back to you sooner.

I have been working on converting more models from piper.

Now all the models for the following languages have been converted to sherpa-onnx:

  • English (both US and GB)
  • French
  • German
  • Spanish (both ES and MX)

You can find the Android APKs on the following page. https://k2-fsa.github.io/sherpa/onnx/tts/apk.html

beqabeqa473 commented 6 months ago

Is it using the standard Android text-to-speech API or not?


csukuangfj commented 6 months ago

Is it using the standard Android text-to-speech API or not?

@beqabeqa473

No, it uses sherpa-onnx with pre-trained VITS models for TTS.

Everything is open-sourced. You can find the source code for the Android project at https://github.com/k2-fsa/sherpa-onnx/tree/master/android/SherpaOnnxTts

The underlying C++ code can be found at https://github.com/k2-fsa/sherpa-onnx

The JNI C++ binding code can be found at https://github.com/k2-fsa/sherpa-onnx/tree/master/sherpa-onnx/jni

You can find Kotlin API examples at https://github.com/k2-fsa/sherpa-onnx/tree/master/kotlin-api-examples

beqabeqa473 commented 6 months ago

Ah, OK, I meant standard TTS-engine API bindings. I may try to do it at some point in the future, to use this TTS as a standard Android TTS engine, for example with screen readers.


synesthesiam commented 6 months ago

Thanks for doing this @csukuangfj! I'd looked into sherpa-onnx at one point, but wasn't sure how to proceed. I'd like to link to your work when you think it's stable enough; I do want to make sure people understand that pronunciations may be slightly different due to the pre-computed lexicon.

Speaking of the lexicon, could it be extended dynamically at runtime with your approach?

csukuangfj commented 6 months ago

@synesthesiam

but wasn't sure how to proceed.

We have detailed documentation at https://k2-fsa.github.io/sherpa/onnx/

Could you tell us what you want to do? We can clarify the doc if you think it is not clear.


I do want to make sure people understand that pronunciations may be slightly different due to the pre-computed lexicon.

The lexicon.txt is generated by following the colab notebook from this repo https://github.com/rhasspy/piper/blob/master/notebooks/piper_inference_(ONNX).ipynb

The exact code can be found at https://github.com/csukuangfj/models/tree/master/.github/scripts

Could you explain where the difference comes from?

Speaking of the lexicon, could it be extended dynamically at runtime with your approach?

No, it cannot. If there is an out-of-vocabulary (OOV) word at runtime, it is simply ignored, though a message is printed to tell the user that the word has been skipped.
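
To illustrate the behavior (this is not the actual C++ implementation, just a Python sketch assuming a lexicon format of one word per line followed by its phonemes):

def load_lexicon(path):
    # Assumed format: "word phoneme1 phoneme2 ..." on each line
    lexicon = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            word, *phonemes = line.split()
            lexicon[word.lower()] = phonemes
    return lexicon

def words_to_phonemes(text, lexicon):
    phonemes = []
    for word in text.lower().split():
        if word in lexicon:
            phonemes.extend(lexicon[word])
        else:
            # OOV words are skipped with a warning; the lexicon is fixed at export time
            print(f"warning: ignoring OOV word '{word}'")
    return phonemes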

I'd like to link to your work when you think it's stable enough;

Thank you! I think the support for offline VITS models is stable now. (The APIs for the VITS model are quite simple and there should be no big changes to the APIs in the near future)

synesthesiam commented 6 months ago

Could you tell us what you want to do? We can clarify the doc if you think it is not clear.

I meant more "big picture" in how I should proceed. I wasn't sure if it was worth investigating porting Piper to sherpa-onnx. I'd be curious if you've noticed any speed difference.

csukuangfj commented 5 months ago

Thanks for doing this @csukuangfj! I'd looked into sherpa-onnx at one point, but wasn't sure how to proceed. I'd like to link to your work when you think it's stable enough; I do want to make sure people understand that pronunciations may be slightly different due to the pre-computed lexicon.

Speaking of the lexicon, could it be extended dynamically at runtime with your approach?

@synesthesiam I am integrating piper-phonemize so that we can discard lexicon.txt in sherpa-onnx.

Could you have a look at the following two PRs?

csukuangfj commented 5 months ago

https://huggingface.co/csukuangfj/vits-piper-pt_PT-tugao-medium/tree/main

I have converted all of the models from piper to sherpa-onnx. No lexicon.txt is required any more. I am using piper-phonemize.

(Note that you can now run all the models on Android/iOS/Raspberry Pi, etc.)
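
As a rough sketch of what replaces the lexicon: text is converted to espeak-ng IPA phonemes, and the phonemes are mapped to IDs through tokens.txt. The snippet below assumes the phonemize_espeak helper from the piper-phonemize Python bindings; sherpa-onnx itself does the ID mapping in C++:

from piper_phonemize import phonemize_espeak

def load_tokens(path):
    # tokens.txt: one "symbol id" pair per line (the symbol may itself be a space)
    token2id = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            symbol, idx = line.rstrip("\n").rsplit(" ", 1)
            token2id[symbol] = int(idx)
    return token2id

token2id = load_tokens("tokens.txt")
# phonemize_espeak returns one list of IPA phonemes per sentence
sentences = phonemize_espeak("How are you doing?", "en-us")
ids = [token2id[p] for sentence in sentences for p in sentence if p in token2id]
print(ids)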

anita-smith1 commented 5 months ago

@csukuangfj "No lexicon.txt is required any more. I am using piper-phonemize."

does this apply to piper models only? Is a lexicon required for coqui tts models? I'm following up on [#257](https://github.com/rhasspy/piper/issues/257)

I couldn't use my coqui tts model converted for sherpa-onnx because I had to manually add words to the lexicon, and there was poor pronunciation for single words.

csukuangfj commented 5 months ago

Is a lexicon required for coqui tts models?

No, it is also not required for coqui tts models.

None of the VITS models from coqui use lexicon.txt with sherpa-onnx.


I couldn't use my coqui tts model converted for sherpa-onnx because I had to manually add words to the lexicon, and there was poor pronunciation for single words.

Please take a look at any one of the coqui models at https://github.com/k2-fsa/sherpa-onnx/releases/tag/tts-models

For instance, you can look at https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-coqui-en-ljspeech.tar.bz2

Download it, unzip it, and you will find the code for exporting models from coqui to sherpa-onnx.
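
For example, that archive can be fetched and unpacked from Python like this (a convenience sketch; the exact name of the conversion script inside the archive may differ between models):

import tarfile
import urllib.request

url = ("https://github.com/k2-fsa/sherpa-onnx/releases/download/"
       "tts-models/vits-coqui-en-ljspeech.tar.bz2")
urllib.request.urlretrieve(url, "vits-coqui-en-ljspeech.tar.bz2")

with tarfile.open("vits-coqui-en-ljspeech.tar.bz2") as tf:
    tf.extractall()
    # List the Python scripts shipped with the model, including the export script
    print([name for name in tf.getnames() if name.endswith(".py")])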

anita-smith1 commented 5 months ago

@csukuangfj does that mean your notebook doesn't work anymore? https://colab.research.google.com/drive/1cI9VzlimS51uAw4uCR-OBeSXRPBc4KoK?usp=sharing

csukuangfj commented 5 months ago

@csukuangfj does that mean your notebook doesn't work anymore? https://colab.research.google.com/drive/1cI9VzlimS51uAw4uCR-OBeSXRPBc4KoK?usp=sharing

I just updated the colab notebook. Please reload it.

@anita-smith1

The updated colab notebook is much much simpler than before.

anita-smith1 commented 4 months ago

@csukuangfj does that mean your notebook doesn't work anymore? https://colab.research.google.com/drive/1cI9VzlimS51uAw4uCR-OBeSXRPBc4KoK?usp=sharing

I just updated the colab notebook. Please reload it.

@anita-smith1

The updated colab notebook is much much simpler than before.

Your colab notebook works for the default vits models, but when I use my fine-tuned vits model, which contains words like "orrse" and "atua" (not in the English dictionary), I get the error Error when reading tokens at Line <PAD> 0. size: 5 when I try to synthesize speech. It seems to be a tokens.txt issue.

The first colab, which used lexicons, worked, but this one does not work with a fine-tuned model containing your own words. How can we solve this issue?


csukuangfj commented 4 months ago

Please show your metadata and add --debug=1 to your command line.

anita-smith1 commented 4 months ago

--debug=1

meta_data {'model_type': 'vits', 'comment': 'coqui', 'language': 'English', 'voice': 'en-us', 'has_espeak': 1, 'add_blank': 1, 'blank_id': 3, 'n_speakers': 0, 'use_eos_bos': 0, 'bos_id': 2, 'eos_id': 1, 'sample_rate': 22050}

After adding --debug=1, I get this output:

/project/sherpa-onnx/csrc/parse-options.cc:Read:361 sherpa-onnx-offline-tts --vits-model=./model.onnx --vits-tokens=./tokens.txt --vits-data-dir=./espeak-ng-data --output-filename=./test.wav --debug=1 'orrse wo betumi atua de a fa mobile' 

/project/sherpa-onnx/csrc/offline-tts-vits-model.cc:Init:79 ---vits model---
bos_id=2
use_eos_bos=0
n_speakers=0
blank_id=3
has_espeak=1
voice=en-us
sample_rate=22050
language=English
add_blank=1
comment=coqui
eos_id=1
model_type=vits
----------input names----------
0 input
1 input_lengths
2 scales
----------output names----------
0 output

/project/sherpa-onnx/csrc/piper-phonemize-lexicon.cc:ReadTokens:66 Error when reading tokens at Line <PAD> 0. size: 5
---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
[<ipython-input-13-c8218415962b>](https://localhost:8080/#) in <cell line: 1>()
----> 1 get_ipython().run_cell_magic('shell', '', '\nsherpa-onnx-offline-tts \\\n --vits-model=./model.onnx \\\n --vits-tokens=./tokens.txt \\\n --vits-data-dir=./espeak-ng-data \\\n --output-filename=./test.wav \\\n --debug=1 \\\n "orrse wo betumi atua de a fa mobile"\n')

3 frames
[/usr/local/lib/python3.10/dist-packages/google/colab/_system_commands.py](https://localhost:8080/#) in check_returncode(self)
    135   def check_returncode(self):
    136     if self.returncode:
--> 137       raise subprocess.CalledProcessError(
    138           returncode=self.returncode, cmd=self.args, output=self.output
    139       )

CalledProcessError: Command '
sherpa-onnx-offline-tts \
 --vits-model=./model.onnx \
 --vits-tokens=./tokens.txt \
 --vits-data-dir=./espeak-ng-data \
 --output-filename=./test.wav \
 --debug=1 \
 "orrse wo betumi atua de a fa mobile"
' returned non-zero exit status 255.

And this is the generated tokens.txt file content:

<PAD> 0
<EOS> 1
<BOS> 2
<BLNK> 3
a 4
b 5
c 6
d 7
e 8
f 9
h 10
i 11
j 12
k 13
l 14
m 15
n 16
o 17
p 18
q 19
r 20
s 21
t 22
u 23
v 24
w 25
x 26
y 27
z 28
æ 29
ç 30
ð 31
ø 32
ħ 33
ŋ 34
œ 35
ǀ 36
ǁ 37
ǂ 38
ǃ 39
ɐ 40
ɑ 41
ɒ 42
ɓ 43
ɔ 44
ɕ 45
ɖ 46
ɗ 47
ɘ 48
ə 49
ɚ 50
ɛ 51
ɜ 52
ɞ 53
ɟ 54
ɠ 55
ɡ 56
ɢ 57
ɣ 58
ɤ 59
ɥ 60
ɦ 61
ɧ 62
ɨ 63
ɪ 64
ɫ 65
ɬ 66
ɭ 67
ɮ 68
ɯ 69
ɰ 70
ɱ 71
ɲ 72
ɳ 73
ɴ 74
ɵ 75
ɶ 76
ɸ 77
ɹ 78
ɺ 79
ɻ 80
ɽ 81
ɾ 82
ʀ 83
ʁ 84
ʂ 85
ʃ 86
ʄ 87
ʈ 88
ʉ 89
ʊ 90
ʋ 91
ʌ 92
ʍ 93
ʎ 94
ʏ 95
ʐ 96
ʑ 97
ʒ 98
ʔ 99
ʕ 100
ʘ 101
ʙ 102
ʛ 103
ʜ 104
ʝ 105
ʟ 106
ʡ 107
ʢ 108
ʲ 109
ˈ 110
ˌ 111
ː 112
ˑ 113
˞ 114
β 115
θ 116
χ 117
ᵻ 118
ⱱ 119
! 120
' 121
( 122
) 123
, 124
- 125
. 126
: 127
; 128
? 129
  130
csukuangfj commented 4 months ago

Could you share your config.json?

The English VITS models from coqui use phonemes. All the non-English models from coqui use characters.

csukuangfj commented 4 months ago

From your config.json:

    "characters_class": "TTS.tts.utils.text.characters.IPAPhonemes",

Unfortunately, we don't support models using IPAPhonemes; only Graphemes and VitsCharacters are supported from coqui-ai/tts.
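
A quick way to check this in your own config.json (a sketch; the key is assumed to sit under "characters", as in the snippet above):

import json

with open("config.json", encoding="utf-8") as f:
    config = json.load(f)

# sherpa-onnx export supports Graphemes and VitsCharacters, but not IPAPhonemes
print(config["characters"]["characters_class"])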

You can find all supported models at https://github.com/k2-fsa/sherpa-onnx/releases/tag/tts-models

You can find the script for converting the model by unzipping the downloaded file.

anita-smith1 commented 4 months ago

@csukuangfj how can I fine-tune my model to support this? I shared the colab notebook I used in my previous message. Can you take a look? Is it possible to change the configuration and re-fine-tune my model? In case that's not possible and I decide to train/fine-tune using piper, do you have a similar colab notebook for converting a piper model to onnx?

csukuangfj commented 4 months ago

Please download a model and unzip it; you will find the conversion script inside.

anita-smith1 commented 4 months ago

@csukuangfj I have fine-tuned a model with characters_class="TTS.tts.models.vits.VitsCharacters" and I'm able to synthesize now using your colab notebook. It is working :) Thanks a lot. Now I want to try it on Android and iOS, but I can see Android uses the old code below. Will it ignore the lexicon file?

fun getOfflineTtsConfig(
    modelDir: String,
    modelName: String,
    lexicon: String,
    dataDir: String,
    ruleFsts: String
): OfflineTtsConfig? {
    return OfflineTtsConfig(
        model = OfflineTtsModelConfig(
            vits = OfflineTtsVitsModelConfig(
                model = "$modelDir/$modelName",
                lexicon = "$modelDir/$lexicon",
                tokens = "$modelDir/tokens.txt",
                dataDir = "$dataDir"
            ),
            numThreads = 2,
            debug = true,
            provider = "cpu",
        ),
        ruleFsts = ruleFsts,
    )
}
csukuangfj commented 4 months ago

Please see where and how this function is called.

csukuangfj commented 4 months ago

Please see https://github.com/k2-fsa/sherpa-onnx/blob/master/android/SherpaOnnxTts/app/src/main/java/com/k2fsa/sherpa/onnx/MainActivity.kt#L172

https://github.com/k2-fsa/sherpa-onnx/blob/0f053d80408b70efde3c8a37f5eeed1c5fd7f837/android/SherpaOnnxTts/app/src/main/java/com/k2fsa/sherpa/onnx/MainActivity.kt#L167-L183


        // Example 1:
        // modelDir = "vits-vctk"
        // modelName = "vits-vctk.onnx"
        // lexicon = "lexicon.txt"

        // Example 2:
        // https://github.com/k2-fsa/sherpa-onnx/releases/tag/tts-models
        // https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-piper-en_US-amy-low.tar.bz2
        // modelDir = "vits-piper-en_US-amy-low"
        // modelName = "en_US-amy-low.onnx"
        // dataDir = "vits-piper-en_US-amy-low/espeak-ng-data"

        // Example 3:
        // modelDir = "vits-zh-aishell3"
        // modelName = "vits-aishell3.onnx"
        // ruleFsts = "vits-zh-aishell3/rule.fst"
        // lexicon = "lexicon.txt"

In your case, please use Example 2.

@anita-smith1

anita-smith1 commented 4 months ago

@csukuangfj Thanks a lot for your patience. I'm learning a lot as a beginner. I have run the Android app with the version 1.9.3 .so files and it worked, but I had to make some changes to the initAudioTrack() function. It crashed with an invalid audio buffer size:

java.lang.RuntimeException: Unable to start activity ComponentInfo{com.k2fsa.sherpa.onnx/com.k2fsa.sherpa.onnx.MainActivity}: java.lang.IllegalArgumentException: Invalid audio buffer size.
    at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:4184)
    at android.app.ActivityThread.handleLaunchActivity(ActivityThread.java:4340)
    at android.app.servertransaction.LaunchActivityItem.execute(LaunchActivityItem.java:101)
    at android.app.servertransaction.TransactionExecutor.executeCallbacks(TransactionExecutor.java:135)
    at android.app.servertransaction.TransactionExecutor.execute(TransactionExecutor.java:95)
    at android.app.ActivityThread$H.handleMessage(ActivityThread.java:2584)
    at android.os.Handler.dispatchMessage(Handler.java:106)
    at android.os.Looper.loopOnce(Looper.java:226)
    at android.os.Looper.loop(Looper.java:313)
    at android.app.ActivityThread.main(ActivityThread.java:8810)
    at java.lang.reflect.Method.invoke(Native Method)
    at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:604)
    at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1067)
Caused by: java.lang.IllegalArgumentException: Invalid audio buffer size.
    at android.media.AudioTrack.audioBuffSizeCheck(AudioTrack.java:1955)
    at android.media.AudioTrack.<init>(AudioTrack.java:810)
    at android.media.AudioTrack.<init>(AudioTrack.java:752)
    at com.k2fsa.sherpa.onnx.MainActivity.initAudioTrack(MainActivity.kt:78)
    at com.k2fsa.sherpa.onnx.MainActivity.onCreate(MainActivity.kt:40)
    at android.app.Activity.performCreate(Activity.java:8657)
    at android.app.Activity.performCreate(Activity.java:8636)
    at android.app.Instrumentation.callActivityOnCreate(Instrumentation.java:1417)
    at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:4165)
    at android.app.ActivityThread.handleLaunchActivity(ActivityThread.java:4340)
    at android.app.servertransaction.LaunchActivityItem.execute(LaunchActivityItem.java:101)
    at android.app.servertransaction.TransactionExecutor.executeCallbacks(TransactionExecutor.java:135)
    at android.app.servertransaction.TransactionExecutor.execute(TransactionExecutor.java:95)
    at android.app.ActivityThread$H.handleMessage(ActivityThread.java:2584)
    at android.os.Handler.dispatchMessage(Handler.java:106)
    at android.os.Looper.loopOnce(Looper.java:226)
    at android.os.Looper.loop(Looper.java:313)
    at android.app.ActivityThread.main(ActivityThread.java:8810)
    at java.lang.reflect.Method.invoke(Native Method)
    at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:604)
    at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1067)

I had to change the original to the version below, which worked, but I'm not sure if it has any implications:

private fun initAudioTrack() {
        val sampleRate = tts.sampleRate()
        val minBufferSize = AudioTrack.getMinBufferSize(
            sampleRate,
            AudioFormat.CHANNEL_OUT_MONO,
            AudioFormat.ENCODING_PCM_FLOAT
        )

        // Check if getMinBufferSize returned a valid size
        if (minBufferSize == AudioTrack.ERROR || minBufferSize == AudioTrack.ERROR_BAD_VALUE) {
            Log.e(TAG, "Invalid minimum buffer size: $minBufferSize")
            return
        }

        // Ensure buffer size is at least 0.1 seconds of audio or the minimum buffer size, whichever is larger
        val bufLength = max((sampleRate * 0.1).toInt(), minBufferSize)
        Log.i(TAG, "sampleRate: $sampleRate, bufLength: $bufLength")

        val attr = AudioAttributes.Builder()
            .setContentType(AudioAttributes.CONTENT_TYPE_SPEECH)
            .setUsage(AudioAttributes.USAGE_MEDIA)
            .build()

        val format = AudioFormat.Builder()
            .setEncoding(AudioFormat.ENCODING_PCM_FLOAT)
            .setChannelMask(AudioFormat.CHANNEL_OUT_MONO)
            .setSampleRate(sampleRate)
            .build()

        try {
            track = AudioTrack(attr, format, bufLength, AudioTrack.MODE_STREAM, AudioManager.AUDIO_SESSION_ID_GENERATE)

            // Check if AudioTrack is initialized properly
            if (track.state != AudioTrack.STATE_INITIALIZED) {
                Log.e(TAG, "AudioTrack initialization failed")
                return
            }

            track.play()
        } catch (e: IllegalArgumentException) {
            Log.e(TAG, "AudioTrack initialization failed: ${e.message}")
        }
    }
csukuangfj commented 4 months ago

Thanks! Would you mind making a PR to fix it?

anita-smith1 commented 4 months ago

Thanks! Would you mind making a PR to fix it?

The working code is from ChatGPT; I don't know exactly why it works. I asked it why the app crashed and it explained the cause along with a solution. I think you should first check and confirm that it does not cause any other issues before making a PR. For example, in your recent video on Twitter (X), synthesis is very fast, but mine is a bit slow, so I'm not sure if that's due to the code. Thanks

csukuangfj commented 4 months ago

I just fixed it in the master branch.

I am using a small model in the video. How large is your model?

anita-smith1 commented 4 months ago

Okay, that's great. I hope you will soon fix the single-word pronunciation issue too. My model size is 145 MB.

csukuangfj commented 4 months ago

Okay, that's great. I hope you will soon fix the single-word pronunciation issue too. My model size is 145 MB.

If you can reduce it to ~70 MB, then it should be much faster.

Could you describe the issue of single word pronunciation?

anita-smith1 commented 4 months ago

Okay, but I'm not sure how to reduce it; coqui tts models generally have the same size. I've taken a second look, and it seems single-word pronunciation is better now, at least for English words. Once I have a good model I will do a video demo and share it with you. Thanks

aaronnewsome commented 4 months ago

I went down the rabbit hole of trying to use my trained-from-scratch onnx voice with sherpa-onnx. After a few hours, I just gave up. The docs point to this link for exporting the voice:

https://github.com/k2-fsa/sherpa-onnx/blob/master/scripts/vits/export-onnx-ljs.py

After my attempt to follow along with the instructions, I tried running the export-onnx-ljs.py script and it complained about not being able to load monotonic_align.monotonic_align, as follows:

/home/anewsome/src/sherpa-onnx/scripts/vits/export-onnx-ljs.py
Traceback (most recent call last):
  File "/home/anewsome/src/sherpa-onnx/scripts/vits/export-onnx-ljs.py", line 46, in <module>
    from models import SynthesizerTrn
  File "/home/anewsome/src/vits/models.py", line 10, in <module>
    import monotonic_align
  File "/home/anewsome/src/vits/monotonic_align/__init__.py", line 3, in <module>
    from .monotonic_align.core import maximum_path_c
ModuleNotFoundError: No module named 'monotonic_align.monotonic_align'

Has anyone been able to try sherpa-onnx with a piper voice? Are there some instructions that are a bit clearer on how to actually make it work? I'm interested in trying it, mostly because sherpa-onnx claims to run on platforms other than Linux on Intel, although I'd be ecstatic just to get it to work on Linux.

anita-smith1 commented 4 months ago

From your config.json:

    "characters_class": "TTS.tts.utils.text.characters.IPAPhonemes",

Unfortunately, we don't support models using IPAPhonemes; only Graphemes and VitsCharacters are supported from coqui-ai/tts.

Is there any chance you can bring back support for models using IPAPhonemes? Perhaps add it as an option so those who would like to use such models can do so. I have noticed that my fine-tuned model using IPAPhonemes has way better quality for non-English words (like names of people) than the version using VitsCharacters.

csukuangfj commented 4 months ago

/home/anewsome/src/sherpa-onnx/scripts/vits/export-onnx-ljs.py

@aaronnewsome the above script is not for converting piper models.

As I said before, you can download any piper model from https://github.com/k2-fsa/sherpa-onnx/releases/tag/tts-models and unzip it; you will find the conversion script inside. Any piper model is fine; the only requirement is that you download one.

If you still have issues, please post your errors here.

csukuangfj commented 4 months ago

@aaronnewsome

I just wrote a detailed, step-by-step, guide about how to convert a piper vits pre-trained model to sherpa-onnx for you. You can find it at https://k2-fsa.github.io/sherpa/onnx/tts/piper.html
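
At a high level, the conversion attaches the metadata that sherpa-onnx expects to the ONNX file and writes a tokens.txt. Below is a rough Python sketch of the metadata step only; the keys mirror the meta_data printed earlier in this thread, the values come from the voice's .onnx.json, and the real script in the guide is more complete:

import onnx

# Placeholder file name; use the piper voice you want to convert
model = onnx.load("en_US-amy-low.onnx")

meta = {
    "model_type": "vits",
    "comment": "piper",
    "language": "English",
    "voice": "en-us",
    "has_espeak": 1,
    "n_speakers": 1,
    "sample_rate": 22050,
}
for key, value in meta.items():
    entry = model.metadata_props.add()
    entry.key = key
    entry.value = str(value)

onnx.save(model, "model.onnx")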


csukuangfj commented 4 months ago

is there any chance you can bring back support for models using IPAPhonemes?

@anita-smith1

Sorry, it is not in the plan. The major difficulty is that the phonemizer used by IPAPhonemes is hard to port to C++.

As you know, you are training your model in Python, but if you want to deploy it, every part must be converted to C++, including the phonemizer.


All the VITS models from coqui-ai/tts are listed below.

# Graphemes
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--bg--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--cs--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--da--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--et--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--ga--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--es--css10--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--fr--css10--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--nl--css10--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--de--css10--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--hu--css10--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--fi--css10--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--hr--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--lt--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--lv--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--mt--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--pl--mai_female--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--pt--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--ro--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--sk--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--sl--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--sv--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.13.3_models/tts_models--bn--custom--vits_male.zip
# wget https://coqui.gateway.scarf.sh/v0.13.3_models/tts_models--bn--custom--vits_female.zip

# IPAPhonemes
# wget https://coqui.gateway.scarf.sh/v0.7.0_models/tts_models--de--thorsten--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--el--cv--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.10.1_models/tts_models--ca--custom--vits.zip

# VitsCharacters
# wget https://coqui.gateway.scarf.sh/v0.6.1_models/tts_models--it--mai_female--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.6.1_models/tts_models--it--mai_male--vits.zip
# wget https://coqui.gateway.scarf.sh/v0.6.2_models/tts_models--ewe--openbible--vits.zip # ewe
# wget https://coqui.gateway.scarf.sh/v0.6.2_models/tts_models--hau--openbible--vits.zip # hausa
# wget https://coqui.gateway.scarf.sh/v0.6.2_models/tts_models--lin--openbible--vits.zip # lingala
# wget https://coqui.gateway.scarf.sh/v0.6.2_models/tts_models--tw_akuapem--openbible--vits.zip # akuapem-twi
# wget https://coqui.gateway.scarf.sh/v0.6.2_models/tts_models--tw_asante--openbible--vits.zip # asante-twi
# wget https://coqui.gateway.scarf.sh/v0.6.2_models/tts_models--yor--openbible--vits.zip # yoruba

You can see that only 3 of them are using IPAPhonemes.

I suggest that you switch to

"characters_class": "TTS.tts.utils.text.characters.Graphemes",

or

"characters_class": "TTS.tts.models.vits.VitsCharacters",
csukuangfj commented 4 months ago

@anita-smith1

I have noticed that my fine-tuned model using IPAPhonemes has way better quality for non-English words (like names of people) than the version using VitsCharacters.

You can also use espeak-ng in coqui-ai/tts, though I find that only the English VITS models from coqui-ai/tts use espeak-ng.

aaronnewsome commented 4 months ago

@aaronnewsome

I just wrote a detailed, step-by-step, guide about how to convert a piper vits pre-trained model to sherpa-onnx for you. You can find it at https://k2-fsa.github.io/sherpa/onnx/tts/piper.html

Thank you @csukuangfj. I honestly don't think I stumbled across these instructions during the hours I spent trying to do the conversion. It was much easier with the instructions you created.

I was able to use the sherpa-onnx-offline-tts example to create a wav file with my custom voice trained from scratch. However, the quality was not very good at all: lots of words with strange pronunciations. The words were pronounced much more accurately by piper.

Also, the JSON file that piper preprocess created for me needed some changes for your script to run. The language and espeak keys didn't look the same as in the en_US-amy-medium.onnx.json file I compared it to. In en_US-amy-medium.onnx.json there is:

"espeak": {
    "voice": "en-us"
  }

and

"language": {
    "code": "en_US",
    "family": "en",
    "region": "US",
    "name_native": "English",
    "name_english": "English",
    "country_english": "United States"
  },

The JSON for my custom voice, trained from scratch, only had this for language:

"language": {
        "code": "en"
    },

and also just "en" for the espeak voice. This caused your example Python script to error out, so I adjusted the JSON manually. The JSON file for my onnx was created by piper preprocess, so maybe I used it wrong, which would explain why those fields are wrong/missing. I'll look into it some more.

anita-smith1 commented 4 months ago

@csukuangfj Please check if my configuration for fine-tuning a VITS model using coqui is okay. I am not getting intelligible sound after fine-tuning using VitsCharacters, even for English words/phrases. It seems I am doing something wrong:

code = """import os

from trainer import Trainer, TrainerArgs

from TTS.tts.configs.shared_configs import BaseDatasetConfig, CharactersConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits, VitsAudioConfig
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor
#output_path = os.path.dirname(os.path.abspath(__file__))
##########################################
#Change this to your dataset directory
##########################################
output_path = "/content/drive/MyDrive/"""
code = code + dataset_name + "/" + output_directory + "/" + "\""

code=code + """
dataset_config = BaseDatasetConfig(
##########################################
#Change this to your dataset directory
##########################################
    formatter="ljspeech", meta_file_train="metadata.csv", path=os.path.join(output_path, "/content/drive/MyDrive/"""
code = code + dataset_name
code=code + """")

)
audio_config = VitsAudioConfig(
    sample_rate=22050, win_length=1024, hop_length=256, num_mels=80, mel_fmin=0, mel_fmax=None
)
#i have added character config for sherpa onnx support
character_config = CharactersConfig (
     characters_class="TTS.tts.models.vits.VitsCharacters",
     pad="_",
     eos="",
     bos="",
     characters="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz",
     punctuations=';:,.!?¡¿—…"«»“” ',
     phonemes="ɑɐɒæɓʙβɔɕçɗɖðʤəɘɚɛɜɝɞɟʄɡɠɢʛɦɧħɥʜɨɪʝɭɬɫɮʟɱɯɰŋɳɲɴøɵɸθœɶʘɹɺɾɻʀʁɽʂʃʈʧʉʊʋⱱʌɣɤʍχʎʏʑʐʒʔʡʕʢǀǁǂǃˈˌːˑʼʴʰʱʲʷˠˤ˞↓↑→↗↘'̩'ᵻ"
)
config = VitsConfig(
    audio=audio_config,
    characters=character_config,
    run_name="vits_ljspeech_ly",
    batch_size=16,
    eval_batch_size=16,
    batch_group_size=5,
#    num_loader_workers=8,
    num_loader_workers=4,
    num_eval_loader_workers=4,
    run_eval=True,
    test_delay_epochs=-1,
    epochs=100000,
    save_step=1000,
    save_checkpoints=True,
    save_n_checkpoints=4,
    save_best_after=2000,
    text_cleaner="english_cleaners",
    use_phonemes=True,
    phoneme_language="en-us",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    compute_input_seq_cache=True,
    print_step=25,
    print_eval=True,
    mixed_precision=True,
    output_path=output_path,
    datasets=[dataset_config],
    cudnn_benchmark=False,
)
# INITIALIZE THE AUDIO PROCESSOR
# Audio processor is used for feature extraction and audio I/O.
# It mainly serves to the dataloader and the training loggers.
ap = AudioProcessor.init_from_config(config)

# INITIALIZE THE TOKENIZER
# Tokenizer is used to convert text to sequences of token IDs.
# config is updated with the default characters if not defined in the config.
tokenizer, config = TTSTokenizer.init_from_config(config)

# LOAD DATA SAMPLES
# Each sample is a list of ```[text, audio_file_path, speaker_name]```
# You can define your custom sample loader returning the list of samples.
# Or define your custom formatter and pass it to the `load_tts_samples`.
# Check `TTS.tts.datasets.load_tts_samples` for more details.
train_samples, eval_samples = load_tts_samples(
    dataset_config,
    eval_split=True,
    eval_split_max_size=config.eval_split_max_size,
    eval_split_size=config.eval_split_size,
)

# init model
model = Vits(config, ap, tokenizer, speaker_manager=None)

# init the trainer and 🚀
trainer = Trainer(
    TrainerArgs(),
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()
"""

I read this and it seems he fixed the issue by setting "use_phonemes=False", but I don't think that applies here.

csukuangfj commented 4 months ago

Sorry, I am not familiar with coqui-ai/tts. I suggest that you ask in the coqui-ai/tts repo.

anita-smith1 commented 4 months ago

Okay, no problem. I am switching from coqui to piper since I'm facing some issues.

anita-smith1 commented 4 months ago

I am currently training using "use_phonemes=False" (coqui tts) and it seems to be working so far. If it still doesn't work I will switch completely to piper. Piper has very good documentation.

anita-smith1 commented 4 months ago

So I managed to get both coqui tts and piper working, but I have decided to stick with piper because the model size is smaller than coqui tts, which reduces latency. Piper seems to have better pronunciation too.

@csukuangfj I am not sure if you need to update the script in the model zip file.

pip install piper-phonemize onnx onnxruntime==1.16.0 returns:

ERROR: Could not find a version that satisfies the requirement piper-phonemize (from versions: none)
ERROR: No matching distribution found for piper-phonemize

Changing the version to 1.16.1 doesn't work either.

So I changed it to pip install onnx onnxruntime.

Also, I had to manually change the json file to include:

"language": {
    "code": "en_US",
    "family": "en",
    "region": "US",
    "name_native": "English",
    "name_english": "English",
    "country_english": "United States"
  }

because the original export from piper only had

"language": {
    "code": "en-us"
}

Without this change, the Python script for exporting to sherpa-onnx will fail at:

"language": config["language"]["name_english"],

since there is no "name_english" key.
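
A small sketch of patching the exported .onnx.json before running the conversion script (the file name is a placeholder; the fields added are the ones mentioned above):

import json

path = "my-voice.onnx.json"  # placeholder: the JSON created by piper preprocess

with open(path, encoding="utf-8") as f:
    config = json.load(f)

# The export script reads config["language"]["name_english"], which a
# minimal piper export may not contain.
language = config.setdefault("language", {})
language.setdefault("code", "en_US")
language.setdefault("family", "en")
language.setdefault("region", "US")
language.setdefault("name_english", "English")

config.setdefault("espeak", {}).setdefault("voice", "en-us")

with open(path, "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)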

csukuangfj commented 4 months ago

Ah, OK, I meant standard TTS-engine API bindings. I may try to do it at some point in the future, to use this TTS as a standard Android TTS engine, for example with screen readers.

@beqabeqa473

I just added support for replacing the system TTS engine in https://github.com/k2-fsa/sherpa-onnx/pull/508

You can find a YouTube video at https://www.youtube.com/watch?v=33QYuVzDORA

nanaghartey commented 4 weeks ago

@csukuangfj when will Sherpa support coqui XTTS-v2 models?

csukuangfj commented 4 weeks ago

XTTS-v2

The model is larger than 1 GB and requires a GPU, I think.

We won't support it in k2-fsa/sherpa-onnx, which targets mainly embedded environments.

But we may support it in k2-fsa/sherpa, though we cannot say when it will be supported.

nanaghartey commented 1 week ago

@csukuangfj What about StyleTTS2 models, which have ElevenLabs-like human-sounding quality and PyTorch support? https://github.com/yl4579/StyleTTS2

csukuangfj commented 1 week ago

https://github.com/yl4579/StyleTTS2

Does it have onnx export support?

nanaghartey commented 1 week ago

https://github.com/yl4579/StyleTTS2

Does it have onnx export support?

Not at the moment