rhasspy / piper

A fast, local neural text to speech system
https://rhasspy.github.io/piper-samples/

Generating speech locally in the web browser #352

Open lukestanley opened 9 months ago

lukestanley commented 9 months ago

It would be awesome if Piper's TTS could generate the audio locally in the browser, e.g. on an old phone, but the dependency on ONNX and the eSpeak variant makes this tricky. Streaming audio to and from a server is often fine, but generating the audio locally could avoid the need to set up server infrastructure, and once cached it could be faster, more private, and work offline, without caring about network dead spots. It could be great for browser extensions too.

There is an eSpeak-ng "espeakng.js" demo here: https://www.readbeyond.it/espeakng/ With source here: https://github.com/espeak-ng/espeak-ng/tree/master/emscripten

Obviously it's not quite as magical as Piper, but I think it's exciting. I can happily hack stuff together with Python and Docker, but I'm out of my depth compiling things for different architectures, so after having a look I'm backing off for now. I thought I'd share what I learned in case others with the relevant skills are also interested:

eSpeak-ng and ONNX Runtime Web are compiled in different ways, but it turns out they both run in browsers via Emscripten.

For whatever it's worth, someone else has another way of building a subset here: https://github.com/ianmarmour/espeak-ng.js/tree/main

There are ONNX web runtimes too.

ONNX Runtime Web shares its parent project's really massive Python build helper script, but there is a quite helpful FAQ that indicates it has static builds, demonstrated with build info too: https://onnxruntime.ai/docs/build/web.html https://www.npmjs.com/package/onnxruntime-web
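
To illustrate, the npm package is easy to try; here is a minimal sketch of loading a model and running it with onnxruntime-web (the model URL and the "input"/"output" tensor names are placeholders, not piper's actual ones):

import * as ort from "onnxruntime-web";

async function run() {
  // Create an inference session from a model URL (placeholder name).
  const session = await ort.InferenceSession.create("model.onnx");
  // Build an input tensor; "input" and "output" are placeholder tensor names.
  const input = new ort.Tensor("float32", new Float32Array([1, 2, 3, 4]), [1, 4]);
  const results = await session.run({ input });
  console.log(results.output.data);
}

run();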

Footnote:

I did have a look at container2wasm for this too, but I couldn't quickly figure out how file input and output would work. I also looked at how Copy.sh's browser x86 emulator, v86, can run Arch with a working Docker implementation! With v86 there are examples of file input and output, but getting everything working for 32-bit x86 seemed too complicated to me, and might be a bit much compared to compiling with Emscripten properly, even if it could potentially be used for much more than cheekily running lots of arbitrary things in the browser.

P.S: awesome work @synesthesiam !

eschmidbauer commented 9 months ago

I believe piper can run in the browser using this. Looks like a patch is required in piper. I wonder if we can get that merged back into this repo so it's easier to build the latest.

lukestanley commented 9 months ago

Wow, I see they got it working; they have a demo here: https://piper.wide.video Amazing work @jozefchutka!

The file sizes shown in the dropdown are not correct, and the UI has lots of options for trying out models, perhaps more than needed, but it works!! It even worked in Chrome on Android! It ran fairly fast for me after downloading. When testing with the VCTK voice weights, I got a real-time factor of 0.79 on my PC in Firefox (faster at generating the audio than the length of the audio). The real-time factor was 1.1 on my Android phone in Chrome (a bit slower than the actual audio). If it could start playing as soon as it had "enough" of an audio buffer, that would probably be close to real time. I think that's amazing considering it's on-device and runs in all kinds of places.

There are lots of things that could be optimised. This could be made into a great frontend library, possibly a shim, or it might be useful directly for some specific kinds of webapps or extensions, such as TTS extensions or voice chat apps. It won't be as fast on a lot of old devices, but it's already close to working well enough for a lot of use cases.

Regarding getting the https://github.com/wide-video/piper change into this repo: I expect that with a bit of work a reasonable change could be made. I'm not well versed in C++, but it seems the exact change made in https://github.com/wide-video/piper/commit/a8e4c8702ef124a438dc96659904da52cc1aba27 would need to be modified so as not to break existing expected behaviour, and that's probably best done on top of the latest master.

In the "WASM friendly" fork, a new command-line argument "--input" was added . It's used to parse JSON directly from the command line. A new JSON object input is initialised instead of reading from JSON from stdin, parts of the code for parsing JSON line by line are commented out, but parts that deal with the found attributes, remain. I think to cleanly integrate it, a command like argument to input JSON without stdin, is a good idea, and to avoid repeating code, some of the common logic would probably need extracting out. @jozefchutka and @synesthesiam if you could weigh in on that, it'd be appreciated. Anyway, awesome work!

@eschmidbauer I have to wonder, how did you find it?

jozefchutka commented 9 months ago

It would be great to have piper compile smoothly to wasm. The last time I tried, it took many manual steps. Merging https://github.com/wide-video/piper/commit/a8e4c8702ef124a438dc96659904da52cc1aba27 is just the tip of the iceberg.

csukuangfj commented 9 months ago

I would like to share the news with you guys that you can run all of the models from piper with WebAssembly using sherpa-onnx, one of the subprojects of next-gen Kaldi.

We have created a huggingface space so that you can try it. The address is https://huggingface.co/spaces/k2-fsa/web-assembly-tts-sherpa-onnx-en

The above huggingface space uses the following model from piper: https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-piper-en_US-libritts_r-medium.tar.bz2


We also have a YouTube video to show you how to do that. https://www.youtube.com/watch?v=IcbbJBf01UI


Everything is open-sourced. If you want to know how web assembly is supported for piper, please see the following pull request: https://github.com/k2-fsa/sherpa-onnx/pull/577


There is one more thing to be improved:

FYI: In addition to running piper models with WebAssembly using sherpa-onnx, you can also run them on Android, iOS, Raspberry Pi, Linux, Windows, macOS, etc., with sherpa-onnx. All models from piper are supported by sherpa-onnx, and you can find the converted models at https://github.com/k2-fsa/sherpa-onnx/releases/tag/tts-models

csukuangfj commented 9 months ago

You can find the files for the above huggingface space at https://huggingface.co/spaces/k2-fsa/web-assembly-tts-sherpa-onnx-en/tree/main

You can see that the wasm module file is only 11.5 MB.

eschmidbauer commented 9 months ago

@csukuangfj This is great, thanks so much !

gyroing commented 9 months ago

You can find the files for the above huggingface space at https://huggingface.co/spaces/k2-fsa/web-assembly-tts-sherpa-onnx-en/tree/main

You can see that the wasm module file is only 11.5 MB.

@csukuangfj Superb job! But I wonder, is it possible to extract the voice model from the .data file and load it into the wasm worker separately (voice and tokens) during the init function in JavaScript, to make it possible to load different voices?
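
For illustration, something like this is what I have in mind, assuming the Emscripten module exports its FS object (which this build may not currently do); all names and paths here are hypothetical:

// Hypothetical sketch: fetch voice files at runtime and write them into the
// Emscripten virtual filesystem before calling init, instead of shipping
// them inside the preloaded .data file.
async function stageFile(module, url, virtualPath) {
  const bytes = new Uint8Array(await (await fetch(url)).arrayBuffer());
  module.FS.writeFile(virtualPath, bytes);
}

const module = await createModule(); // the Emscripten factory; name assumed
await stageFile(module, "voices/model.onnx", "/model.onnx");
await stageFile(module, "voices/tokens.txt", "/tokens.txt");
// ...then call the wasm init function that reads these paths.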

csukuangfj commented 8 months ago

@csukuangfj Superb job! But I wonder, is it possible to extract the voice model from the .data file and load it into the wasm worker separately (voice and tokens) during the init function in JavaScript, to make it possible to load different voices?

Sorry, I don't know whether it is possible. I am very new to WebAssembly (I've only been learning it for 3 days).

ken107 commented 7 months ago

https://piper.ttstool.com

Piper has been integrated into Read Aloud, and released as a separate extension as well.

The source code is here. Please help out if you can with some of the open issues.

jozefchutka commented 7 months ago

Following @ken107's work, I have updated https://piper.wide.video/ . Instead of the whole of piper being compiled into wasm, it is now a 2-step process:

  1. piper-phonemize as wasm (build steps) providing phonemeIds...
  2. ...consumed directly by onnxruntime

This already provides a 4-8x performance improvement when running on CPU.

Here is the simplest implementation: https://piper.wide.video/poc.html
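
For anyone skimming, the core of the flow boils down to roughly this (a sketch only; the exact piper-phonemize arguments, file names, and defaults may differ, so check the poc source):

import { createPiperPhonemize } from "./piper_phonemize.js";
import * as ort from "onnxruntime-web";

// Step 1: text -> phoneme IDs via the piper-phonemize wasm module, which
// prints a JSON object to stdout. Argument names approximate the poc.
const text = "Hello world";
const phonemeIds = await new Promise((resolve) => {
  createPiperPhonemize({
    print: (line) => resolve(JSON.parse(line).phoneme_ids),
  }).then((module) =>
    module.callMain(["-l", "en-us", "--input", JSON.stringify({ text }),
      "--espeak_data", "/espeak-ng-data"])
  );
});

// Step 2: phoneme IDs -> audio with onnxruntime-web. input/input_lengths/
// scales are the piper VITS model's inputs (single-speaker model assumed;
// multi-speaker models also need a "sid" tensor).
const session = await ort.InferenceSession.create("voice.onnx");
const feeds = {
  input: new ort.Tensor("int64", BigInt64Array.from(phonemeIds.map((id) => BigInt(id))), [1, phonemeIds.length]),
  input_lengths: new ort.Tensor("int64", BigInt64Array.from([BigInt(phonemeIds.length)]), [1]),
  scales: new ort.Tensor("float32", Float32Array.from([0.667, 1.0, 0.8]), [3]), // noise, length, noise_w
};
const { output } = await session.run(feeds);
// output.data is a Float32Array of PCM samples at the voice's sample rate.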

iSuslov commented 7 months ago

Sharing my Paste-n-Build solution based on @jozefchutka's research.

#!/bin/bash
BUILD_DIR=$(pwd)/build-piper

rm -rf $BUILD_DIR && mkdir $BUILD_DIR

TMP=$BUILD_DIR/.tmp
[ ! -d $TMP ] && mkdir $TMP
DOCKERFILE=$TMP/piper_wasm_compile.Dockerfile

cat <<EOF > $DOCKERFILE
FROM debian:stable-slim
RUN apt-get update && \
    apt-get install --yes --no-install-recommends \
    build-essential \
    cmake \
    ca-certificates \
    curl \
    pkg-config \
    git \
    autogen \
    automake \
    autoconf \
    libtool \
    python3 && ln -sf python3 /usr/bin/python
RUN git clone --depth 1 https://github.com/emscripten-core/emsdk.git /modules/emsdk
WORKDIR /modules/emsdk
RUN ./emsdk install 3.1.41 && \
    ./emsdk activate 3.1.41 && \
    rm -rf downloads
WORKDIR /wasm
ENTRYPOINT ["/bin/bash", "-c", "EMSDK_QUIET=1 source /modules/emsdk/emsdk_env.sh  && \"\$@\"", "-s"]
CMD ["/bin/bash"]
EOF

docker buildx build -t piper-wasm-compiler -q -f $DOCKERFILE .

cat <<EOF | docker run --rm -i -v $TMP:/wasm piper-wasm-compiler /bin/bash
[ ! -d espeak-ng ] && git clone --depth 1 https://github.com/rhasspy/espeak-ng.git
cd /wasm/espeak-ng
./autogen.sh
./configure
make

cd /wasm
[ ! -d piper-phonemize ] && git clone --depth 1 https://github.com/wide-video/piper-phonemize.git
cd piper-phonemize && git pull
emmake cmake -Bbuild -DCMAKE_INSTALL_PREFIX=install -DCMAKE_TOOLCHAIN_FILE=\$EMSDK/upstream/emscripten/cmake/Modules/Platform/Emscripten.cmake -DBUILD_TESTING=OFF -G "Unix Makefiles" -DCMAKE_CXX_FLAGS="-O3 -s INVOKE_RUN=0 -s MODULARIZE=1 -s EXPORT_NAME='createPiperPhonemize' -s EXPORTED_FUNCTIONS='[_main]' -s EXPORTED_RUNTIME_METHODS='[callMain, FS]' --preload-file /wasm/espeak-ng/espeak-ng-data@/espeak-ng-data"
emmake cmake --build build --config Release # fails on "Compile intonations / Permission denied", continue with next steps
sed -i 's+\$(MAKE) \$(MAKESILENT) -f CMakeFiles/data.dir/build.make CMakeFiles/data.dir/build+#\0+g' /wasm/piper-phonemize/build/e/src/espeak_ng_external-build/CMakeFiles/Makefile2
sed -i 's/using namespace std/\/\/\0/g' /wasm/piper-phonemize/build/e/src/espeak_ng_external/src/speechPlayer/src/speechWaveGenerator.cpp
emmake cmake --build build --config Release
EOF

cp $TMP/piper-phonemize/build/piper_phonemize.* $BUILD_DIR

rm -rf $TMP

This script will automatically build and copy piper_phonemize.data, piper_phonemize.wasm, and piper_phonemize.js into the ./build-piper folder.

Under the hood this script will:

  1. Build the smallest Docker image it can. Well, it's 1.5 GB instead of 1.9 GB.
  2. Build piper-phonemize.
  3. Create the ./build-piper folder and copy the wasm artifacts into it.
  4. Clean up all temp files.

HirCoir commented 7 months ago

https://github.com/HirCoir/HirCoir-Piper-tts-app

puppetm4st3r commented 6 months ago

@iSuslov can you provide a simple POC to test your work? I'm more of a backend person, but I need to implement this on a web page with the fewest dependencies (HTML+JS+wasm if possible, with no additional frameworks like Node.js). I'm a little lost about where to start.

Also @jozefchutka, if you can share the source code for your poc, it would be a good starting point for understanding these artifacts.

Best regards!

csukuangfj commented 6 months ago

but I need to implement this on a web page with the fewest dependencies (HTML+JS+wasm if possible, with no additional frameworks like Node.js)

@puppetm4st3r Do you want to try sherpa-onnx? It does exactly what you want: HTML + JS + wasm. There's no need for any other dependencies.

Doc: https://k2-fsa.github.io/sherpa/onnx/tts/wasm/index.html

huggingface space demo for wasm + tts: https://huggingface.co/spaces/k2-fsa/web-assembly-tts-sherpa-onnx-en

(Hint: You can copy the files from the huggingface space directly to your own project.)

puppetm4st3r commented 6 months ago

Thanks! I'm following the doc, but when I try to build the assets for a Spanish model I get this stack trace:

LLVM ERROR: Broken module found, compilation aborted!
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.  Program arguments: /home/dario/src/emsdk/upstream/bin/wasm-ld -o ../../bin/sherpa-onnx-wasm-main-tts.wasm CMakeFiles/sherpa-onnx-wasm-main-tts.dir/sherpa-onnx-wasm-main-tts.cc.o -L/home/dario/src/tts/sherpa-onnx/build-wasm-simd-tts/_deps/onnxruntime-src/lib ../../lib/libsherpa-onnx-c-api.a ../../lib/libsherpa-onnx-core.a ../../lib/libkaldi-native-fbank-core.a ../../lib/libkaldi-decoder-core.a ../../lib/libsherpa-onnx-kaldifst-core.a ../../_deps/onnxruntime-src/lib/libonnxruntime.a ../../lib/libpiper_phonemize.a ../../lib/libespeak-ng.a /home/dario/src/tts/sherpa-onnx/build-wasm-simd-tts/_deps/onnxruntime-src/lib/libonnxruntime.a -L/home/dario/src/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten ../../lib/libucd.a ../../lib/libsherpa-onnx-fstfar.a ../../lib/libsherpa-onnx-fst.a -lGL-getprocaddr -lal -lhtml5 -lstubs -lnoexit -lc -ldlmalloc -lcompiler_rt -lc++-noexcept -lc++abi-noexcept -lsockets -mllvm -combiner-global-alias-analysis=false -mllvm -enable-emscripten-sjlj -mllvm -disable-lsr /tmp/tmpwhke9p6slibemscripten_js_symbols.so --strip-debug --export=CopyHeap --export=malloc --export=free --export=MyPrint --export=SherpaOnnxCreateOfflineTts --export=SherpaOnnxDestroyOfflineTts --export=SherpaOnnxDestroyOfflineTtsGeneratedAudio --export=SherpaOnnxOfflineTtsGenerate --export=SherpaOnnxOfflineTtsGenerateWithCallback --export=SherpaOnnxOfflineTtsNumSpeakers --export=SherpaOnnxOfflineTtsSampleRate --export=SherpaOnnxWriteWave --export=_emscripten_stack_alloc --export=__get_temp_ret --export=__set_temp_ret --export=__wasm_call_ctors --export=emscripten_stack_get_current --export=_emscripten_stack_restore --export-if-defined=__start_em_asm --export-if-defined=__stop_em_asm --export-if-defined=__start_em_lib_deps --export-if-defined=__stop_em_lib_deps --export-if-defined=__start_em_js --export-if-defined=__stop_em_js --export-table -z stack-size=10485760 --max-memory=2147483648 --initial-memory=536870912 --no-entry --table-base=1 --global-base=1024
 #0 0x0000564a79ff0228 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/home/dario/src/emsdk/upstream/bin/wasm-ld+0xf86228)
 #1 0x0000564a79fed65e llvm::sys::RunSignalHandlers() (/home/dario/src/emsdk/upstream/bin/wasm-ld+0xf8365e)
 #2 0x0000564a79ff0e7f SignalHandler(int) Signals.cpp:0:0
 #3 0x00007722bf8f7520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
 #4 0x00007722bf94b9fc pthread_kill (/lib/x86_64-linux-gnu/libc.so.6+0x969fc)
 #5 0x00007722bf8f7476 gsignal (/lib/x86_64-linux-gnu/libc.so.6+0x42476)
 #6 0x00007722bf8dd7f3 abort (/lib/x86_64-linux-gnu/libc.so.6+0x287f3)
 #7 0x0000564a79f5e4c3 llvm::report_fatal_error(llvm::Twine const&, bool) (/home/dario/src/emsdk/upstream/bin/wasm-ld+0xef44c3)
 #8 0x0000564a79f5e306 (/home/dario/src/emsdk/upstream/bin/wasm-ld+0xef4306)
 #9 0x0000564a7ad5186e (/home/dario/src/emsdk/upstream/bin/wasm-ld+0x1ce786e)
#10 0x0000564a7c873a82 llvm::PassManager<llvm::Module, llvm::AnalysisManager<llvm::Module>>::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) (/home/dario/src/emsdk/upstream/bin/wasm-ld+0x3809a82)
#11 0x0000564a7ad4aae1 llvm::lto::opt(llvm::lto::Config const&, llvm::TargetMachine*, unsigned int, llvm::Module&, bool, llvm::ModuleSummaryIndex*, llvm::ModuleSummaryIndex const*, std::__2::vector<unsigned char, std::__2::allocator<unsigned char>> const&) (/home/dario/src/emsdk/upstream/bin/wasm-ld+0x1ce0ae1)
#12 0x0000564a7ad4ca42 llvm::lto::backend(llvm::lto::Config const&, std::__2::function<llvm::Expected<std::__2::unique_ptr<llvm::CachedFileStream, std::__2::default_delete<llvm::CachedFileStream>>> (unsigned int, llvm::Twine const&)>, unsigned int, llvm::Module&, llvm::ModuleSummaryIndex&) (/home/dario/src/emsdk/upstream/bin/wasm-ld+0x1ce2a42)
#13 0x0000564a7ad3c2aa llvm::lto::LTO::runRegularLTO(std::__2::function<llvm::Expected<std::__2::unique_ptr<llvm::CachedFileStream, std::__2::default_delete<llvm::CachedFileStream>>> (unsigned int, llvm::Twine const&)>) (/home/dario/src/emsdk/upstream/bin/wasm-ld+0x1cd22aa)
#14 0x0000564a7ad3b5c9 llvm::lto::LTO::run(std::__2::function<llvm::Expected<std::__2::unique_ptr<llvm::CachedFileStream, std::__2::default_delete<llvm::CachedFileStream>>> (unsigned int, llvm::Twine const&)>, std::__2::function<llvm::Expected<std::__2::function<llvm::Expected<std::__2::unique_ptr<llvm::CachedFileStream, std::__2::default_delete<llvm::CachedFileStream>>> (unsigned int, llvm::Twine const&)>> (unsigned int, llvm::StringRef, llvm::Twine const&)>) (/home/dario/src/emsdk/upstream/bin/wasm-ld+0x1cd15c9)
#15 0x0000564a7a3eca96 lld::wasm::BitcodeCompiler::compile() (/home/dario/src/emsdk/upstream/bin/wasm-ld+0x1382a96)
#16 0x0000564a7a3eea74 lld::wasm::SymbolTable::compileBitcodeFiles() (/home/dario/src/emsdk/upstream/bin/wasm-ld+0x1384a74)
#17 0x0000564a7a3d6135 lld::wasm::(anonymous namespace)::LinkerDriver::linkerMain(llvm::ArrayRef<char const*>) Driver.cpp:0:0
#18 0x0000564a7a3d1035 lld::wasm::link(llvm::ArrayRef<char const*>, llvm::raw_ostream&, llvm::raw_ostream&, bool, bool) (/home/dario/src/emsdk/upstream/bin/wasm-ld+0x1367035)
#19 0x0000564a79ff325e lld::unsafeLldMain(llvm::ArrayRef<char const*>, llvm::raw_ostream&, llvm::raw_ostream&, llvm::ArrayRef<lld::DriverDef>, bool) (/home/dario/src/emsdk/upstream/bin/wasm-ld+0xf8925e)
#20 0x0000564a79f37481 lld_main(int, char**, llvm::ToolContext const&) (/home/dario/src/emsdk/upstream/bin/wasm-ld+0xecd481)
#21 0x0000564a79f37e64 main (/home/dario/src/emsdk/upstream/bin/wasm-ld+0xecde64)
#22 0x00007722bf8ded90 (/lib/x86_64-linux-gnu/libc.so.6+0x29d90)
#23 0x00007722bf8dee40 __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x29e40)
#24 0x0000564a79eace2a _start (/home/dario/src/emsdk/upstream/bin/wasm-ld+0xe42e2a)
em++: error: '/home/dario/src/emsdk/upstream/bin/wasm-ld -o ../../bin/sherpa-onnx-wasm-main-tts.wasm CMakeFiles/sherpa-onnx-wasm-main-tts.dir/sherpa-onnx-wasm-main-tts.cc.o -L/home/dario/src/tts/sherpa-onnx/build-wasm-simd-tts/_deps/onnxruntime-src/lib ../../lib/libsherpa-onnx-c-api.a ../../lib/libsherpa-onnx-core.a ../../lib/libkaldi-native-fbank-core.a ../../lib/libkaldi-decoder-core.a ../../lib/libsherpa-onnx-kaldifst-core.a ../../_deps/onnxruntime-src/lib/libonnxruntime.a ../../lib/libpiper_phonemize.a ../../lib/libespeak-ng.a /home/dario/src/tts/sherpa-onnx/build-wasm-simd-tts/_deps/onnxruntime-src/lib/libonnxruntime.a -L/home/dario/src/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten ../../lib/libucd.a ../../lib/libsherpa-onnx-fstfar.a ../../lib/libsherpa-onnx-fst.a -lGL-getprocaddr -lal -lhtml5 -lstubs -lnoexit -lc -ldlmalloc -lcompiler_rt -lc++-noexcept -lc++abi-noexcept -lsockets -mllvm -combiner-global-alias-analysis=false -mllvm -enable-emscripten-sjlj -mllvm -disable-lsr /tmp/tmpwhke9p6slibemscripten_js_symbols.so --strip-debug --export=CopyHeap --export=malloc --export=free --export=MyPrint --export=SherpaOnnxCreateOfflineTts --export=SherpaOnnxDestroyOfflineTts --export=SherpaOnnxDestroyOfflineTtsGeneratedAudio --export=SherpaOnnxOfflineTtsGenerate --export=SherpaOnnxOfflineTtsGenerateWithCallback --export=SherpaOnnxOfflineTtsNumSpeakers --export=SherpaOnnxOfflineTtsSampleRate --export=SherpaOnnxWriteWave --export=_emscripten_stack_alloc --export=__get_temp_ret --export=__set_temp_ret --export=__wasm_call_ctors --export=emscripten_stack_get_current --export=_emscripten_stack_restore --export-if-defined=__start_em_asm --export-if-defined=__stop_em_asm --export-if-defined=__start_em_lib_deps --export-if-defined=__stop_em_lib_deps --export-if-defined=__start_em_js --export-if-defined=__stop_em_js --export-table -z stack-size=10485760 --max-memory=2147483648 --initial-memory=536870912 --no-entry --table-base=1 --global-base=1024' failed (received SIGABRT (-6))
make[2]: *** [wasm/tts/CMakeFiles/sherpa-onnx-wasm-main-tts.dir/build.make:111: bin/sherpa-onnx-wasm-main-tts.js] Error 1
make[1]: *** [CMakeFiles/Makefile2:1281: wasm/tts/CMakeFiles/sherpa-onnx-wasm-main-tts.dir/all] Error 2
make: *** [Makefile:156: all] Error 2

I installed cmake with apt-get, and then with pip. Both got me that stack trace... Do you know what could have happened?

The model selected was: https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-piper-es_MX-claude-high.tar.bz2

Best regards!

csukuangfj commented 6 months ago

How much RAM does your computer have?

Could you try make -j1?

@puppetm4st3r

puppetm4st3r commented 6 months ago

64 GB, at least 90% free. Will try!

csukuangfj commented 6 months ago

64 GB, at least 90% free. Will try!

Does it work now?

puppetm4st3r commented 6 months ago

yes! thanks!

iSuslov commented 6 months ago

@iSuslov can you provide a simple POC to test your work? I'm more of a backend person, but I need to implement this on a web page with the fewest dependencies (HTML+JS+wasm if possible, with no additional frameworks like Node.js). I'm a little lost about where to start.

Hey @puppetm4st3r, I see your issue is resolved, but in case my script seems confusing I would like to clarify:

  1. Open bash terminal. Go to any folder.
  2. Copy-paste script.

The script will download and compile everything it needs, producing a wasm build in the same folder. Docker must be preinstalled.

puppetm4st3r commented 6 months ago

Thanks! Now I've got another issue: when I compile with your script, @iSuslov, it works like a charm in a desktop web browser, but it did not work on iOS, failing with an OOM error. When I tried the other solution from @csukuangfj it worked on the iPhone, but I can't get it running for Spanish models with @csukuangfj's method. I'm stuck :(

csukuangfj commented 6 months ago

but I can't get it running for Spanish models with @csukuangfj's method.

Could you describe in detail why you cannot run it?

puppetm4st3r commented 6 months ago

When I tried your advice it ultimately didn't work; it was a false positive, my mistake: the cache wouldn't refresh and I was actually testing the solution from @iSuslov. It still gives me the stack trace that I attached here. But if I clone your sample code with the English model, it works (with no build process, just the sample code with the wasm binaries). I tried compiling inside a clean Docker container and outside Docker on my machine; neither worked.

The script from @iSuslov works, but when I tried it on iOS it crashed with OOM; your sample from the HF space works on iOS without problems.

csukuangfj commented 6 months ago

I tried compiling inside a clean Docker container and outside Docker on my machine; neither worked.

It would be great if you could post error logs. Otherwise, we don't know what you mean when you say it didn't work. @puppetm4st3r

iSuslov commented 6 months ago

@puppetm4st3r just out of curiosity, when you say you're testing it on iOS, do you mean you test it in Safari on an iPhone? I've never faced any OOM issues with wasm. Maybe there is an issue in how this script is loaded.

puppetm4st3r commented 6 months ago

I tried on iOS (iPhone, Safari and Chrome), but I realized that it is not the wasm. For some very strange reason, if I test from my device using my private network address 192.168.x. everything works fine, as I just discovered; however, when accessing the same device via the router by public IP, it fails with an OOM error, which makes no sense. I will remotely debug the iPhone and bring you the logs and evidence, to leave the case documented in case it is of use to someone. I hope I can solve it, now that I know that it is apparently an infrastructure problem...

puppetm4st3r commented 6 months ago

@csukuangfj I will post the logs later (they are very long); maybe I will upload them to Drive or something...

puppetm4st3r commented 6 months ago

@iSuslov additionally, I have tested on Android and it works fine; the problem is with iOS when exposing the service through the cloud, so I think it is a problem with the infra. But I still can't build with the guide from @csukuangfj (I still have to attach the logs of the build process).

k9p5 commented 4 months ago

For those who are looking for a ready-to-use solution, I have compiled all the knowledge shared in this thread into this library: https://github.com/diffusion-studio/vits-web

Thanks to everyone here for the awesome solutions and code snippets!

guest271314 commented 3 months ago

Re "in the web browser" is tricky because we have to find someway to load these voice files for each time the voice is used, on each origin the voice is used.

There is Native Messaging, where we can run/control/communicate to and from native applications from the browser.

This native-messaging-espeak-ng is one variation of what I've been doing with eSpeak-NG for years now, mainly because I wanted to support SSML input (see SSMLParser), which I don't see mentioned here at all.

What this (using Native Messaging) means is that we don't have to compile anything to WASM. We can use piper as-is, send input to piper, and send the output to the browser.
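
The browser side of a Native Messaging setup is tiny; here is a minimal sketch (the host name and message shape are illustrative, see the repo above for the real protocol, and the extension needs the "nativeMessaging" permission):

// Connect to a registered native host from an extension's background script.
const port = chrome.runtime.connectNative("native_messaging_espeakng");

port.onMessage.addListener((message) => {
  // e.g. synthesized audio chunks, or a completion signal, from the host
  console.log("from host:", message);
});

port.onDisconnect.addListener(() => {
  console.log("host disconnected:", chrome.runtime.lastError?.message);
});

// Send text (or SSML) to the native host, which runs the TTS engine locally.
port.postMessage({ input: "<speak>Hello from the browser.</speak>" });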

guest271314 commented 3 months ago

An option for using piper and onnx voices in the browser is through Speech Dispatcher, which Chromium-based browsers (Chrome, Brave, Opera, Edge) and Firefox use for the Web Speech API.

I have added the piper module to Speech Dispatcher following the instructions here module request: piper #866.

Tested on Chromium Version 128.0.6586.0 (Developer Build) (64-bit) and Firefox Nightly 130.0a1. Chromium works. Firefox does not load the piper voices.

In pertinent part:

Download the piper executable from releases, extract the contents, and save them to ~/.local/opt/piper.

Download a couple of .onnx files and save them to ~/.local/share/piper/voices.

Create a symbolic link to the piper executable in ~/.local/bin: ln -s ~/.local/opt/piper/piper piper.

Install python3-speechd

sudo apt install python3-speechd
spd-conf -u

Modify ~/.config/speech-dispatcher/speechd.conf to add the piper module

 AddModule "piper"                     "sd_generic"   "piper.conf"

or set piper as the default module

 DefaultModule piper
 # DefaultModule espeak-ng

Create ~/.config/speech-dispatcher/modules/piper.conf

Debug 0

GenericExecuteSynth "printf %s \'$DATA\' | /home/xubuntu/.local/bin/piper --length_scale 1 --sentence_silence 0  --model ~/.local/share/piper/voices/$VOICE --output-raw | aplay -r 22050 -f S16_LE -t raw -"

# only use medium quality voices to respect the 22050 rate for aplay in the command above.

GenericCmdDependency "piper"
GenericCmdDependency "aplay"
GenericCmdDependency "printf"
GenericSoundIconFolder "/usr/share/sounds/sound-icons/"

GenericPunctNone ""
GenericPunctSome "--punct=\"()<>[]{}\""
GenericPunctMost "--punct=\"()[]{};:\""
GenericPunctAll "--punct"

#GenericStripPunctChars  ""

GenericLanguage  "en" "en_US" "utf-8"

AddVoice        "en"    "MALE1"         "en_US-hfc_male-medium.onnx"
AddVoice        "en"    "FEMALE1"       "en_US-hfc_female-medium.onnx"

DefaultVoice    "en_US-hfc_male-medium.onnx"

#GenericRateForceInteger 1
#GenericRateAdd 1
#GenericRateMultiply 100

Restart speech-dispatcher with speech-dispatcher restart.

Terminate and restart Chrome: killall -9 chrome.

Open DevTools and test in the console:

var voices = speechSynthesis.getVoices().filter(({name}) => name.includes("piper"));
var u = new SpeechSynthesisUtterance();
u.voice = voices[0];
u.text = "Test, test, test. Test to the point it breaks.";
speechSynthesis.speak(u);

console.log(JSON.stringify(voices.map(({default:_default, lang, localService, name, voiceURI}) => ({_default, lang, localService, name, voiceURI})), null, 2));

[
  {
    "_default": false,
    "lang": "en",
    "localService": true,
    "name": "en_US-hfc_female-medium.onnx piper",
    "voiceURI": "en_US-hfc_female-medium.onnx piper"
  },
  {
    "_default": false,
    "lang": "en",
    "localService": true,
    "name": "en_US-hfc_male-medium.onnx piper",
    "voiceURI": "en_US-hfc_male-medium.onnx piper"
  }
]

C-Loftus commented 1 week ago

Has anyone here managed to get GPU inference working in the browser? Seems like this could provide massive speedups and would be especially useful for long form content like video narrations or audiobook generation.

From what I can see, the current packages for piper in the browser are as follows, but neither supports GPU inference.

Wanted to raise this here as a central spot so work is not duplicated. Would transformers.js be usable since piper is an onnx model? Or do we need something else?

I am looking to create a web-based audiobook generation program similar to my CLI project QuickPiperAudiobook. Feel free to reach out if anyone is working on similar things / wants to hack on things together.
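
For reference, onnxruntime-web does ship a WebGPU execution provider, so one path might be as simple as requesting it when creating the session (an untested sketch; whether piper's VITS graphs actually run, or run faster, on it is exactly the open question):

// Untested sketch: request the WebGPU execution provider, falling back to
// the wasm (CPU) provider. "voice.onnx" is a placeholder.
import * as ort from "onnxruntime-web/webgpu";

const session = await ort.InferenceSession.create("voice.onnx", {
  executionProviders: ["webgpu", "wasm"],
});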

guest271314 commented 1 week ago

The vits-web version is slow. You have to load the WebAssembly module and the voices; some voices are 60 MB. Here's a fork of vits-web that you can test online for yourself: https://guest271314.github.io/vits-web/.

I created a Native Messaging host to control the execution of piper from the browser, with the output_raw PCM stream sent to an arbitrary Web page, then written to a MediaStreamTrack so the TTS output can be shared with any peer in the world that has WebRTC implemented. I also wrote a Web Audio API version that uses AudioWorklet for real-time playback of the raw PCM stream from piper; see https://github.com/guest271314/native-messaging-piper: background.js for the MediaStreamTrackGenerator version, and background-aw.js for the AudioWorklet version.
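
For the curious, the MediaStreamTrackGenerator path looks roughly like this (a simplified, Chromium-only sketch with hypothetical chunk handling; see the repo for the working code):

// Wrap each raw PCM chunk from piper (s16le, 22050 Hz, mono) in an AudioData
// and write it to the generator's track.
const generator = new MediaStreamTrackGenerator({ kind: "audio" });
const writer = generator.writable.getWriter();
const sampleRate = 22050;
let timestamp = 0; // microseconds

async function writeChunk(int16Chunk) {
  await writer.write(new AudioData({
    format: "s16",
    sampleRate,
    numberOfFrames: int16Chunk.length, // mono: one frame per sample
    numberOfChannels: 1,
    timestamp,
    data: int16Chunk,
  }));
  timestamp += (int16Chunk.length / sampleRate) * 1e6;
}

// The resulting MediaStream can be played locally or added to an
// RTCPeerConnection for WebRTC.
const stream = new MediaStream([generator]);
const audio = new Audio();
audio.srcObject = stream;
audio.play();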

guest271314 commented 1 week ago

@C-Loftus Ideally we compile piper to a WASM file, including the option to pass output_raw, so we can actually stream from WebAssembly/WASI, without using Emscripten, so we don't have to deal with loading Workers, and the same code can run in the browser and using wasmtime.

C-Loftus commented 1 week ago

Thanks for your work and context on that, @guest271314! The native messaging work is very cool. I think I was hoping to have it run entirely in the browser with the GPU, with no need to install anything on the host.

At least for my use case, I am fine loading the voice every time it is used (I don't need real-time speed).

Ideally we compile piper to a WASM file, including the option to pass output_raw

Isn't this WASM compilation already done at https://github.com/diffusionstudio/piper-wasm ? Don't we just need an integration from transformers.js or wonnx?

I am not as familiar with some of the lower-level browser APIs, so sorry if I am missing the connection between them and WebGPU that you are trying to point out.

guest271314 commented 1 week ago

At least for my use case, I am fine loading the voice every time it is used (I don't need real-time speed).

Then you should be able to use the fork and/or the main vits-web code.

I think I was hoping to have it run entirely in the browser with the GPU, with no need to install anything on the host.

The example runs in the browser.

Isn't this WASM compilation already done at https://github.com/diffusionstudio/piper-wasm ? Don't we just need an integration from transformers.js or wonnx?

If you look at the source code of the GitHub Pages example, the Emscripten-generated code is JavaScript, not .wasm. onnxruntime-web is used.

https://github.com/guest271314/vits-web/blob/patch-1/docs/index.js#L1-L2

import { createPiperPhonemize } from "./piper.js";
import * as ort from "./onnx-runtimeweb.js";

Ideally we just use the global WebAssembly object itself, with piper defined in its entirety in a single .wasm file. At least that's how I see it.

Something like this:

// https://www.webassemblyman.com/webassembly_wat_hello_world.html
// https://gist.github.com/cure53/f4581cee76d2445d8bd91f03d4fa7d3b

// A complete wasm module inlined as bytes: it exports a memory page
// ("pagememory") whose data segment holds "Hello World!", plus a
// "helloworld" function that calls the imported env.jsprint.
const wasm = new Uint8Array([0, 97, 115, 109, 1, 0, 0, 0, 1, 8, 2, 96, 1, 127, 0, 96, 0, 0, 2, 15, 1, 3, 101, 110, 118, 7, 106, 115, 112, 114, 105, 110, 116, 0, 0, 3, 2, 1, 1, 5, 3, 1, 0, 1, 7, 27, 2, 10, 112, 97, 103, 101, 109, 101, 109, 111, 114, 121, 2, 0, 10, 104, 101, 108, 108, 111, 119, 111, 114, 108, 100, 0, 1, 10, 8, 1, 6, 0, 65, 0, 16, 0, 11, 11, 19, 1, 0, 65, 0, 11, 13, 72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100, 33, 0]);
class Go {
  constructor() {
    this.importObject = {
      env: {
        // Called from inside the wasm module; decodes the string out of the
        // exported linear memory (byteOffset is unused in this demo).
        jsprint: function jsprint(byteOffset) {
          console.log(new TextDecoder().decode(new Uint8Array(memory.buffer).filter(Boolean)));
        },
      },
    };
  }
  run(_instance) {
    // Expose the module's exports for the calls below.
    globalThis.memory = _instance.exports.pagememory;
    globalThis.helloworld = _instance.exports.helloworld;
  }
}
const go = new Go();
const {instance} = await WebAssembly.instantiateStreaming(fetch(URL.createObjectURL(new Blob([wasm],{
  type: 'application/wasm',
}))), go.importObject);
go.run(instance);
helloworld();

where instead of helloworld() we call piper(). No extra runtime stuff; just the universal WebAssembly executable. That's one of the ideas that led to WebAssembly, from my understanding.

Native Messaging works for me. I don't have an issue executing code on my own machine from the browser.