lukestanley opened 9 months ago
Wow I see they got it working, they have a demo here: https://piper.wide.video Amazing work @jozefchutka!
The file sizes shown in the dropdown are not correct, and the UI has lots of options to try out models, perhaps more than needed, but it works!! It even worked in Chrome on Android! It ran fairly fast for me after downloading. When testing with the VCTK voice weights, I got a real-time factor of 0.79 on my PC in Firefox (faster at generating the audio than the length of the audio). The real-time factor was 1.1 on my Android phone in Chrome (a bit slower than the actual audio). If it could start playing as soon as it had "enough" of an audio buffer, that would probably be close to real time. I think that's amazing considering it's on-device and runs in all kinds of places.

There are lots of things that could be optimised. This could be made into a great frontend library, possibly a shim, or it might be useful directly for some specific kinds of webapps or extensions, such as TTS extensions or voice chat apps. It won't be as fast on a lot of old devices, but it's already close to working well enough for many use cases.

Regarding getting the https://github.com/wide-video/piper change into this repo, I expect that with a bit of work a reasonable change could be made. I'm not well versed in C++, but it seems the exact change made in https://github.com/wide-video/piper/commit/a8e4c8702ef124a438dc96659904da52cc1aba27 would need to be modified so as not to break existing expected behaviour, and that's probably best done on top of latest master.
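To sketch the buffered-playback idea (a hedged illustration, not code from the demo; `synthesizeChunks` is a hypothetical async generator yielding Float32Array PCM), the Web Audio API can schedule chunks back to back as they arrive:

```js
// Hedged sketch: start playing once a small buffer exists instead of
// waiting for the whole utterance. The chunk source is hypothetical.
const sampleRate = 22050;
const ctx = new AudioContext({ sampleRate });
let playhead = ctx.currentTime + 0.5; // ~0.5 s of initial buffering

async function playStreaming(synthesizeChunks) {
  for await (const pcm of synthesizeChunks()) { // Float32Array chunks
    const buf = ctx.createBuffer(1, pcm.length, sampleRate);
    buf.copyToChannel(pcm, 0);
    const src = ctx.createBufferSource();
    src.buffer = buf;
    src.connect(ctx.destination);
    src.start(playhead); // schedule each chunk back to back
    playhead += buf.duration;
  }
}
```

With a real-time factor below 1.0, generation stays ahead of the playhead, so the output sounds continuous.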
In the "WASM friendly" fork, a new command-line argument "--input" was added . It's used to parse JSON directly from the command line. A new JSON object input is initialised instead of reading from JSON from stdin, parts of the code for parsing JSON line by line are commented out, but parts that deal with the found attributes, remain. I think to cleanly integrate it, a command like argument to input JSON without stdin, is a good idea, and to avoid repeating code, some of the common logic would probably need extracting out. @jozefchutka and @synesthesiam if you could weigh in on that, it'd be appreciated. Anyway, awesome work!
@eschmidbauer I have to wonder, how did you find it?
It would be great to have piper compile smoothly into wasm. The last time I tried, it took many manual steps to do so. Merging https://github.com/wide-video/piper/commit/a8e4c8702ef124a438dc96659904da52cc1aba27 is just the tip of the iceberg.
I would like to share the news with you guys that you can run all of the models from piper with WebAssembly using sherpa-onnx, one of the subprojects of the next-gen Kaldi.
We have created a huggingface space so that you can try it. The address is https://huggingface.co/spaces/k2-fsa/web-assembly-tts-sherpa-onnx-en
The above huggingface space uses the following model from piper: https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-piper-en_US-libritts_r-medium.tar.bz2
We also have a YouTube video to show you how to do that. https://www.youtube.com/watch?v=IcbbJBf01UI
Everything is open-sourced. If you want to know how web assembly is supported for piper, please see the following pull request: https://github.com/k2-fsa/sherpa-onnx/pull/577
There is one more thing to be improved:
FYI: In addition to running piper models with WebAssembly using sherpa-onnx, you can also run them on Android, iOS, Raspberry Pi, Linux, Windows, macOS, etc., with sherpa-onnx. All models from piper are supported by sherpa-onnx and you can find the converted models at https://github.com/k2-fsa/sherpa-onnx/releases/tag/tts-models
You can find the files for the above huggingface space at https://huggingface.co/spaces/k2-fsa/web-assembly-tts-sherpa-onnx-en/tree/main
You can see that the wasm module file is only 11.5 MB.
@csukuangfj This is great, thanks so much!
@csukuangfj Superb job! But I wonder, is it possible to extract the voice model from the .data file and load it into the wasm worker separately (voice and tokens) during the init function in JavaScript, to make it possible to load different voices?
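For illustration, a hedged sketch of the kind of runtime loading meant here, assuming the Emscripten FS runtime methods are exported (the paths and the `Module` handle are assumptions):

```js
// Hedged sketch: fetch voice files at runtime and write them into the
// Emscripten virtual filesystem, instead of baking them into the .data file.
async function loadVoice(Module, url, path) {
  const bytes = new Uint8Array(await (await fetch(url)).arrayBuffer());
  Module.FS.writeFile(path, bytes); // requires FS in EXPORTED_RUNTIME_METHODS
}

// Given an initialized wasm module instance `Module` (assumption):
await loadVoice(Module, "voices/es_ES-model.onnx", "/model.onnx");
await loadVoice(Module, "voices/tokens.txt", "/tokens.txt");
// ...then point the init function at /model.onnx and /tokens.txt.
```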
Sorry that I don't know whether it is possible. I am very new to WebAssembly (only learned it for 3 days).
Piper has been integrated into Read Aloud, and released as a separate extension as well.
The source code is here. Please help out if you can with some of the open issues.
Following @ken107's work, I have updated https://piper.wide.video/. Instead of the whole of piper being compiled into wasm, it is now a 2-step process: phonemization runs in a small piper-phonemize wasm module, and inference runs in onnxruntime-web.

This already provides 4-8x improved performance when running on CPU.
Here is the simplest implementation https://piper.wide.video/poc.html
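For anyone curious, the 2-step flow amounts to something like this hedged sketch (the tensor names `input`, `input_lengths`, `scales` and the output name `output` follow common piper VITS exports and are assumptions here; the phoneme ids come from the piper-phonemize wasm step):

```js
// Hedged sketch of step 2: run the piper VITS .onnx model on phoneme ids
// with onnxruntime-web, getting Float32 PCM back for Web Audio playback.
import * as ort from "onnxruntime-web";

async function synthesize(phonemeIds, modelUrl) {
  const session = await ort.InferenceSession.create(modelUrl);
  const ids = BigInt64Array.from(phonemeIds, (i) => BigInt(i));
  const feeds = {
    input: new ort.Tensor("int64", ids, [1, ids.length]),
    input_lengths: new ort.Tensor("int64", BigInt64Array.from([BigInt(ids.length)]), [1]),
    // noise_scale, length_scale, noise_w (typical piper defaults, assumed)
    scales: new ort.Tensor("float32", Float32Array.from([0.667, 1.0, 0.8]), [3]),
  };
  const results = await session.run(feeds);
  return results.output.data; // Float32Array PCM
}
```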
Sharing my Paste-n-Build solution based on @jozefchutka's research.
```bash
#!/bin/bash
BUILD_DIR=$(pwd)/build-piper
rm -rf $BUILD_DIR && mkdir $BUILD_DIR
TMP=$BUILD_DIR/.tmp
[ ! -d $TMP ] && mkdir $TMP
DOCKERFILE=$TMP/piper_wasm_compile.Dockerfile
cat <<EOF > $DOCKERFILE
FROM debian:stable-slim
RUN apt-get update && \
apt-get install --yes --no-install-recommends \
build-essential \
cmake \
ca-certificates \
curl \
pkg-config \
git \
autogen \
automake \
autoconf \
libtool \
python3 && ln -sf python3 /usr/bin/python
RUN git clone --depth 1 https://github.com/emscripten-core/emsdk.git /modules/emsdk
WORKDIR /modules/emsdk
RUN ./emsdk install 3.1.41 && \
./emsdk activate 3.1.41 && \
rm -rf downloads
WORKDIR /wasm
ENTRYPOINT ["/bin/bash", "-c", "EMSDK_QUIET=1 source /modules/emsdk/emsdk_env.sh && \"\$@\"", "-s"]
CMD ["/bin/bash"]
EOF
docker buildx build -t piper-wasm-compiler -q -f $DOCKERFILE .
cat <<EOF | docker run --rm -i -v $TMP:/wasm piper-wasm-compiler /bin/bash
[ ! -d espeak-ng ] && git clone --depth 1 https://github.com/rhasspy/espeak-ng.git
cd /wasm/espeak-ng
./autogen.sh
./configure
make
cd /wasm
[ ! -d piper-phonemize ] && git clone --depth 1 https://github.com/wide-video/piper-phonemize.git
cd piper-phonemize && git pull
emmake cmake -Bbuild -DCMAKE_INSTALL_PREFIX=install -DCMAKE_TOOLCHAIN_FILE=\$EMSDK/upstream/emscripten/cmake/Modules/Platform/Emscripten.cmake -DBUILD_TESTING=OFF -G "Unix Makefiles" -DCMAKE_CXX_FLAGS="-O3 -s INVOKE_RUN=0 -s MODULARIZE=1 -s EXPORT_NAME='createPiperPhonemize' -s EXPORTED_FUNCTIONS='[_main]' -s EXPORTED_RUNTIME_METHODS='[callMain, FS]' --preload-file /wasm/espeak-ng/espeak-ng-data@/espeak-ng-data"
emmake cmake --build build --config Release # fails on "Compile intonations / Permission denied", continue with next steps
sed -i 's+\$(MAKE) \$(MAKESILENT) -f CMakeFiles/data.dir/build.make CMakeFiles/data.dir/build+#\0+g' /wasm/piper-phonemize/build/e/src/espeak_ng_external-build/CMakeFiles/Makefile2
sed -i 's/using namespace std/\/\/\0/g' /wasm/piper-phonemize/build/e/src/espeak_ng_external/src/speechPlayer/src/speechWaveGenerator.cpp
emmake cmake --build build --config Release
EOF
cp $TMP/piper-phonemize/build/piper_phonemize.* $BUILD_DIR
rm -rf $TMP
```
This script will automatically build and copy piper_phonemize.data, piper_phonemize.wasm, and piper_phonemize.js into the ./build-piper folder.

Under the hood, this script builds a Docker image with the Emscripten SDK, builds espeak-ng (for its espeak-ng-data), compiles the wide-video piper-phonemize fork to wasm with that data preloaded, then creates the ./build-piper
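A hedged sketch of loading the emitted module in a page (createPiperPhonemize, callMain, and FS match the build flags above; the callMain arguments are illustrative only and depend on the piper_phonemize CLI):

```js
// Hedged sketch: piper_phonemize.js defines createPiperPhonemize, and the
// build exports callMain/FS with INVOKE_RUN=0, so main runs on demand.
const mod = await createPiperPhonemize({
  print: (line) => console.log("stdout:", line),
  printErr: (line) => console.error(line),
  locateFile: (f) => `build-piper/${f}`, // find .wasm/.data next to the JS
});
// espeak-ng-data is preloaded at /espeak-ng-data (see --preload-file above);
// the argument names below are assumptions, not the verified CLI:
mod.callMain(["--espeak_data", "/espeak-ng-data"]);
```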
folder and copies the wasm artifacts into it.

It would be awesome if Piper's awesome TTS could generate the audio locally in the browser, e.g. on an old phone, but the dependency on ONNX and the eSpeak variant makes this tricky. Streaming audio to and from a server is often fine, but generating the audio locally could avoid needing to set up server infrastructure, and once cached it could be faster, more private, and work offline, without caring about network dead spots. It could be great for browser extensions too.
There is an eSpeak-ng "espeakng.js" demo here: https://www.readbeyond.it/espeakng/ With source here: https://github.com/espeak-ng/espeak-ng/tree/master/emscripten
Obviously it's not quite as magical as Piper, but I think it's exciting. I can happily hack stuff together with Python and Docker, but I'm out of my depth compiling stuff for different architectures, so after having a look I'm backing off for now. I thought I'd share what I learned in case others with the relevant skills are also interested:
Both eSpeak-ng and ONNX Runtime Web have different ways of being compiled, but it turns out they both run in browsers via Emscripten.
For whatever it's worth, someone else has another way of building a subset here: https://github.com/ianmarmour/espeak-ng.js/tree/main
There are ONNX web runtimes too.
ONNX Runtime Web shares its parent project's really massive Python build helper script, but there is a quite helpful FAQ that indicates it has static builds, demonstrated with build info too: https://onnxruntime.ai/docs/build/web.html https://www.npmjs.com/package/onnxruntime-web
Footnote:
I did have a look at container2wasm for this too, but I couldn't quickly figure out how input and output of files would work. I also looked at how Copy.sh's browser x86 emulator, v86, can run Arch with a successfully working Docker implementation! With v86 there are examples of doing input and output with files, but getting everything working on a 32-bit x86 architecture seemed too complicated to me, and might be a bit much compared to compiling with Emscripten properly, even if it would potentially be usable for much more than cheekily running lots of arbitrary things in the browser.
P.S: awesome work @synesthesiam !
@iSuslov can you provide a simple POC to test your work? I'm more of a backend person, but I need to implement this on the web with the fewest dependencies (HTML + JS + wasm if possible, with no additional frameworks like Node.js); I'm a little lost on where to start.
Also @jozefchutka, if you could share the source code for your POC, it would be a good starting point for understanding these artifacts.
Best regards!
> but I need to implement this on the web with the fewest dependencies (HTML + JS + wasm if possible, with no additional frameworks like Node.js)
@puppetm4st3r Do you want to try sherpa-onnx? It does exactly what you wish: HTML + JS + wasm. There's no need for any other dependencies.
Doc: https://k2-fsa.github.io/sherpa/onnx/tts/wasm/index.html
huggingface space demo for wasm + tts: https://huggingface.co/spaces/k2-fsa/web-assembly-tts-sherpa-onnx-en
(Hint: You can copy the files from the huggingface space directly to your own project.)
Thanks! I'm following the doc, but when I try to build the assets for a Spanish model I get this stack trace:
```
LLVM ERROR: Broken module found, compilation aborted!
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0. Program arguments: /home/dario/src/emsdk/upstream/bin/wasm-ld -o ../../bin/sherpa-onnx-wasm-main-tts.wasm CMakeFiles/sherpa-onnx-wasm-main-tts.dir/sherpa-onnx-wasm-main-tts.cc.o -L/home/dario/src/tts/sherpa-onnx/build-wasm-simd-tts/_deps/onnxruntime-src/lib ../../lib/libsherpa-onnx-c-api.a ../../lib/libsherpa-onnx-core.a ../../lib/libkaldi-native-fbank-core.a ../../lib/libkaldi-decoder-core.a ../../lib/libsherpa-onnx-kaldifst-core.a ../../_deps/onnxruntime-src/lib/libonnxruntime.a ../../lib/libpiper_phonemize.a ../../lib/libespeak-ng.a /home/dario/src/tts/sherpa-onnx/build-wasm-simd-tts/_deps/onnxruntime-src/lib/libonnxruntime.a -L/home/dario/src/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten ../../lib/libucd.a ../../lib/libsherpa-onnx-fstfar.a ../../lib/libsherpa-onnx-fst.a -lGL-getprocaddr -lal -lhtml5 -lstubs -lnoexit -lc -ldlmalloc -lcompiler_rt -lc++-noexcept -lc++abi-noexcept -lsockets -mllvm -combiner-global-alias-analysis=false -mllvm -enable-emscripten-sjlj -mllvm -disable-lsr /tmp/tmpwhke9p6slibemscripten_js_symbols.so --strip-debug --export=CopyHeap --export=malloc --export=free --export=MyPrint --export=SherpaOnnxCreateOfflineTts --export=SherpaOnnxDestroyOfflineTts --export=SherpaOnnxDestroyOfflineTtsGeneratedAudio --export=SherpaOnnxOfflineTtsGenerate --export=SherpaOnnxOfflineTtsGenerateWithCallback --export=SherpaOnnxOfflineTtsNumSpeakers --export=SherpaOnnxOfflineTtsSampleRate --export=SherpaOnnxWriteWave --export=_emscripten_stack_alloc --export=__get_temp_ret --export=__set_temp_ret --export=__wasm_call_ctors --export=emscripten_stack_get_current --export=_emscripten_stack_restore --export-if-defined=__start_em_asm --export-if-defined=__stop_em_asm --export-if-defined=__start_em_lib_deps --export-if-defined=__stop_em_lib_deps --export-if-defined=__start_em_js --export-if-defined=__stop_em_js --export-table -z stack-size=10485760 --max-memory=2147483648 --initial-memory=536870912 --no-entry --table-base=1 --global-base=1024
#0 0x0000564a79ff0228 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/home/dario/src/emsdk/upstream/bin/wasm-ld+0xf86228)
#1 0x0000564a79fed65e llvm::sys::RunSignalHandlers() (/home/dario/src/emsdk/upstream/bin/wasm-ld+0xf8365e)
#2 0x0000564a79ff0e7f SignalHandler(int) Signals.cpp:0:0
#3 0x00007722bf8f7520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
#4 0x00007722bf94b9fc pthread_kill (/lib/x86_64-linux-gnu/libc.so.6+0x969fc)
#5 0x00007722bf8f7476 gsignal (/lib/x86_64-linux-gnu/libc.so.6+0x42476)
#6 0x00007722bf8dd7f3 abort (/lib/x86_64-linux-gnu/libc.so.6+0x287f3)
#7 0x0000564a79f5e4c3 llvm::report_fatal_error(llvm::Twine const&, bool) (/home/dario/src/emsdk/upstream/bin/wasm-ld+0xef44c3)
#8 0x0000564a79f5e306 (/home/dario/src/emsdk/upstream/bin/wasm-ld+0xef4306)
#9 0x0000564a7ad5186e (/home/dario/src/emsdk/upstream/bin/wasm-ld+0x1ce786e)
#10 0x0000564a7c873a82 llvm::PassManager<llvm::Module, llvm::AnalysisManager<llvm::Module>>::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) (/home/dario/src/emsdk/upstream/bin/wasm-ld+0x3809a82)
#11 0x0000564a7ad4aae1 llvm::lto::opt(llvm::lto::Config const&, llvm::TargetMachine*, unsigned int, llvm::Module&, bool, llvm::ModuleSummaryIndex*, llvm::ModuleSummaryIndex const*, std::__2::vector<unsigned char, std::__2::allocator<unsigned char>> const&) (/home/dario/src/emsdk/upstream/bin/wasm-ld+0x1ce0ae1)
#12 0x0000564a7ad4ca42 llvm::lto::backend(llvm::lto::Config const&, std::__2::function<llvm::Expected<std::__2::unique_ptr<llvm::CachedFileStream, std::__2::default_delete<llvm::CachedFileStream>>> (unsigned int, llvm::Twine const&)>, unsigned int, llvm::Module&, llvm::ModuleSummaryIndex&) (/home/dario/src/emsdk/upstream/bin/wasm-ld+0x1ce2a42)
#13 0x0000564a7ad3c2aa llvm::lto::LTO::runRegularLTO(std::__2::function<llvm::Expected<std::__2::unique_ptr<llvm::CachedFileStream, std::__2::default_delete<llvm::CachedFileStream>>> (unsigned int, llvm::Twine const&)>) (/home/dario/src/emsdk/upstream/bin/wasm-ld+0x1cd22aa)
#14 0x0000564a7ad3b5c9 llvm::lto::LTO::run(std::__2::function<llvm::Expected<std::__2::unique_ptr<llvm::CachedFileStream, std::__2::default_delete<llvm::CachedFileStream>>> (unsigned int, llvm::Twine const&)>, std::__2::function<llvm::Expected<std::__2::function<llvm::Expected<std::__2::unique_ptr<llvm::CachedFileStream, std::__2::default_delete<llvm::CachedFileStream>>> (unsigned int, llvm::Twine const&)>> (unsigned int, llvm::StringRef, llvm::Twine const&)>) (/home/dario/src/emsdk/upstream/bin/wasm-ld+0x1cd15c9)
#15 0x0000564a7a3eca96 lld::wasm::BitcodeCompiler::compile() (/home/dario/src/emsdk/upstream/bin/wasm-ld+0x1382a96)
#16 0x0000564a7a3eea74 lld::wasm::SymbolTable::compileBitcodeFiles() (/home/dario/src/emsdk/upstream/bin/wasm-ld+0x1384a74)
#17 0x0000564a7a3d6135 lld::wasm::(anonymous namespace)::LinkerDriver::linkerMain(llvm::ArrayRef<char const*>) Driver.cpp:0:0
#18 0x0000564a7a3d1035 lld::wasm::link(llvm::ArrayRef<char const*>, llvm::raw_ostream&, llvm::raw_ostream&, bool, bool) (/home/dario/src/emsdk/upstream/bin/wasm-ld+0x1367035)
#19 0x0000564a79ff325e lld::unsafeLldMain(llvm::ArrayRef<char const*>, llvm::raw_ostream&, llvm::raw_ostream&, llvm::ArrayRef<lld::DriverDef>, bool) (/home/dario/src/emsdk/upstream/bin/wasm-ld+0xf8925e)
#20 0x0000564a79f37481 lld_main(int, char**, llvm::ToolContext const&) (/home/dario/src/emsdk/upstream/bin/wasm-ld+0xecd481)
#21 0x0000564a79f37e64 main (/home/dario/src/emsdk/upstream/bin/wasm-ld+0xecde64)
#22 0x00007722bf8ded90 (/lib/x86_64-linux-gnu/libc.so.6+0x29d90)
#23 0x00007722bf8dee40 __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x29e40)
#24 0x0000564a79eace2a _start (/home/dario/src/emsdk/upstream/bin/wasm-ld+0xe42e2a)
em++: error: '/home/dario/src/emsdk/upstream/bin/wasm-ld -o ../../bin/sherpa-onnx-wasm-main-tts.wasm CMakeFiles/sherpa-onnx-wasm-main-tts.dir/sherpa-onnx-wasm-main-tts.cc.o -L/home/dario/src/tts/sherpa-onnx/build-wasm-simd-tts/_deps/onnxruntime-src/lib ../../lib/libsherpa-onnx-c-api.a ../../lib/libsherpa-onnx-core.a ../../lib/libkaldi-native-fbank-core.a ../../lib/libkaldi-decoder-core.a ../../lib/libsherpa-onnx-kaldifst-core.a ../../_deps/onnxruntime-src/lib/libonnxruntime.a ../../lib/libpiper_phonemize.a ../../lib/libespeak-ng.a /home/dario/src/tts/sherpa-onnx/build-wasm-simd-tts/_deps/onnxruntime-src/lib/libonnxruntime.a -L/home/dario/src/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten ../../lib/libucd.a ../../lib/libsherpa-onnx-fstfar.a ../../lib/libsherpa-onnx-fst.a -lGL-getprocaddr -lal -lhtml5 -lstubs -lnoexit -lc -ldlmalloc -lcompiler_rt -lc++-noexcept -lc++abi-noexcept -lsockets -mllvm -combiner-global-alias-analysis=false -mllvm -enable-emscripten-sjlj -mllvm -disable-lsr /tmp/tmpwhke9p6slibemscripten_js_symbols.so --strip-debug --export=CopyHeap --export=malloc --export=free --export=MyPrint --export=SherpaOnnxCreateOfflineTts --export=SherpaOnnxDestroyOfflineTts --export=SherpaOnnxDestroyOfflineTtsGeneratedAudio --export=SherpaOnnxOfflineTtsGenerate --export=SherpaOnnxOfflineTtsGenerateWithCallback --export=SherpaOnnxOfflineTtsNumSpeakers --export=SherpaOnnxOfflineTtsSampleRate --export=SherpaOnnxWriteWave --export=_emscripten_stack_alloc --export=__get_temp_ret --export=__set_temp_ret --export=__wasm_call_ctors --export=emscripten_stack_get_current --export=_emscripten_stack_restore --export-if-defined=__start_em_asm --export-if-defined=__stop_em_asm --export-if-defined=__start_em_lib_deps --export-if-defined=__stop_em_lib_deps --export-if-defined=__start_em_js --export-if-defined=__stop_em_js --export-table -z stack-size=10485760 --max-memory=2147483648 --initial-memory=536870912 --no-entry --table-base=1 --global-base=1024' failed (received SIGABRT (-6))
make[2]: *** [wasm/tts/CMakeFiles/sherpa-onnx-wasm-main-tts.dir/build.make:111: bin/sherpa-onnx-wasm-main-tts.js] Error 1
make[1]: *** [CMakeFiles/Makefile2:1281: wasm/tts/CMakeFiles/sherpa-onnx-wasm-main-tts.dir/all] Error 2
make: *** [Makefile:156: all] Error 2
```
I installed cmake with apt-get, and then with pip; both got me that stack trace... Do you know what could have happened?
the model selected was:
https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-piper-es_MX-claude-high.tar.bz2
Best regards!
How much RAM does your computer have? Could you try `make -j1`? @puppetm4st3r
64gb, free at least 90%, will try!
Does it work now?
yes! thanks!
> @iSuslov can you provide a simple POC to test your work? I'm more of a backend person, but I need to implement this on the web with the fewest dependencies (HTML + JS + wasm if possible, with no additional frameworks like Node.js); I'm a little lost on where to start.
Hey @puppetm4st3r, I see your issue is resolved, but in case my script seems confusing I would like to clarify: the script will download and compile everything it needs, producing a wasm build in the same folder. Docker must be preinstalled.
Thanks! Now I've got another issue: when I compile with your script @iSuslov it works like a charm in a desktop web browser, but it did not work on iOS, failing with an OOM error. When I tried the other solution from @csukuangfj it works on iPhone, but I can't get it running for Spanish models with @csukuangfj's method. I'm stuck :(
> but I can't get it running for Spanish models with @csukuangfj's method.
Could you describe in detail why you cannot run it?
When I tried your advice it ultimately didn't work; it was a false positive, my mistake: the cache wouldn't refresh and I was testing with the solution from @iSuslov. It still gives me the stack trace that I attached here. But if I clone your sample code with the English model, it works (with no building process, just the sample code with the wasm binaries). I tried to compile inside a clean Docker container and outside Docker on my machine; both didn't work.

The script from @iSuslov works, but when I tried it on iOS it crashes with OOM; your sample from the HF space works on iOS without problems.
> I tried to compile inside a clean Docker container and outside Docker on my machine; both didn't work.

It would be great if you could post error logs. Otherwise, we don't know what you mean when you say it didn't work. @puppetm4st3r
@puppetm4st3r just out of curiosity, when you say you're testing it on iOS, do you mean you test it in Safari on iPhone? I've never faced any OOM issues with wasm. Maybe there is an issue in how this script is loaded.
I tried on iOS (iPhone Safari/Chrome), but I realized it is not the wasm. For some very strange reason, if I test from my device using my private network address (192.168.x.x), everything works fine; I just discovered that when accessing the same device via the router's public IP, it fails with an OOM error, which makes no sense. I will remotely debug the iPhone and bring you the logs and evidence to leave the case documented in case it is of use to someone. I hope I can solve it now that I know it is apparently an infrastructure problem...
@csukuangfj I will post the logs later (they are very long); maybe I will upload them to Drive or something...
@iSuslov additionally, I have tested on Android and it works fine; the problem is with iOS when exposing the service through the cloud, so I think it is a problem with the infra. But I still can't build with the guide from @csukuangfj (I still have to attach the logs of the build process).
For those who are looking for a ready-to-use solution, I have compiled all the knowledge shared in this thread into this library: https://github.com/diffusion-studio/vits-web .
Thanks to everyone here for the awesome solutions and code snippets!
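A hedged usage sketch (the package name and the predict/voiceId API reflect my reading of the project's README; treat them as assumptions):

```js
// Hedged sketch: browser TTS with vits-web; names are assumptions.
import * as tts from "@diffusionstudio/vits-web";

const wav = await tts.predict({
  text: "Text to speech in the browser!",
  voiceId: "en_US-hfc_female-medium",
});
new Audio(URL.createObjectURL(wav)).play(); // wav is a Blob
```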
Re "in the web browser" is tricky because we have to find someway to load these voice files for each time the voice is used, on each origin the voice is used.
There is Native Messaging where we can run/control/communicate to and from native applications from the browser.
This native-messaging-espeak-ng is one variation of what I've been doing with eSpeak-NG for years now, mainly because I wanted to support SSML input (see SSMLParser), which I don't see mentioned here at all.
What this (using Native Messaging) means is that we don't have to compile anything to WASM. We can use piper as-is, send input to piper, and send the output to the browser.
An option for using piper and onnx voices in the browser is through Speech Dispatcher, which Chromium-based browsers (Chrome, Brave, Opera, Edge) and Firefox use for the Web Speech API. I have added the piper module to Speech Dispatcher following the instructions in module request: piper #866.

Tested on Chromium Version 128.0.6586.0 (Developer Build) (64-bit) and Firefox Nightly 130.0a1. Chromium works. Firefox does not load the piper voices.
In pertinent part.
1. Download the piper executable from releases, extract the contents, and save to ~/.local/opt/piper.
2. Download a couple of .onnx files and save to ~/.local/share/piper/voices.
3. Create a symbolic link to the piper executable in ~/.local/bin: ln -s ~/.local/opt/piper/piper piper.
4. Install python3-speechd:

   ```
   sudo apt install python3-speechd
   spd-conf -u
   ```

5. Modify ~/.config/speech-dispatcher/speechd.conf to add the piper module

   ```
   AddModule "piper" "sd_generic" "piper.conf"
   ```

   or set piper as the default module

   ```
   DefaultModule espeak-ng
   # piper
   ```

6. Create ~/.config/speech-dispatcher/modules/piper.conf:

   ```
   Debug 0
   GenericExecuteSynth "printf %s \'$DATA\' | /home/xubuntu/.local/bin/piper --length_scale 1 --sentence_silence 0 --model ~/.local/share/piper/voices/$VOICE --output-raw | aplay -r 22050 -f S16_LE -t raw -"
   # only use medium quality voices to respect the 22050 rate for aplay in the command above.
   GenericCmdDependency "piper"
   GenericCmdDependency "aplay"
   GenericCmdDependency "printf"
   GenericSoundIconFolder "/usr/share/sounds/sound-icons/"
   GenericPunctNone ""
   GenericPunctSome "--punct=\"()<>[]{}\""
   GenericPunctMost "--punct=\"()[]{};:\""
   GenericPunctAll "--punct"
   #GenericStripPunctChars ""
   GenericLanguage "en" "en_US" "utf-8"
   AddVoice "en" "MALE1" "en_US-hfc_male-medium.onnx"
   AddVoice "en" "FEMALE1" "en_US-hfc_female-medium.onnx"
   DefaultVoice "en_US-hfc_male-medium.onnx"
   #GenericRateForceInteger 1
   #GenericRateAdd 1
   #GenericRateMultiply 100
   ```

7. Restart speech-dispatcher with speech-dispatcher restart.
8. Terminate and restart chrome: killall -9 chrome.
9. Open DevTools, test in the console:
```js
var voices = speechSynthesis.getVoices().filter(({name}) => name.includes("piper"));
var u = new SpeechSynthesisUtterance();
u.voice = voices[0];
u.text = "Test, test, test. Test to the point it breaks.";
speechSynthesis.speak(u);
console.log(JSON.stringify(voices.map(({default:_default, lang, localService, name, voiceURI}) => ({_default, lang, localService, name, voiceURI})), null, 2));
```
Output:

```
[
{
"_default": false,
"lang": "en",
"localService": true,
"name": "en_US-hfc_female-medium.onnx piper",
"voiceURI": "en_US-hfc_female-medium.onnx piper"
},
{
"_default": false,
"lang": "en",
"localService": true,
"name": "en_US-hfc_male-medium.onnx piper",
"voiceURI": "en_US-hfc_male-medium.onnx piper"
}
]
```
Has anyone here managed to get GPU inference working in the browser? Seems like this could provide massive speedups and would be especially useful for long form content like video narrations or audiobook generation.
From what I can see, the current packages for piper in the browser are as follows, but neither supports GPU inference.
Wanted to raise this here as a central spot so work is not duplicated. Would transformers.js be usable since piper is an onnx model? Or do we need something else?
I am looking to create a web-based audiobook generation program similar to my CLI project QuickPiperAudiobook. Feel free to reach out if anyone is working on similar things / wants to hack on things together.
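For what it's worth, onnxruntime-web does ship a WebGPU execution provider, so one hedged path is requesting it at session creation and falling back to the wasm (CPU) provider (the model file name here is a placeholder):

```js
// Hedged sketch: request GPU inference via onnxruntime-web's WebGPU EP,
// falling back to the wasm EP where WebGPU is unavailable.
import * as ort from "onnxruntime-web/webgpu";

const session = await ort.InferenceSession.create("voice.onnx", {
  executionProviders: ["webgpu", "wasm"],
});
// session.run(...) then proceeds exactly as in the CPU-only path.
```

Whether all the VITS ops used by piper models have WebGPU kernels is something to verify; unsupported ops would fall back to CPU.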
The vits-web version is slow. You have to load the WebAssembly module and the voices; some voices are 60MB. Here's a fork of vits-web that you can test online for yourself: https://guest271314.github.io/vits-web/.
I created a Native Messaging host to control the execution of piper from the browser, with the output_raw PCM stream sent to an arbitrary Web page, then written to a MediaStreamTrack so the TTS output can be shared with any peer in the world that has WebRTC implemented. I also wrote a Web Audio API version that uses AudioWorklet for real-time playback of the raw PCM stream from piper; see https://github.com/guest271314/native-messaging-piper: background.js for the MediaStreamTrackGenerator version, and background-aw.js for the AudioWorklet version.
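For readers unfamiliar with that approach, here is a minimal hedged sketch of an AudioWorklet PCM player (the message shape carrying Float32 chunks is an assumption, not the repo's actual protocol):

```js
// pcm-player.js: hedged sketch of an AudioWorkletProcessor that plays
// Float32 PCM chunks posted to its port (e.g. converted piper --output-raw).
class PCMPlayer extends AudioWorkletProcessor {
  constructor() {
    super();
    this.queue = [];
    this.port.onmessage = ({ data }) => this.queue.push(data); // Float32Array
  }
  process(_inputs, outputs) {
    const out = outputs[0][0]; // mono, 128 frames per render quantum
    let i = 0;
    while (i < out.length && this.queue.length) {
      const chunk = this.queue[0];
      const n = Math.min(out.length - i, chunk.length);
      out.set(chunk.subarray(0, n), i);
      i += n;
      if (n === chunk.length) this.queue.shift();
      else this.queue[0] = chunk.subarray(n);
    }
    return true; // keep processing even when the queue is empty
  }
}
registerProcessor("pcm-player", PCMPlayer);
```

On the page: `await ctx.audioWorklet.addModule("pcm-player.js")`, create `new AudioWorkletNode(ctx, "pcm-player")`, connect it to `ctx.destination`, and post each decoded Float32 chunk to `node.port`.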
@C-Loftus Ideally we compile piper to a WASM file, including the option to pass output_raw, so we can actually stream from WebAssembly/WASI, without using Emscripten so we don't have to deal with loading Workers, and so the same code can be run in the browser and using wasmtime.
Thanks for your work and context on that, @guest271314! The native messaging work is very cool. I think I was hoping to have it run entirely in the browser with the GPU and no need to install on the host.
At least for my use case, I am fine loading the voice every time it is used (I don't need real-time speed).
> Ideally we compile piper to a WASM file, including the option to pass output_raw
Isn't this WASM compilation already done at https://github.com/diffusionstudio/piper-wasm ? Don't we just need an integration from transformers.js or wonnx?
I am not as familiar with some of the lower-level browser APIs, so sorry if I am missing a connection between them and WebGPU that you are trying to point out.
> At least for my use case, I am fine loading the voice every time it is used (I don't need real-time speed).
Then you should be able to use the fork and/or the main vits-web code.
> I think I was hoping to have it run entirely in the browser with the GPU and no need to install on the host.
The example runs in the browser.
> Isn't this WASM compilation already done at https://github.com/diffusionstudio/piper-wasm ? Don't we just need an integration from transformers.js or wonnx?
If you look at the source code of the GitHub Pages example, the Emscripten-generated code is JavaScript, not .wasm; onnxruntime-web is used.

https://github.com/guest271314/vits-web/blob/patch-1/docs/index.js#L1-L2

```js
import { createPiperPhonemize } from "./piper.js";
import * as ort from "./onnx-runtimeweb.js";
```
Ideally we just use the global WebAssembly API itself to define piper in its entirety in a single .wasm file. At least that's how I see it.
Something like this:

```js
// https://www.webassemblyman.com/webassembly_wat_hello_world.html
// https://gist.github.com/cure53/f4581cee76d2445d8bd91f03d4fa7d3b
const wasm = new Uint8Array([0, 97, 115, 109, 1, 0, 0, 0, 1, 8, 2, 96, 1, 127, 0, 96, 0, 0, 2, 15, 1, 3, 101, 110, 118, 7, 106, 115, 112, 114, 105, 110, 116, 0, 0, 3, 2, 1, 1, 5, 3, 1, 0, 1, 7, 27, 2, 10, 112, 97, 103, 101, 109, 101, 109, 111, 114, 121, 2, 0, 10, 104, 101, 108, 108, 111, 119, 111, 114, 108, 100, 0, 1, 10, 8, 1, 6, 0, 65, 0, 16, 0, 11, 11, 19, 1, 0, 65, 0, 11, 13, 72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100, 33, 0]);
class Go {
constructor() {
this.importObject = {
env: {
jsprint: function jsprint(byteOffset) {
console.log(new TextDecoder().decode(new Uint8Array(memory.buffer).filter(Boolean)));
},
},
};
}
run(_instance) {
globalThis.memory = _instance.exports.pagememory;
globalThis.helloworld = _instance.exports.helloworld;
}
}
const go = new Go();
const {instance} = await WebAssembly.instantiateStreaming(fetch(URL.createObjectURL(new Blob([wasm],{
type: 'application/wasm',
}))), go.importObject);
go.run(instance);
helloworld();
```
where instead of helloworld() we call piper(). No extra runtime stuff, just the universal WebAssembly executable. That's one of the ideas that led to WebAssembly, from my understanding.
Native Messaging works for me. I don't have an issue executing code on my own machine from the browser.