rhasspy / piper

A fast, local neural text to speech system
https://rhasspy.github.io/piper-samples/
MIT License

Use ffplay ffmpeg instead of aplay? #258

Open ye110wd opened 1 year ago

ye110wd commented 1 year ago

I'm not even sure I should be posting this here.

Is it possible to pipe output to ffplay when streaming and ffmpeg when saving?

I ask because the output of piper could do with a softer/warmer sound for use with good headphones. I also want to directly (pipe) apply effects and save to aac/mp3

What I'm after is a recorded-from-vinyl-to-tape effect, or a tube effect. Something you can fall asleep listening to. At the moment the "S" sounds are scratchy and syllables have a very high dynamic range, giving a 'pumped' feeling. So high-frequency reduction plus a dynamic range compressor greatly increases comfort for extended use.

ffmpeg's DRC is hard for me to understand, so this isn't the best, but so far this is what I've got:

ffplay test.wav -af "aformat=channel_layouts=mono,asetrate=22050*0.9,acrossover=3500:order=20th[k][r];[r]anullsink;[k]anull"

asetrate=22050*0.9 slows/lowers the pitch (and speed) by 10%, which sounds good to me with en_US-libritts_r-medium.onnx

acrossover=3500:order=20th[k][r];[r]anullsink;[k]anull this cuts all frequencies above 3.5kHz.

Maybe this is more a community question?

ye110wd commented 1 year ago

I found out how to pipe for streaming: | ffplay -i pipe: -f s16le -ar 22050

So, with my previous unsatisfactory filters: echo "So now that I knew that the loud sections were a little better" | ./piper --model en_US-libritts_r-medium.onnx --speaker 0 --noise_scale 0.333 --length_scale 1 --noise_w 0.333 --output-raw | ffplay -i pipe: -f s16le -ar 22050 -af "aformat=channel_layouts=mono,asetrate=22050*0.9,acrossover=3500:order=20th[k][r];[r]anullsink;[k]anull"

A rough attempt at adding DRC without really understanding it. I need help. It works OK but not great: compand=attacks=0.2:points=-12.4/-18.4|-6/-12|0/-6.8:6

so now echo "So now that I knew that the loud sections were a little softer" | ./piper --model en_US-libritts_r-medium.onnx --speaker 0 --noise_scale 0.333 --length_scale 1.3 --noise_w 0.333 --output-raw | ffplay -i pipe: -f s16le -ar 22050 -af "compand=attacks=0.2:points=-12.4/-18.4|-6/-12|0/-6.8:6, aformat=channel_layouts=mono,asetrate=22050*0.9,acrossover=3500:order=20th[k][r];[r]anullsink;[k]anull"

acompressor works even better than compand acompressor=threshold=0.089:ratio=5:attack=20:release=100

echo "So now that I knew that the loud sections were a little softer" | ./piper --model en_US-libritts_r-medium.onnx --speaker 0 --noise_scale 0.333 --length_scale 1.3 --noise_w 0.333 --output-raw | ffplay -i pipe: -f s16le -ar 22050 -af "acompressor=threshold=0.089:ratio=5:attack=20:release=100, aformat=channel_layouts=mono,asetrate=22050*0.9,acrossover=3500:order=20th[k][r];[r]anullsink;[k]anull"
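For saving instead of previewing, the same chain should work with ffmpeg in place of ffplay (an untested sketch on my part; the model path and out.wav are placeholders, and note the raw-input options stay before `-i pipe:`):

```shell
# Reuse the compressor + mono downmix + 10% rate/pitch drop from the
# ffplay experiment above.
FILTERS='acompressor=threshold=0.089:ratio=5:attack=20:release=100,aformat=channel_layouts=mono,asetrate=22050*0.9'

# Sketch: swap ffplay for ffmpeg to write a file instead of playing.
# Guarded so it is a no-op when piper/ffmpeg are not installed.
if [ -x ./piper ] && command -v ffmpeg >/dev/null; then
  echo "Testing one two three" \
    | ./piper --model en_US-libritts_r-medium.onnx --output-raw \
    | ffmpeg -f s16le -ar 22050 -i pipe: -af "$FILTERS" -y out.wav
fi
```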

superequalizer sounds better than acrossover, but it may have errors at the end of the output. (The band gains are linear, so −35 dB is 10^(−35/20) = 0.01778...)

superequalizer=12b=0.5:13b=0.3:14b=0.2:15b=0.1:16b=0.06:17b=0.03:18b=0.02

echo "So now that I knew that the loud sections were a little softer" | ./piper --model en_US-libritts_r-medium.onnx --speaker 0 --noise_scale 0.333 --length_scale 1.3 --noise_w 0.333 --output-raw | ffplay -i pipe: -f s16le -ar 22050 -af "superequalizer=12b=0.5:13b=0.2:14b=0.02, acompressor=threshold=0.03:ratio=5:attack=20:release=100, acrossover=5000:order=20th[k][r];[r]anullsink;[k]anull, aformat=channel_layouts=mono,asetrate=22050*0.9"
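The superequalizer band values are linear gains, and the 10^(−35/20) conversion quoted above is easy to sanity-check in the shell with nothing but awk:

```shell
# Linear amplitude = 10^(dB/20); superequalizer band gains are linear.
db_to_linear() {
  awk -v db="$1" 'BEGIN { printf "%.5f\n", 10 ^ (db / 20) }'
}

db_to_linear -35   # 0.01778, the figure quoted above
```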

ye110wd commented 1 year ago

I've found how to stream. Now I'm looking for encoding wav directly to mp3, aac, flac, etc. To be clearer, pipe wav into ffmpeg.

I looked for something as a reference and found https://superuser.com/questions/322216/how-can-i-pipe-output-of-ffmpeg-to-ffplay I'm trying to get it to work. No success yet. If anyone understands, the help would be appreciated. I'll keep trying.

ye110wd commented 1 year ago

I'm giving up for the moment. I'll just save to wav and convert to mp3 with another script. Below works with output-raw

So ffmpeg may have a problem. This works for streaming: | ffplay -i pipe: -f s16le -ar 22050

This was tested with a single sentence using --output-raw and --output_file. Both fail: | ffmpeg -i pipe: -f s16le -ar 22050 out.wav

error from ffmpeg report

[NULL @ 0x55f715c9f740] Opening 'pipe:' for reading
[pipe @ 0x55f715ca03c0] Setting default whitelist 'crypto,data'
[AVIOContext @ 0x55f715cb0740] Statistics: 122484 bytes read, 0 seeks
pipe:: Invalid data found when processing input

I think an output_file 'pipe' option would be nice so it can be fed to ffmpeg.

iconoclasthero commented 1 year ago

I'm trying to figure out how to pipe to ffmpeg and output as 17 kb/s .opus. My call is:

$ cat Empire-cut.txt | piper --model /lib/piper-voices/en/en_US/ryan/high/en_US-ryan-high.onnx --output-raw | ffmpeg -f s16le -ar 22050 -i pipe: -y -c:a libopus -b:a 17k empire.opus

That -ar 22050 is necessary (and where it is) so that ffmpeg will read/write the speed correctly. I did this and it came out as an intelligible .opus file at ca. 17 kb/s with the corresponding duration-calculated file size.

It would be really helpful for—I think—a lot of people just starting out with this if the | ffmpeg info could be included in the general documentation. I'm sure that it would have saved a number of people looking at this message a decent amount of time.

ye110wd commented 1 year ago

Thanks iconoclasthero

It would be really helpful for—I think—a lot of people just starting out with this if the | ffmpeg info could be included in the general documentation. I'm sure that it would have saved a number of people looking at this message a decent amount of time.

Yes, agree.

Also mentioning that --output-raw will not use sentence-wise decoding with ffmpeg.

echo $(xclip -o -sel clip) | ./piper --model en_US-libritts_r-medium.onnx --speaker 0 --noise_scale 0.333 --length_scale 1.35 --noise_w 0.333 --output-raw | ffmpeg -f s16le -ar 22050 -i pipe: -y -c:a libopus -b:a 17k out.opus works
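The working opus pipeline above can be wrapped in a small function for reuse (my sketch, not project code; the model path and 17k bitrate are placeholders):

```shell
# Wrap the text -> piper -> ffmpeg -> opus pipeline in a function.
# Model path and bitrate are placeholders; requires piper and ffmpeg.
tts_to_opus() {
  # $1 = input text file, $2 = output .opus file
  ./piper --model en_US-libritts_r-medium.onnx --output-raw < "$1" \
    | ffmpeg -f s16le -ar 22050 -i pipe: -y -c:a libopus -b:a 17k "$2"
}

# Usage: tts_to_opus chapter1.txt chapter1.opus
```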

iconoclasthero commented 1 year ago

Also mentioning that --output-raw will not use sentence wise decoding with ffmpeg.

What's this now? I ran a book through it today, but it isn't something I'm super interested in listening to. I did just get something else that I do want to listen to so I'm all ears as to what this means. It did seem that the output I had was somewhat choppy going from sentence to sentence.

iconoclasthero commented 1 year ago

Ok, so why do I want to echo something from the clipboard vs. cat file.txt? Also, what's the deal with all the other stuff you have going on in there like --speaker, --noise_scale, etc.? How do you go about selecting/tuning those?

Also, when I convert files to opus, I use this filter. Someone on #ffmpeg gave it to me, it's what he uses. It strips off a little more data than without it and since I'm sacrificing quality for size, it's acceptable.

ye110wd commented 1 year ago

All of the below sort of proves that a wiki is needed. Piper is the first good TTS for Linux. I really hope it grows.

It did seem that the output I had was somewhat choppy going from sentence to sentence.

Maybe that is just what it does. I'm using a CPU from 2010, so maybe it's slow enough that transcode time is slightly longer than piper's generation time? In that case we really need a dedicated output_file pipe setting. https://github.com/rhasspy/piper/pull/255

Ok, so why do I want to echo something from the clipboard vs. cat file.txt?

cat and echo are just different ways to pipe text in. All have the same result.

I've been using it mostly to read long pages I copied to the clipboard from the browser. I wrote something messy with zenity (I have no coding skills besides copy/paste of other people's work) https://github.com/rhasspy/piper/discussions/248

what's the deal with all the other stuff you have going on in there like --speaker, --noise_scale, etc

You can find them in the JSON file accompanying the voice file; also see ./piper --help

I use those separate settings instead of --config en_US-libritts_r-medium.onnx.json because I wanted to adjust the settings. i.e. Increasing --length_scale makes speech slower. I use --length_scale 1.35

The --speaker setting is in the JSON too. It selects different speaking styles and accents. Speaker 0 is the default.
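One way to see what a voice offers is to inspect that JSON config directly (a sketch assuming the `audio.sample_rate` and `num_speakers` keys found in piper voice .onnx.json files; the path is a placeholder):

```shell
# Print a voice's sample rate and speaker count from its .onnx.json config.
# Assumes the "audio.sample_rate" and "num_speakers" keys of piper voice files.
voice_info() {
  python3 -c '
import json, sys
cfg = json.load(open(sys.argv[1]))
print(cfg["audio"]["sample_rate"], cfg.get("num_speakers", 1))
' "$1"
}

# Placeholder path; point it at a downloaded voice config.
if [ -f en_US-libritts_r-medium.onnx.json ]; then
  voice_info en_US-libritts_r-medium.onnx.json
fi
```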

Just to point out to others: ffmpeg needs the format and rate options BEFORE -i pipe:

| ffmpeg -f s16le -ar 22050 -i pipe:
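The reason for that ordering (as I understand ffmpeg's option handling): input options apply to the input that follows them, so the raw-PCM description must precede `-i pipe:`. A small contrast of the two forms, with out.wav as a placeholder:

```shell
# Correct: declare the raw stream's format and rate BEFORE naming the input,
# so ffmpeg knows how to interpret the headerless bytes.
GOOD='ffmpeg -f s16le -ar 22050 -i pipe: -y out.wav'

# Wrong: after -i, the same options no longer describe the input; ffmpeg
# probes the headerless stream and fails with "Invalid data found".
BAD='ffmpeg -i pipe: -f s16le -ar 22050 out.wav'

echo "correct form: $GOOD"
```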

Thanks again iconoclasthero

I use this filter

It's OK, but for about ~20% more data (YMMV) this may sound better for voice: | ffmpeg -f s16le -ar 22050 -i pipe: -y -af "acrossover=5000:order=20th[k][r];[r]anullsink;[k]anull" -q:a 8 out.mp3

Opus is better below 64k but can have audio noise at the high end. A higher variable-bitrate MP3 has less of it. ffmpeg audio filters shouldn't be discussed in depth here, but they do help with file size and sound. I use good headphones to listen.

iconoclasthero commented 1 year ago

All below sort of proves that a wiki is needed. Piper is the first good tts for linux. I really hope it grows.

Well, I've listened to a couple of books that were produced by something called SkyNet and it was the first time that TTS was worth listening to, so I decided to see what was available for converting eBooks to audiobooks on my machine. I don't know if it's the best available for Linux, but it satisfices my needs. WRT a wiki, the lord god helps those who help themselves. I haven't seen any response from the project maintainer on this thread, but if you want a wiki, I'm sure a good first step would be organizing the information you have and want into a framework that the maintainer & other users can flesh out.

Maybe that is just what it does. I'm using a CPU from 2010, so maybe it's slow enough that transcode time is slightly longer than piper's generation time? In that case we really need a dedicated output_file pipe setting. #255

Now that I did a TTS straight to wav and converted it, I'm not sure that there's a difference. It's tough to really tell. The way I understand it, the pipe works blockwise: piper converts whatever it does (it seems like it goes sentence by sentence) and then that gets piped out to ffmpeg, which converts what it can until it fills up the block* and saves it to disk (or ramdrive in my case).

I've been using it to read long pages I copied to clipboard from browser mostly. I wrote something messy with zenity (I have no coding skills besides copy/paste of other peoples work) #248

Ahh, the venerable cargo coding method; one of my favorites!

I use those separate settings instead of --config en_US-libritts_r-medium.onnx.json because I wanted to adjust the settings. i.e. Increasing --length_scale makes speech slower. I use --length_scale 1.35

(I think you mean e.g., which means "for example" as it comes from exempli gratia; i.e. means 'that is' from id est.)

I generally listen to books at ca. 2x so that's not much use for me.

Just to point out to others: ffmpeg needs the format and rate options BEFORE -i pipe:

| ffmpeg -f s16le -ar 22050 -i pipe: Thanks again iconoclasthero

I use this filter

It's OK, but for about ~20% more data (YMMV) this may sound better for voice: | ffmpeg -f s16le -ar 22050 -i pipe: -y -af "acrossover=5000:order=20th[k][r];[r]anullsink;[k]anull" -q:a 8 out.mp3

Opus is better below 64k but can have audio noise at the high end. A higher variable-bitrate MP3 has less of it. ffmpeg audio filters shouldn't be discussed in depth here, but they do help with file size and sound. I use good headphones to listen.

Yeah, ffmpeg can be picky about where stuff goes and I don't claim to understand all of it. If I have issues with ffmpeg, #ffmpeg on Libera.Chat IRC is my go-to. I refuse to use a codec that's 35 years old as if nothing in the digital world has improved any in three and a half decades. I use opus to stream mpd over icecast and don't have any issues from 32-96 kb/s (depending on what my data source is). Given that I'm usually listening outside on a bluetooth speaker or in an open Jeep, I doubt I'd notice anyway. Somehow I get a bit of a bonus on file size if I set the output encoding frequency at 24000 even though opus internally uses 48000. It saves about 5% somehow. Across a library of thousands of titles, that adds up. I'll eventually try it with piper, just haven't gotten there yet.

As far as processing: when I convert audiobooks to opus, one ffmpeg/core is optimal so I can run 4 parallel ffmpegs for highest throughput. How does this relate to piper | ffmpeg? I was looking at htop and it seems like piper is using the bulk of the computing time and the blocks sent to ffmpeg don't really use a whole lot of it. As such, it may be less efficient to try to do this in parallel rather than sequentially.

*https://www.linuxtoday.com/blog/blocking-and-non-blocking-i-0/

iconoclasthero commented 1 year ago

It seems to me that running these in parallel is of little use if not outright counterproductive. I have one call for piper and it's monopolizing all my cores:

[screenshot: htop showing a single piper call saturating all cores]

ye110wd commented 11 months ago

so I tested echo -e "1\n\n2\n\n3\n\n" | ./piper --model en_US-libritts_r-medium.onnx --output_file - | ffplay -f s16le -ar 22050 -ac 1 -i pipe:

It has extreme pops/clicks with --output_file, and quieter but still noticeable pops/clicks with --output-raw.

It has been there at least since v1.2. I'm using the latest amd64 Linux binary.

It clicks on every newline (0x0A).

With my audio filters it also clears up enough audio to hear a whistle/chirp sometimes mid sentence. I have to test further.

edit> This works without pops but only decodes up to the first newline: echo -e "1\n\n2\n\n3\n\n" | ./piper --model en_US-libritts_r-medium.onnx --output_file - | ffplay -f wav -ar 22050 -ac 1 -i pipe:
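A guess at the cause, which is an assumption on my part and not confirmed anywhere in this thread: with --output_file -, each sentence may arrive as a complete WAV, so every RIFF header after the first gets decoded as PCM by -f s16le. Header bytes read as 16-bit samples are loud values, which would sound exactly like a click (this would explain at least the --output_file case):

```shell
# The ASCII bytes of a RIFF/WAVE header, reinterpreted as signed 16-bit
# samples, are large non-zero values -- an audible click, not silence.
printf 'RIFFWAVE' | od -An -t d2
```

It would also fit the -f wav observation above: ffplay honors the first header and stops when that first WAV's data ends.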

904557558 commented 10 months ago

I used the following command to enable ffplay to play the audio directly and exit the audio window automatically when the playback was complete.

echo "This is the test audio." | ./piper --model en_US-libritts_r-medium.onnx --output-raw | ffplay - -f s16le -ar 22050 -autoexit

clort81 commented 6 months ago

My problem is that piper's monophonic output only gets cast to stereo if it's the sole input stream to alsa. If there are stereo sources playing at the same time, the mono stream only goes to the left speaker.

echo "$my_paragraph" | piper --model /pr/TextToSpeech/Piper/Voices/en_US-ryan-high.onnx --noise_scale 0.433 --length_scale 1.0 --noise_w 0.436 --output-raw 2>/dev/null | ffmpeg -v error -f s16le -ar 22.05k -ac 1 -i - -af "pan=stereo|c0=c0|c1=c0,aresample=resampler=soxr:osr=22050,superequalizer=3b=0.6:4b=0.6:5b=0.7:6b=0.8:11b=1.1:12b=1.2:13b=1.3:14b=1.2:15b=1.1:16b=1.16:17b=1.03:18b=1.00" -f s16le - 2>/dev/null | buffer -m 100000 -u 100 -p 50% | ffplay -nodisp -autoexit -v error -f s16le -af "channelsplit,amerge" - 2>/dev/null

The EQ is there to reduce the baritone range and boost treble speech for better intelligibility. The osr needs to be adapted to the model frequency, as does the -ar (11, 16, 22.05). The buffer there was an attempt to get rid of buffer underflows.

Occasionally piper goes into 'fastread' mode where the voice is double-speed and unintelligible. Happens most often with Thorsten for me.

Good enough to read books. Thank the authors for this project.