ye110wd opened 1 year ago
I found how to pipe for streaming.
| ffplay -i pipe: -f s16le -ar 22050
So, with my previous unsatisfactory filters:
echo "So now that I knew that the loud sections were a little better" | ./piper --model en_US-libritts_r-medium.onnx --speaker 0 --noise_scale 0.333 --length_scale 1 --noise_w 0.333 --output-raw | ffplay -i pipe: -f s16le -ar 22050 -af "aformat=channel_layouts=mono,asetrate=22050*0.9,acrossover=3500:order=20th[k][r];[r]anullsink;[k]anull"
A rough attempt at adding DRC (dynamic range compression) without really understanding it. I need help. It works OK but not great.
compand=attacks=0.2:points=-12.4/-18.4|-6/-12|0/-6.8:6
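For what it's worth, a hedged reading of that line (my understanding of the compand syntax, not authoritative):
# attack time 0.2 s; transfer points map input dB -> output dB; the trailing :6 lands on the next positional parameter, the soft-knee width in dB
compand=attacks=0.2:points=-12.4/-18.4|-6/-12|0/-6.8:6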
So now:
echo "So now that I knew that the loud sections were a little softer" | ./piper --model en_US-libritts_r-medium.onnx --speaker 0 --noise_scale 0.333 --length_scale 1.3 --noise_w 0.333 --output-raw | ffplay -i pipe: -f s16le -ar 22050 -af "compand=attacks=0.2:points=-12.4/-18.4|-6/-12|0/-6.8:6, aformat=channel_layouts=mono,asetrate=22050*0.9,acrossover=3500:order=20th[k][r];[r]anullsink;[k]anull"
acompressor works even better than compand. (Its threshold is linear amplitude, so 0.089 ≈ −21 dB; attack and release are in milliseconds.)
acompressor=threshold=0.089:ratio=5:attack=20:release=100
echo "So now that I knew that the loud sections were a little softer" | ./piper --model en_US-libritts_r-medium.onnx --speaker 0 --noise_scale 0.333 --length_scale 1.3 --noise_w 0.333 --output-raw | ffplay -i pipe: -f s16le -ar 22050 -af "acompressor=threshold=0.089:ratio=5:attack=20:release=100, aformat=channel_layouts=mono,asetrate=22050*0.9,acrossover=3500:order=20th[k][r];[r]anullsink;[k]anull"
superequalizer sounds better than acrossover, but it may have errors at the end of the output. (Band gains are linear amplitude: 10^(−35 dB/20) = 0.01778….)
superequalizer=12b=0.5:13b=0.3:14b=0.2:15b=0.1:16b=0.06:17b=0.03:18b=0.02
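For orientation (from the ffmpeg docs, quoting from memory, so double-check): the superequalizer band numbers sit at fixed center frequencies, roughly 12b ≈ 2960 Hz, 13b ≈ 4186 Hz, 14b ≈ 5920 Hz, 15b ≈ 8372 Hz, 16b ≈ 11840 Hz, 17b ≈ 16744 Hz, 18b ≈ 20000 Hz, and the gains are linear multipliers (1 = unchanged), which is where the 10^(−dB/20) conversion above comes in.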
echo "So now that I knew that the loud sections were a little softer" | ./piper --model en_US-libritts_r-medium.onnx --speaker 0 --noise_scale 0.333 --length_scale 1.3 --noise_w 0.333 --output-raw | ffplay -i pipe: -f s16le -ar 22050 -af "superequalizer=12b=0.5:13b=0.2:14b=0.02, acompressor=threshold=0.03:ratio=5:attack=20:release=100, acrossover=5000:order=20th[k][r];[r]anullsink;[k]anull, aformat=channel_layouts=mono,asetrate=22050*0.9"
I've found how to stream. Now I'm looking at encoding the WAV directly to mp3, aac, flac, etc. To be clearer: piping WAV into ffmpeg.
I looked for something as a reference and found https://superuser.com/questions/322216/how-can-i-pipe-output-of-ffmpeg-to-ffplay I'm trying to get it to work. No success yet. If anyone understands it, the help would be appreciated. I'll keep trying.
I'm giving up for the moment. I'll just save to WAV and convert to mp3 with another script. The ffplay line below works with --output-raw, so ffmpeg may have a problem.
This works for streaming
| ffplay -i pipe: -f s16le -ar 22050
This was tested with a single sentence using --output-raw and --output_file. Both fail:
| ffmpeg -i pipe: -f s16le -ar 22050 out.wav
Error from the ffmpeg report:
[NULL @ 0x55f715c9f740] Opening 'pipe:' for reading
[pipe @ 0x55f715ca03c0] Setting default whitelist 'crypto,data'
[AVIOContext @ 0x55f715cb0740] Statistics: 122484 bytes read, 0 seeks
pipe:: Invalid data found when processing input
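Reading that log, I think the failure is just argument order: anything after -i pipe: is treated as an output option, so ffmpeg probes the raw stream with no format hint and gives up. Moving the input description before -i should fix it (a sketch, the same idea iconoclasthero confirms below):
# input options (-f, -ar, -ac) must come before -i pipe:
echo "test" | ./piper --model en_US-libritts_r-medium.onnx --output-raw | ffmpeg -f s16le -ar 22050 -ac 1 -i pipe: out.wav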
I think an output_file 'pipe' option would be nice so it can be fed to ffmpeg.
I'm trying to figure out how to pipe to ffmpeg and output as 17 kb/s .opus. My call is:
$ cat Empire-cut.txt | piper --model /lib/piper-voices/en/en_US/ryan/high/en_US-ryan-high.onnx --output-raw | ffmpeg -f s16le -ar 22050 -i pipe: -y -c:a libopus -b:a 17k empire.opus
That -ar 22050 is necessary (and in that position, before -i) so that ffmpeg will read the stream at the correct speed. I did this and it comes out as an intelligible .opus file at ca. 17 kb/s with the corresponding duration-calculated file size.
It would be really helpful for—I think—a lot of people just starting out with this if the | ffmpeg info could be included in the general documentation. I'm sure that it would have saved a number of people looking at this message a decent amount of time.
Thanks iconoclasthero
> It would be really helpful for—I think—a lot of people just starting out with this if the | ffmpeg info could be included in the general documentation. I'm sure that it would have saved a number of people looking at this message a decent amount of time.
Yes, agree.
Also mentioning that --output-raw will not use sentence-wise decoding with ffmpeg.
echo $(xclip -o -sel clip) | ./piper --model en_US-libritts_r-medium.onnx --speaker 0 --noise_scale 0.333 --length_scale 1.35 --noise_w 0.333 --output-raw | ffmpeg -f s16le -ar 22050 -i pipe: -y -c:a libopus -b:a 17k out.opus
works
> Also mentioning that --output-raw will not use sentence-wise decoding with ffmpeg.
What's this now? I ran a book through it today, but it isn't something I'm super interested in listening to. I did just get something else that I do want to listen to so I'm all ears as to what this means. It did seem that the output I had was somewhat choppy going from sentence to sentence.
Ok, so why do I want to echo something from the clipboard vs. cat file.txt? Also, what's the deal with all the other stuff you have going on in there like --speaker, --noise_scale, etc.? How do you go about selecting/tuning those?
Also, when I convert files to opus, I use this filter. Someone on #ffmpeg gave it to me; it's what he uses. It strips off a little more data than going without it, and since I'm sacrificing quality for size, that's acceptable.
All below sort of proves that a wiki is needed. Piper is the first good tts for linux. I really hope it grows.
> It did seem that the output I had was somewhat choppy going from sentence to sentence.
Maybe that is what it does. I'm using a CPU from 2010, so maybe it's slow enough that transcode time is slightly longer than piper generation time? In that case we'd really need a dedicated output_file pipe setting. https://github.com/rhasspy/piper/pull/255
> Ok, so why do I want to echo something from the clipboard vs. cat file.txt?
cat and echo are just different ways to pipe text into a command's stdin. They all have the same result.
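For example, these all feed piper the same text on stdin (file.txt here is just a placeholder name):
cat file.txt | ./piper --model en_US-libritts_r-medium.onnx --output-raw | ffplay -i pipe: -f s16le -ar 22050
./piper --model en_US-libritts_r-medium.onnx --output-raw < file.txt | ffplay -i pipe: -f s16le -ar 22050
echo "$(xclip -o -sel clip)" | ./piper --model en_US-libritts_r-medium.onnx --output-raw | ffplay -i pipe: -f s16le -ar 22050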
I've been using it to read long pages I copied to the clipboard from the browser, mostly. I wrote something messy with zenity (I have no coding skills besides copy/paste of other people's work): https://github.com/rhasspy/piper/discussions/248
> what's the deal with all the other stuff you have going on in there like --speaker, --noise_scale, etc
You can find them in the JSON file accompanying the voice file; also see ./piper --help.
I use those separate settings instead of --config en_US-libritts_r-medium.onnx.json because I wanted to adjust the settings. i.e. increasing --length_scale makes speech slower. I use --length_scale 1.35
The speaker setting is in the JSON too. It selects different speaking styles and accents; speaker 0 is the default.
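If you have jq, you can peek at the defaults; this assumes the usual piper voice config layout (an inference block plus num_speakers), which I haven't verified for every voice:
# hypothetical inspection; key names assume the standard piper voice JSON
jq '.inference' en_US-libritts_r-medium.onnx.json
jq '.num_speakers' en_US-libritts_r-medium.onnx.json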
Just to point out to others: the order of ffmpeg arguments needs to have format and rate BEFORE the pipe:
| ffmpeg -f s16le -ar 22050 -i pipe:
Thanks again iconoclasthero
> I use this filter
It's OK, but for about ~20% more data (YMMV) this may sound better for voice:
| ffmpeg -f s16le -ar 22050 -i pipe: -y -af "acrossover=5000:order=20th[k][r];[r]anullsink;[k]anull" -q:a 8 out.mp3
opus is better below 64k but can have audible noise at the high end. A higher variable-bitrate mp3 has less of it. ffmpeg audio filters shouldn't be discussed in depth here, but they do help with file size and sound. I use good headphones to listen.
> All below sort of proves that a wiki is needed. Piper is the first good tts for linux. I really hope it grows.
Well, I've listened to a couple books that were produced by something called SkyNet and it was the first time that TTS was worth listening to, so I decided to see what was available for me to convert eBooks to audiobooks on my machine. I don't know if it's the best available for linux, but it satisfices my needs. WRT a wiki, the lord god helps those who help themselves. I haven't seen any response from the project maintainer on this thread, but if you want a wiki, I'm sure that a good first step would be organizing the information you have and want into a framework that the maintainer & other users can flesh out.
> Maybe that is what it does. I'm using a CPU from 2010, so maybe it's slow enough that transcode time is slightly longer than piper generation time? In that case we'd really need a dedicated output_file pipe setting. #255
Now that I did a TTS straight to WAV and converted it, I'm not sure that there's a difference. It's tough to really tell. The way I understand it, the pipe works blockwise: piper converts whatever it does (it seems like it goes sentence by sentence) and then that gets piped out to ffmpeg, which converts what it can until it fills up the block* and saves it to disk (or ramdrive in my case).
> I've been using it to read long pages I copied to the clipboard from the browser, mostly. I wrote something messy with zenity (I have no coding skills besides copy/paste of other people's work): #248
Ahh, the venerable cargo coding method; one of my favorites!
> I use those separate settings instead of --config en_US-libritts_r-medium.onnx.json because I wanted to adjust the settings. i.e. increasing --length_scale makes speech slower. I use --length_scale 1.35
(I think you mean e.g., which means "for example" as it comes from exempli gratia; i.e. means 'that is' from id est.)
I generally listen to books at ca. 2x so that's not much use for me.
> Just to point out to others: the order of ffmpeg arguments needs to have format and rate BEFORE the pipe:
> | ffmpeg -f s16le -ar 22050 -i pipe:
> Thanks again iconoclasthero
> I use this filter
> It's OK, but for about ~20% more data (YMMV) this may sound better for voice:
> | ffmpeg -f s16le -ar 22050 -i pipe: -y -af "acrossover=5000:order=20th[k][r];[r]anullsink;[k]anull" -q:a 8 out.mp3
> opus is better below 64k but can have audible noise at the high end. A higher variable-bitrate mp3 has less of it. ffmpeg audio filters shouldn't be discussed in depth here, but they do help with file size and sound. I use good headphones to listen.
Yeah, ffmpeg can be picky about where stuff goes and I don't claim to understand all of it. If I have issues with ffmpeg, #ffmpeg on Libera.Chat IRC is my go-to. I refuse to use a codec that's 35 years old as if nothing in the digital world has improved any in three and a half decades. I use opus to stream mpd over icecast and don't have any issues from 32-96 kb/s (depending on what my data source is). Given that I'm usually listening outside on a bluetooth speaker or in an open Jeep, I doubt I'd notice anyway.
Somehow I get a bit of a bonus on file size if I set the output encoding frequency at 24000 even though opus internally uses 48000. It saves about 5% somehow. Across a library of thousands of titles, that adds up. I'll eventually try it with piper, just haven't gotten there yet.
As far as processing: when I convert audiobooks to opus, one ffmpeg/core is optimal so I can run 4 parallel ffmpegs for highest throughput. How does this relate to piper | ffmpeg? I was looking at htop and it seems like piper is using the bulk of the computing time and the blocks sent to ffmpeg don't really use a whole lot of it. As such, it may be less efficient to try to do this in parallel rather than sequentially.
*https://www.linuxtoday.com/blog/blocking-and-non-blocking-i-0/
It seems to me that running these in parallel is of little use if not outright counterproductive. I have one call for piper and it's monopolizing all my cores.
So I tested:
echo -e "1\n\n2\n\n3\n\n" | ./piper --model en_US-libritts_r-medium.onnx --output_file - | ffplay -f s16le -ar 22050 -ac 1 -i pipe:
It has extreme pops/clicks with --output_file and quieter but still noticeable pops/clicks with --output-raw.
It has been there at least since v1.2. I'm using the latest amd64 Linux binary.
It clicks on every newline (0A).
With my audio filters it also clears up enough audio to hear a whistle/chirp sometimes mid-sentence. I have to test further.
edit> This works without pops but only decodes up to the first newline, which suggests each line is being emitted as its own complete WAV: -f wav stops at the first file's declared length, while -f s16le plays the extra headers as samples, hence the pops.
echo -e "1\n\n2\n\n3\n\n" | ./piper --model en_US-libritts_r-medium.onnx --output_file - | ffplay -f wav -ar 22050 -ac 1 -i pipe:
I used the following command to enable ffplay to play the audio directly and exit the audio window automatically when the playback was complete.
echo "This is the test audio." | ./piper --model en_US-libritts_r-medium.onnx --output-raw | ffplay - -f s16le -ar 22050 -autoexit
My problem is that piper's monophonic output only gets cast to stereo if it's the sole input stream to alsa. If there are stereo sources playing at the same time, the mono stream only goes to the left speaker.
echo "$my_paragraph" | piper --model /pr/TextToSpeech/Piper/Voices/en_US-ryan-high.onnx ---noise_scale 0.433 --length_scale 1.0 --noise_w 0.436 --output-raw 2>/dev/null | ffmpeg -v error -f s16le -ar 22.05k -ac 1 -i - -af "pan=stereo|c0=c0|c1=c0,aresample=resampler=soxr:osr=22050,superequalizer=3b=0.6:4b=0.6:5b=0.7:6b=0.8:11b=1.1:12b=1.2:13b=1.3:14b=1.2:15b=1.1:16b=1.16:17b=1.03:18b=1.00" -f s16le - 2>/dev/null | buffer -m 100000 -u 100 -p 50% | ffplay -nodisp -autoexit -v error -f s16le -af "channelsplit,amerge" - 2>/dev/null
The EQ is there to reduce the baritone range and boost treble speech for better intelligibility. The osr and the -ar need to be adapted to the model frequency (11, 16, or 22.05 kHz). The buffer there was an attempt to get rid of buffer underruns.
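If someone only wants the mono-to-both-speakers part without the EQ and buffering, a minimal sketch (assuming a 22050 Hz model):
# duplicate the single mono channel into both stereo channels
echo "test" | piper --model en_US-ryan-high.onnx --output-raw | ffplay -autoexit -f s16le -ar 22050 -ac 1 -af "pan=stereo|c0=c0|c1=c0" -i pipe: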
Occasionally piper goes into 'fastread' mode where the voice is double-speed and unintelligible. Happens most often with Thorsten for me.
Good enough to read books. Thank the authors for this project.
I'm not even sure I should be posting this here.
Is it possible to pipe output to ffplay when streaming and ffmpeg when saving?
I ask because the output of piper could do with a softer/warmer sound for use with good headphones. I also want to directly (pipe) apply effects and save to aac/mp3.
What I'm after is a recorded-from-vinyl-to-tape effect or a tube effect. Something you can fall asleep listening to. At the moment the "S" sounds are scratchy and syllables have a very high dynamic range, giving a 'pumped' feeling. So high-frequency reduction and a dynamic range compressor greatly increase comfort for extended use.
ffmpeg DRC is impossible for me to understand, so this isn't the best, but it's what I've got so far:
ffplay test.wav -af "aformat=channel_layouts=mono,asetrate=22050*0.9,acrossover=3500:order=20th[k][r];[r]anullsink;[k]anull"
asetrate=22050*0.9
which slows/lowers pitch (and speed) by 10%, which sounds good to me with en_US-libritts_r-medium.onnx.
acrossover=3500:order=20th[k][r];[r]anullsink;[k]anull
This cuts all frequencies above 3.5 kHz. Maybe this is more a community question?
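If the goal is only to cut everything above 3.5 kHz, ffmpeg's plain lowpass filter may be a simpler alternative to the crossover/anullsink routing (note its roll-off is much gentler than a 20th-order crossover):
ffplay test.wav -af "aformat=channel_layouts=mono,asetrate=22050*0.9,lowpass=f=3500"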