project-spiel / libspiel

Speech synthesis client library
https://project-spiel.org/libspiel
GNU Lesser General Public License v2.1

Considerations about pitch/rate/volume measurement and minimums/maximums. #18

Closed TheMuso closed 10 months ago

TheMuso commented 10 months ago

I have been reading the espeak provider code in spiel-demos, and noticed that the pitch, rate, and volume values are each multiplied by a different factor. Volume makes sense for the most part, and the rate is multiplied by a constant, NORMAL_RATE, which also makes sense. However, the value used for pitch doesn't make sense to me.

After reading this, playing with the espeak and spiel-it flatpaks, and reading the D-Bus interface documentation, I wonder whether there should be a mechanism for a provider to report how it measures pitch, rate, and volume.

One thing I have never liked about Speech Dispatcher is that rate and pitch are set between -100 and 100, and then adjusted according to a synthesizer's minimum and maximum values. As a user, I would prefer to work with values that are native to that synthesizer, e.g. espeak's pitch is between 0 and 99.

One solution could be to have some sort of synth query method, which returns volume, pitch, and rate minimums and maximums, and where applicable, hints for what those values mean. For example, espeak could indicate that the pitch minimum and maximum are 0 and 99 respectively, and report the equivalent bounds for rate and volume. The espeak provider could also provide a hint that the rate value is in words per minute.
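To make the idea concrete, here is a rough sketch of what such a query might return. Everything in it is hypothetical: `ParameterRange`, `query_espeak_ranges`, and the `unit_hint` field are names made up for illustration and are not part of libspiel or its D-Bus interface. Only the 0 to 99 pitch range comes from the text above; the rate and volume bounds are placeholders.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ParameterRange:
    """Hypothetical description of one synthesizer parameter."""
    minimum: float
    maximum: float
    unit_hint: Optional[str] = None  # e.g. "words per minute"; None if unknown

    def contains(self, value: float) -> bool:
        """Return True if value lies within the reported bounds."""
        return self.minimum <= value <= self.maximum


def query_espeak_ranges() -> Dict[str, ParameterRange]:
    """Stand-in for a provider-side query method (not a real libspiel call)."""
    return {
        # 0 to 99 is the espeak pitch range mentioned in this comment.
        "pitch": ParameterRange(0, 99),
        # The bounds below are placeholders for illustration only.
        "rate": ParameterRange(80, 450, unit_hint="words per minute"),
        "volume": ParameterRange(0, 100),
    }


if __name__ == "__main__":
    for name, rng in query_espeak_ranges().items():
        print(name, rng.minimum, rng.maximum, rng.unit_hint)
```

A client could then validate user input with something like `query_espeak_ranges()["pitch"].contains(50)` before passing the value on.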

Thoughts, other suggestions, or rejections welcome.

eeejay commented 10 months ago

As with Speech Dispatcher, I think the intention of the API should be to abstract away engine specifics. This is similar to other platforms, and to the rate and pitch attributes in the Web Speech API, which are multipliers.

A libspiel user should have a good idea of what to expect when passing rate and pitch, and they should get a similar result regardless of engine. Since most engines won't give you a words-per-minute definition of their rate, I think we are stuck with a multiplier. A typical English speaker talks at 110-150 words per minute, so you can roughly figure out what a typical voice's WPM output would be and how to adjust.
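As a minimal sketch of the multiplier approach, assuming a provider knows its engine's normal rate and native bounds: the function names and the numbers 175, 80, and 450 in the usage note are illustrative assumptions, not values taken from libspiel or any provider. The 110-150 WPM range is the typical speaker figure mentioned above.

```python
def multiplier_to_native(multiplier, normal, minimum, maximum):
    """Scale the engine's normal value by a multiplier, clamped to its bounds.

    A multiplier of 1.0 means "the engine's normal rate"; 2.0 means twice that.
    """
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    native = normal * multiplier
    return max(minimum, min(maximum, native))


def estimated_wpm_range(multiplier, typical=(110.0, 150.0)):
    """Rough WPM estimate for a multiplier, based on a typical speaker's range."""
    low, high = typical
    return (low * multiplier, high * multiplier)


if __name__ == "__main__":
    # Hypothetical engine: normal rate 175, allowed range 80 to 450.
    print(multiplier_to_native(2.0, normal=175, minimum=80, maximum=450))
    print(estimated_wpm_range(1.5))
```

With this shape, the client only ever deals in multipliers, and each provider keeps its own native scale private.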

I wrote a lot about this, and tweaked Firefox to normalize rate across different platforms. http://blog.monotonous.org/2016/03/13/normalizing-speech-rate/ http://blog.monotonous.org/2016/03/17/benchmarking-speech-rate/

eeejay commented 10 months ago

Just to add more: the current providers are prototypes. I expect a well-implemented provider to normalize pitch and rate correctly.

TheMuso commented 10 months ago

Fair enough, you have had to deal with this more than I have. :)

Ok, fair enough re providers in spiel-demos.