matatonic / openedai-speech

An OpenAI API compatible text to speech server using Coqui AI's xtts_v2 and/or piper tts as the backend.
GNU Affero General Public License v3.0

FR: per voice override for regex preprocessing #54

Open thiswillbeyourgithub opened 3 weeks ago

thiswillbeyourgithub commented 3 weeks ago

Hi,

I noticed that, the way the pre-processing regex is currently implemented, "1-5" is read as "one to five", but this obviously does not work in all languages: in French it would be "un à cinq", not "un to cinq".

Another example is "ex." being read as "for example": French speakers use "ex." too, but read it as "par exemple".

It would be great if we could specify an override file for a given voice, meaning the voice YAML file would accept a new per-voice key like "override_preprocess_file" that would be a string path.

Also a question: can the regex file be modified at runtime, or is it only read at startup? If it can be modified at runtime (which helps for tweaking), then I think it deserves a small sentence in the README.md :)

Have a nice day!

matatonic commented 3 weeks ago

The regex can be modified at run time; no reload is required, as it's loaded on demand for each API call.

A regex per language makes a lot of sense, yes.

Per voice? Is that what you really mean? If so, can you explain why?

thiswillbeyourgithub commented 3 weeks ago

Oh right, I was thinking "per voice of piper", so really a language, I guess. Well, the line is blurry now because of the multilingual piper feature.

matatonic commented 3 weeks ago

Honestly, I kind of hate how much of a hack the regex file is. It doesn't work well for piper AND xtts at the same time either; they each have unique and different flaws. I am very open to suggestions for how to fix it, or for a better system. For now, I am considering simply adding a search for a language-specific regex file, e.g. pre_process_map.fr.yaml
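That search could be sketched as a simple fallback: try the language-specific file first, then the shared default. The function name, the `config_dir` parameter, and the `lang` code (assumed detected elsewhere) are all hypothetical; only the file-naming scheme comes from the proposal above.

```python
import os
from typing import Optional


def find_pre_process_map(config_dir: str, lang: Optional[str]) -> Optional[str]:
    """Return the most specific pre-process map file that exists.

    Tries pre_process_map.<lang>.yaml first (e.g. pre_process_map.fr.yaml),
    then falls back to the shared pre_process_map.yaml. Names follow the
    scheme proposed above; this is an illustration, not the project's code.
    """
    candidates = []
    if lang:
        candidates.append(f"pre_process_map.{lang}.yaml")
    candidates.append("pre_process_map.yaml")
    for name in candidates:
        path = os.path.join(config_dir, name)
        if os.path.exists(path):
            return path
    return None
```

The nice property of this layout is that adding French support is just dropping a pre_process_map.fr.yaml next to the default file; nothing else changes.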

thiswillbeyourgithub commented 3 weeks ago

Well, I don't know the flaws, but I'm thinking a maximally customizable implementation has to be the way to go. You are bound to have to switch to new, better TTS models in the future.

So I'm thinking:

Oh wait, I'm guessing the split happens before the text is sent to the voice, right? If that's the case, then a unique preprocessing YAML containing a list should apply before the split, and then what I said above, but without the before-split list :)

PS: I don't know how much time it takes to load and apply the regex, but since, as you said, it's done at runtime, I do think you could gain some milliseconds by loading the files at startup and running re.compile on each pattern. Then, at query time, reload the dict only if the YAML's modification time has changed. It might gain real time on something like a Raspberry Pi, especially with long regexes in "before split". IIRC, an uncompiled regex can be 10 to 100 times slower than applying str.replace, and I do think each millisecond saved has value for TTS :). Additionally, it would allow checking the validity of the YAML at launch instead of waiting for the first query. Should I create an issue to track this?
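The compile-once, reload-on-mtime-change idea could look roughly like this. The class name is made up, and JSON stands in for YAML here only to keep the sketch dependency-free (the project's map files are YAML and would need PyYAML); the file is assumed to hold a list of [pattern, replacement] pairs.

```python
import json
import os
import re
from typing import List, Tuple


class CachedRegexMap:
    """Compile the substitution map once; reload only when the file changes.

    Sketch of the startup-compile idea above. JSON is used instead of the
    project's YAML purely to avoid a third-party dependency in the example.
    """

    def __init__(self, path: str):
        self.path = path
        self.mtime = None
        self.rules: List[Tuple[re.Pattern, str]] = []

    def _reload_if_changed(self) -> None:
        mtime = os.path.getmtime(self.path)
        if mtime != self.mtime:
            with open(self.path) as f:
                pairs = json.load(f)
            # Compiling up front also validates every pattern at load time,
            # instead of failing on the first query.
            self.rules = [(re.compile(p), r) for p, r in pairs]
            self.mtime = mtime

    def apply(self, text: str) -> str:
        self._reload_if_changed()
        for pattern, replacement in self.rules:
            text = pattern.sub(replacement, text)
        return text
```

A single os.path.getmtime call per request is cheap, so live tweaking of the file keeps working while the re.compile cost is paid only once per edit.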

PS2: there's overhead in creating a subprocess every time piper is called, but there's a way to call it directly as a library. It was very hard to figure out from the repo at the time, but I'm using it in a small script I made a while ago. Alternatively, I'm thinking you could create the subprocess in advance, once and for all, and reuse it. Lots more milliseconds to gain there, possibly?
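The reuse idea is just the classic persistent-worker pattern: spawn the process once and stream requests through its pipes. In this sketch, `cat` stands in for the piper binary, since piper's exact flags and line-streaming behavior depend on the installed version; the class and method names are invented for illustration.

```python
import subprocess


class PersistentWorker:
    """Spawn a line-oriented subprocess once and reuse it across requests.

    `cat` (or any echoing command) stands in for a real TTS process; this
    shows only the reuse pattern, not a working piper client.
    """

    def __init__(self, argv):
        self.proc = subprocess.Popen(
            argv,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,  # line-buffered on our side of the pipe
        )

    def request(self, line: str) -> str:
        # One request = one line in, one line out.
        self.proc.stdin.write(line + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().rstrip("\n")

    def close(self) -> None:
        self.proc.stdin.close()
        self.proc.wait()
```

The per-request cost drops from process startup (fork/exec plus model load) to a pipe round-trip, which is what makes this attractive on low-powered hardware.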

matatonic commented 3 weeks ago

So, my take on your comments is that if I added a new configuration key, available at any level of voice_to_speaker.yaml and merged with any higher-level file, then it should work as you describe.

```yaml
# the default top level will be pre_process_map.yaml, but include it here for example sake:
pre_process_map: pre_process_map.yaml # this one is common to piper and xtts, and represents the current default configuration
tts-1:
  pre_process_map: pre_process_map.tts-1.yaml # this one is for generic 'piper'isms
  alloy:
    pre_process_map: pre_process_map.alloy.yaml # This doesn't make a lot of sense to me, but it could work anyways
    pre_process_map: pre_process_map.tts-1.alloy.yaml # OR: same here, not much purpose here
    en:
      pre_process_map: pre_process_map.tts-1.alloy.en.yaml # specific alloy en piper isms, doesn't make much sense
      pre_process_map: pre_process_map.tts-1.en.yaml # OR: general piper/en specifics
      pre_process_map: pre_process_map.en.yaml # OR: general language specific
```

Even if they don't make much sense, that's how it could work, how people use it is really up to them.
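The "merged with any higher level" rule amounts to a most-specific-wins lookup: walk down the levels (top, model, voice, language) and let each level's pre_process_map, if present, shadow the one above. This function is a sketch only; the dict shape mirrors the example above but is not the project's actual code.

```python
from typing import Optional


def resolve_pre_process_map(config: dict, model: str, voice: str,
                            lang: Optional[str] = None) -> Optional[str]:
    """Return the most specific pre_process_map for a model/voice/language.

    Each deeper level of a voice_to_speaker.yaml-style dict overrides the
    level above it. Illustrative only.
    """
    result = config.get("pre_process_map")
    level = config.get(model, {})
    result = level.get("pre_process_map", result)
    level = level.get(voice, {})
    result = level.get("pre_process_map", result)
    if lang:
        level = level.get(lang, {})
        result = level.get("pre_process_map", result)
    return result
```

So a voice with no override silently inherits the model-level map, which in turn falls back to the shared default, exactly the behavior described above.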

Some limitations to consider: the input to the API is only 'voice' and 'model'. Language is an auto-detected feature which happens after the model has been determined (because the set of languages possible for playback is limited by the model). Because the file is a model/voice config, there isn't a great way to organize it around languages instead. Also, I'm still not very happy with language detection; 99% of users should really disable it, and I may require it to be explicitly enabled before use (the current dev branch isn't merged or released yet). Maybe something like --detect-languages en,es, with the default being none or 'en' but allowing 'any'.

I've often considered switching piper to the python implementation, but the piped process is so fast and streams from a real separate process, which avoids all the python "multi-thread" problems, that I don't think I will bother. It's actually a very efficient, simple pipeline, and the onnx models must be mmap'd into place, so it's essentially instant on most linux systems.

In the same vein, the regex/yaml processing is so far unnoticeable, so I probably won't go out of my way to make it more complex than needed.

If anyone is interested in an efficient, high-performance, GPU-accelerated, high-concurrency implementation of text to speech for large-scale deployment (thousands of users), I'm happily available for paid consultation and would enjoy the challenge :-)

thiswillbeyourgithub commented 3 weeks ago

Regarding language detection, you're the boss. IMHO, what matters is that people who run instances can configure it; the defaults are not critical, IMO, for Docker containers, because it's only addressed at pretty savvy people to start with. So if I have to enable it, that's fine for me. Thanks again for doing it in the first place :)

And about speed, yeah, okay. Piper advertises being able to run on Raspberry Pis, so I'm a little suspicious that what you're saying still holds for low-powered systems if you're just inferring from your experience on desktop computers, but you're probably right. In any case, not having time to spend on those edge cases is understandable, of course. And your code is clean enough to make it easy to PR if I happen to need it.