Faster audio output/processing

Funktionar commented 2 years ago

Is it possible to use this for real-time communication? Compared with using Azure directly it's slower, and I use the DeepL API to talk with foreigners. I'd like to get the audio within 200 ms and output it to a sound device, if that's feasible.

kxxt commented 2 years ago

Outputting to the default speaker should be as fast as the demo on the trial page. If you are outputting to an audio file, it's slow.

Funktionar commented 2 years ago

I'm outputting to speakers and it's slower.

Funktionar commented 2 years ago

Should I switch to stream mode?

kxxt commented 2 years ago

How slow is it? I didn't experience significantly large delays compared with the demo.

Funktionar commented 2 years ago

third as slow

kxxt commented 2 years ago

I just did a profile:

python -m cProfile -m aspeak -t

         2860752 function calls (2854737 primitive calls) in 34.520 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   34.021   34.021 __main__.py:1(<module>)
        2    0.000    0.000    0.399    0.199 auth.py:1(<module>)
        1    0.000    0.000    0.879    0.879 auth.py:10(_get_auth_token)
        1    0.000    0.000   32.147   32.147 functional.py:11(pure_text_to_speech)
        1    0.000    0.000   33.983   33.983 main.py:122(main)
        1    0.000    0.000    0.948    0.948 main.py:18(read_file)
        1    0.000    0.000    0.000    0.000 main.py:25(preprocess_text)
        1    0.000    0.000   32.147   32.147 main.py:46(speech_function_selector)
        1    0.000    0.000   33.095   33.095 main.py:69(main_text)
        1    0.290    0.290   32.146   32.146 provider.py:36(text_to_speech)
      2/1    0.000    0.000    0.498    0.498 runpy.py:103(_get_module_details)
        1    0.000    0.000   34.520   34.520 runpy.py:199(run_module)
        1    0.000    0.000   34.022   34.022 runpy.py:63(_run_code)
        1    0.000    0.000   31.820   31.820 speech.py:1565(speak_text)
        1    0.000    0.000   29.846   29.846 speech_py_impl.py:6148(speak_text)
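
The listing above is ordered by name, which makes the slow call chain hard to spot. A minimal sketch of getting the same data sorted by cumulative time, using only the standard-library cProfile and pstats modules, is below. `slow_synthesis` and `profile_by_cumtime` are hypothetical names for illustration and are not part of aspeak.

```python
import cProfile
import io
import pstats
import time


def slow_synthesis() -> None:
    """Hypothetical stand-in for a speech synthesis call (not aspeak's API)."""
    time.sleep(0.05)


def profile_by_cumtime(func, limit: int = 10) -> str:
    """Run func under cProfile and return a report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func()
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # "cumulative" puts the functions whose call trees took longest first.
    stats.sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_by_cumtime(slow_synthesis))
```

On the command line, adding cProfile's -s option (python -m cProfile -s cumtime -m aspeak -t) should give the same ordering without any extra code.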

34.520s is the total time.
34.021s is the time spent in __main__.py.
33.983s is spent in main() in main.py.
32.146s is spent in the text_to_speech function in provider.py.
0.948s is spent reading from stdin.
0.879s is spent getting the auth token.
31.820s is spent by Azure's speech package doing the actual speech synthesis work, which is out of my control.

So the room for optimization is 33.983 - 31.820 - 0.948 - 0.879 = 0.336s. There is almost nothing to optimize, except:

https://github.com/kxxt/aspeak/blob/e3b1b4418f8d33ba4926f8216a235cbf3dd2e996/src/aspeak/api/provider.py#L40

We could cache the synthesizer here if you are always using the same parameters for text_to_speech.

kxxt commented 2 years ago

I can provide an API with a cached SpeechSynthesizer in the next version, but I've been very busy recently, so don't expect it to arrive very soon.

You could do it yourself by building your own version of SpeechServiceProvider if you are always calling text_to_speech/pure_text_to_speech with the same set of parameters:

Create and store your SpeechConfig and AudioOutputConfig in it.
Cache the SpeechSynthesizer and recreate it using the same config when the token expires.
Modify the text_to_speech and ssml_to_speech methods on your SpeechServiceProvider to use the cached SpeechSynthesizer, and remove the config parameters from the methods.
Call SpeechServiceProvider.text_to_speech(text) or SpeechServiceProvider.ssml_to_speech(ssml) to do speech synthesis. (You can create SSML using the create_ssml function in aspeak.ssml.)
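
The steps above could look roughly like the sketch below. It is not aspeak's actual implementation: the class name, the `make_synthesizer` factory, the injectable `clock` and the 540-second lifetime are all assumptions made for illustration. In real use the factory would build a SpeechSynthesizer from the stored SpeechConfig and AudioOutputConfig.

```python
import time
from typing import Any, Callable, Optional


class CachedSynthesizerProvider:
    """Keeps one synthesizer alive and rebuilds it only after the token expires.

    make_synthesizer is a hypothetical factory; it is expected to return an
    object with speak_text(text) and speak_ssml(ssml) methods.
    """

    def __init__(
        self,
        make_synthesizer: Callable[[], Any],
        token_ttl_seconds: float = 540.0,  # assumed lifetime, adjust as needed
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._make_synthesizer = make_synthesizer
        self._ttl = token_ttl_seconds
        self._clock = clock
        self._synthesizer: Optional[Any] = None
        self._created_at = 0.0

    def _get(self) -> Any:
        # Rebuild on first use, or once the assumed token lifetime has passed.
        expired = self._clock() - self._created_at >= self._ttl
        if self._synthesizer is None or expired:
            self._synthesizer = self._make_synthesizer()
            self._created_at = self._clock()
        return self._synthesizer

    def text_to_speech(self, text: str) -> Any:
        return self._get().speak_text(text)

    def ssml_to_speech(self, ssml: str) -> Any:
        return self._get().speak_ssml(ssml)
```

Passing the factory in keeps the caching logic separate from how the synthesizer is configured, so the same wrapper works regardless of voice or output device.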

However, frankly speaking, I don't know how much the performance will improve.

kxxt commented 2 years ago

Actually I don't think the 200ms delay is realistic.

I opened https://eastus.tts.speech.microsoft.com in a browser and I got a 268 ms delay.
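
To compare against a number like that from your own code, one option is to time how long the first audio chunk takes to arrive. The sketch below assumes the audio comes as an iterable of byte chunks; `first_chunk_latency_ms` and `stream_factory` are hypothetical names, not part of aspeak or the Azure SDK.

```python
import time
from typing import Callable, Iterable


def first_chunk_latency_ms(stream_factory: Callable[[], Iterable[bytes]]) -> float:
    """Return milliseconds from starting the request to the first audio chunk.

    If the stream yields nothing, this is the time until it ended instead.
    """
    start = time.perf_counter()
    for _chunk in stream_factory():
        break  # only the first chunk matters for perceived delay
    return (time.perf_counter() - start) * 1000.0
```

Network round-trip time is included in the result, so a measurement taken far from the service region will be higher regardless of how the client is written.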

Funktionar commented 2 years ago

Thanks

kxxt / aspeak

Faster audio output/processing #19