kxxt / aspeak

A simple text-to-speech client for Azure TTS API.
MIT License

Faster audio output/processing #19

Closed Funktionar closed 2 years ago

Funktionar commented 2 years ago

Is it possible to use this for real-time communication? Compared with using Azure directly it's slower, and I use the DeepL API to talk with foreigners. I'd like to get the audio within 200 ms and output it to a sound device, if that's feasible.

kxxt commented 2 years ago

Outputting to the default speaker should be as fast as the demo on the trial page. If you are outputting to an audio file, it's slow.

Funktionar commented 2 years ago

I'm outputting to speakers and it's slower.

Funktionar commented 2 years ago

Should I switch to stream mode?

kxxt commented 2 years ago

How slow is it? I haven't experienced significant delays compared with the demo.

Funktionar commented 2 years ago

third as slow

kxxt commented 2 years ago

I just did a profile:

python -m cProfile -m aspeak -t
         2860752 function calls (2854737 primitive calls) in 34.520 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   34.021   34.021 __main__.py:1(<module>)
        2    0.000    0.000    0.399    0.199 auth.py:1(<module>)
        1    0.000    0.000    0.879    0.879 auth.py:10(_get_auth_token)
        1    0.000    0.000   32.147   32.147 functional.py:11(pure_text_to_speech)
        1    0.000    0.000   33.983   33.983 main.py:122(main)
        1    0.000    0.000    0.948    0.948 main.py:18(read_file)
        1    0.000    0.000    0.000    0.000 main.py:25(preprocess_text)
        1    0.000    0.000   32.147   32.147 main.py:46(speech_function_selector)
        1    0.000    0.000   33.095   33.095 main.py:69(main_text)
        1    0.290    0.290   32.146   32.146 provider.py:36(text_to_speech)
      2/1    0.000    0.000    0.498    0.498 runpy.py:103(_get_module_details)
        1    0.000    0.000   34.520   34.520 runpy.py:199(run_module)
        1    0.000    0.000   34.022   34.022 runpy.py:63(_run_code)
        1    0.000    0.000   31.820   31.820 speech.py:1565(speak_text)
        1    0.000    0.000   29.846   29.846 speech_py_impl.py:6148(speak_text)
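
The listing above is ordered by standard name, which scatters the expensive calls across the table. A minimal sketch, using only the standard-library `cProfile` and `pstats` modules, of sorting by cumulative time instead. `slow_step` and `run` are hypothetical stand-ins for the real synthesis call, not aspeak functions:

```python
import cProfile
import io
import pstats
import time


def slow_step():
    # Hypothetical stand-in for the network-bound synthesis call.
    time.sleep(0.05)


def run():
    # Hypothetical stand-in for the program's entry point.
    slow_step()


profiler = cProfile.Profile()
profiler.enable()
run()
profiler.disable()

# Sort by cumulative time so the most expensive call chains come first.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

On the command line, the `-s cumulative` option of `python -m cProfile` should give the same ordering.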

So the room for optimization is 33.983 - 31.820 - 0.948 - 0.879 = 0.336 seconds. There is almost nothing to optimize, except:

https://github.com/kxxt/aspeak/blob/e3b1b4418f8d33ba4926f8216a235cbf3dd2e996/src/aspeak/api/provider.py#L40

We could cache the synthesizer here if you are always using the same parameters for text_to_speech.

kxxt commented 2 years ago

I can provide an API with a cached SpeechSynthesizer in the next version, but I've been very busy recently, so don't expect it to arrive very soon.

You could do it yourself by building your own version of SpeechServiceProvider if you always call text_to_speech/pure_text_to_speech with the same set of parameters.

However, frankly speaking, I don't know by how much the performance will improve.
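
The caching idea can be sketched roughly as follows. This is only an illustration, not aspeak's implementation: `FakeSynthesizer`, `get_synthesizer`, and the `voice`/`audio_format` parameters and their default values are hypothetical stand-ins, and a real version would construct the Azure SDK's synthesizer where the placeholder object is built.

```python
from functools import lru_cache


class FakeSynthesizer:
    """Hypothetical stand-in for an expensive-to-construct synthesizer."""

    instances_created = 0

    def __init__(self, voice, audio_format):
        FakeSynthesizer.instances_created += 1
        self.voice = voice
        self.audio_format = audio_format

    def speak_text(self, text):
        # Placeholder: a real synthesizer would return audio here.
        return f"[{self.voice}/{self.audio_format}] {text}"


@lru_cache(maxsize=8)
def get_synthesizer(voice, audio_format):
    # Build one synthesizer per distinct parameter set and reuse it afterwards.
    return FakeSynthesizer(voice, audio_format)


def text_to_speech(text, voice="en-US-JennyNeural", audio_format="riff-24khz"):
    # Repeated calls with the same parameters skip the construction cost.
    return get_synthesizer(voice, audio_format).speak_text(text)


first = text_to_speech("Hello")
second = text_to_speech("World")
print(FakeSynthesizer.instances_created)
```

Caching of this kind only saves construction work. Since the profile attributes most of the run time to the SDK's `speak_text`, the gain may well be small.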

kxxt commented 2 years ago

Actually I don't think the 200ms delay is realistic.

I opened https://eastus.tts.speech.microsoft.com in a browser and got a 268 ms delay.
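
To compare a figure like this against a local setup, one option is to time how long the first audio chunk takes to arrive. A minimal sketch, where `fake_audio_stream` is a hypothetical stand-in for whatever produces the audio chunks and the sleep durations are arbitrary:

```python
import time


def fake_audio_stream():
    # Hypothetical stand-in: pretend the service takes a moment to respond,
    # then yields audio in chunks.
    time.sleep(0.05)
    yield b"\x00" * 1024
    time.sleep(0.01)
    yield b"\x00" * 1024


def time_to_first_chunk(stream):
    # Measure the delay until the first chunk is available, in milliseconds.
    start = time.perf_counter()
    first_chunk = next(stream)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return elapsed_ms, first_chunk


latency_ms, chunk = time_to_first_chunk(fake_audio_stream())
print(f"first chunk after {latency_ms:.0f} ms ({len(chunk)} bytes)")
```

Replacing the stand-in generator with a real streaming source would show how far a given setup is from the 200 ms target.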

Funktionar commented 2 years ago

Thanks