We should use the CPU on the device to compute the next utterance while the current utterance is being spoken. For detecting the end of an utterance, we can use the same heuristics that are currently in place. There should be a queue for the speak task that decouples the current utterance from the processing. This queue should have a size of 1, and the consumer (the one doing the processing and feeding the audio samples to the TTS service) should block while waiting on the queue. We should examine whether we could use this approach for network voices as well, although the delay there is much smaller. A minimal sketch of that hand-off follows.
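Here is a minimal Kotlin sketch of the size-1 queue, assuming `java.util.concurrent`; the `Utterance`/`ComputedAudio` types and the `synthesize`/`feed` hooks are hypothetical placeholders for the engine's real types, not existing code:

```kotlin
import java.util.concurrent.ArrayBlockingQueue

// Hypothetical types; the real task/audio types live in the engine.
class Utterance(val text: String)
class ComputedAudio(val samples: ByteArray, val isLast: Boolean)

// Size-1 queue: synthesis may run at most one utterance ahead of playback.
val handoff = ArrayBlockingQueue<ComputedAudio>(1)

// Producer thread: computes utterance n+1 on the CPU while utterance n is spoken.
fun synthesisLoop(utterances: List<Utterance>, synthesize: (Utterance) -> ByteArray) {
    utterances.forEachIndexed { i, u ->
        val audio = synthesize(u)                                     // CPU-bound work
        handoff.put(ComputedAudio(audio, i == utterances.lastIndex))  // blocks while full
    }
}

// Consumer thread: blockingly waits on the queue, then feeds the TTS service.
fun playbackLoop(feed: (ByteArray) -> Unit) {
    while (true) {
        val next = handoff.take()   // blocks until the next utterance is ready
        feed(next.samples)          // e.g. hand the samples to synthCb.audioAvailable()
        if (next.isLast) return
    }
}
```

The size-1 queue caps how far ahead synthesis may run: the producer blocks on `put()` once one utterance is buffered, and the consumer blocks on `take()` until audio is ready.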
The challenge is to signal to the TTS service that we have consumed the current utterance even though we haven't. We use the `synthCb.audioAvailable()` call to feed the TTS service with audio. We could, for example, play the first utterance as dummy silence and call `synthCb.done()` directly afterwards so that we get the next utterance and keep the sequence of utterances coming. Then we'd feed the TTS service via `synthCb.audioAvailable()` with the actually computed utterance n-1 while the next utterance is being prepared. We do all of this until we get an end-of-utterance signal from the TTS service. For the second-to-last utterance we don't call `synthCb.done()` right away; instead we wait for the last utterance to be computed, call `synthCb.audioAvailable()` with the last computed utterance, and only afterwards execute `synthCb.done()`, keeping the number of calls to `synthCb.done()` the same as the number of executed callbacks.
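To make the sequencing concrete, here is a sketch of the proposed call order. The `SynthCb` interface is a hypothetical stand-in modeling only the two calls discussed above (a real engine would also call `start()` first and chunk buffers to the callback's maximum buffer size), and `computeAudio` is a hypothetical hook into the actual synthesis:

```kotlin
// Hypothetical stand-in for the synthesis callback; only the two calls
// discussed above are modeled.
interface SynthCb {
    fun audioAvailable(buffer: ByteArray, offset: Int, length: Int): Int
    fun done(): Int
}

// Proposed sequencing, one call per utterance request from the TTS service.
class PipelinedSynthesizer(private val computeAudio: (String) -> ByteArray) {
    private var pending: ByteArray? = null   // audio computed for utterance n-1
    private val silence = ByteArray(640)     // ~20 ms of 16 kHz mono 16-bit PCM

    fun onUtteranceRequested(text: String, cb: SynthCb, isLast: Boolean) {
        val previous = pending
        if (previous == null) {
            // First utterance: play dummy silence, compute the real audio,
            // then call done() so the sequence of utterances keeps coming.
            cb.audioAvailable(silence, 0, silence.size)
            pending = computeAudio(text)
            cb.done()
        } else if (!isLast) {
            // Feed the already computed utterance n-1; the service plays it
            // while we compute utterance n.
            cb.audioAvailable(previous, 0, previous.size)
            pending = computeAudio(text)
            cb.done()
        } else {
            // End of the utterance sequence: flush n-1, wait for the last
            // utterance to be computed, flush it too, and only then call done(),
            // keeping the number of done() calls equal to the number of requests.
            cb.audioAvailable(previous, 0, previous.size)
            val last = computeAudio(text)
            cb.audioAvailable(last, 0, last.size)
            pending = null
            cb.done()
        }
    }
}
```

Note this sketch computes synchronously; the overlap comes from the service playing the fed audio while we compute the next utterance. A real implementation could instead call `synthCb.done()` immediately and do the computation on a separate thread, e.g. via the size-1 queue sketched above.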
This has to be tested thoroughly, covering all possible combinations and the error situations that can occur in between.
If we did that, we would probably be working against the protocol and could not implement e.g. #96. We shouldn't introduce speed increases at the cost of reliability/usability.