ikbencasdoei opened 3 years ago
cc @lyuma as they implemented AudioEffectCapture in https://github.com/godotengine/godot/pull/45593.
It would be helpful if you could try decreasing the Output Latency in the Project Settings, but this setting won't be available on Windows until https://github.com/godotengine/godot/pull/38210 is merged.
It's very much an issue with the AudioStreamGenerator, because when using a regular AudioStreamPlayer I did not experience this delay. The AudioEffectCapture has been working great so far.
This reply got a bit lengthy. TL;DR: I believe that the code in your reproduction project is responsible for the delay, and this behavior does not indicate a bug in Godot.
But you've pretty much hit what makes writing real-time audio code so complex and challenging, so I'll go into detail on the problem you ran into and some possible solutions (I'm sure there are other approaches, too).
So the reason for that latency is the combination of two things. First, this code:
func _process_input():
	for i in range(_playback.get_frames_available()):
		if _receive_buffer.size() > 0:
			_playback.push_frame(Vector2(_receive_buffer[0], _receive_buffer[0]))
			_receive_buffer.remove(0)
		else:
			_playback.push_frame(Vector2.ZERO)
This code in the sample project is deliberately filling up the playback buffer. This is technically fine in terms of quality, but it does guarantee you incur maximal latency.
Second, the default buffer length defined in AudioStreamGenerator's constructor:
AudioStreamGenerator::AudioStreamGenerator() {
	mix_rate = 44100;
	buffer_len = 0.5;
}
The combination of the way you use it and the default buffer size ensures that you will always incur a latency of half a second. I suspect this is what you are observing (in a VoIP round-trip test, you'll see this on both ends of the connection, so you might notice a whole second of latency, depending on how you are testing).
The godot_speech GDNative plugin has been successfully running with a smaller buffer length of 0.1, as follows:
var generator: AudioStreamGenerator = AudioStreamGenerator.new()
generator.set_mix_rate(48000)
generator.set_buffer_length(0.1)
Actually, the application we're working on is still using the above code at 0.1 seconds, with a similar loop to your example project, and it's "good enough" for now: while 100 ms of delay in VoIP is "acceptable", it's not as good as we can do.
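For context, a generator configured like that is typically hooked up along these lines. This is just a sketch, not the plugin's exact code; the node setup and the _playback member are illustrative assumptions:

func _setup_voice_player():
	var player := AudioStreamPlayer.new()
	player.stream = generator
	add_child(player)
	player.play()
	# The playback object is what push_frame() / push_buffer() are called on.
	_playback = player.get_stream_playback()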
You could stop reading here if you want. Or keep going if you want to know how to do even better.
However, even the above is not perfect. One thing that can be done instead, to determine the delay dynamically, is to write no frames until the AudioStreamGeneratorPlayback reports skips, as a way to learn that the playback thread has started processing. In your demo, it would be:
var skips = _playback.get_skips()
if skips < 1:
	return
However, since this code fills up the buffer on every _process tick, it would need to be restructured for the skip check to be useful.
One idea would be to use the amount of data available in the capture buffer to determine exactly how much to push to playback. However, this may lead to clicking if you buffer too little data.
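A minimal sketch of that first idea, assuming a _capture member holding the AudioEffectCapture instance (in the networked case you would read from your receive buffer instead):

func _process_input():
	# Push only as much as was actually captured, capped by the free
	# space in the playback buffer, instead of always filling it up.
	var frames = int(min(_capture.get_frames_available(), _playback.get_frames_available()))
	if frames > 0:
		_playback.push_buffer(_capture.get_buffer(frames))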
Another idea is to make everything time-based: every _process, check what time it is and compute how many frames should have been inserted since the last call to _process. You'd still need to track skips and make sure there is enough data in the buffer so you don't underrun.
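A rough sketch of that time-based variant, assuming _mix_rate and _last_push_usec member variables that are not in your project:

var _mix_rate := 48000
var _last_push_usec := 0

func _push_due_frames():
	var now = OS.get_ticks_usec()
	if _last_push_usec == 0:
		_last_push_usec = now
		return
	# Number of frames that should have played since the last push.
	var due = int((now - _last_push_usec) * _mix_rate / 1000000.0)
	due = int(min(due, _playback.get_frames_available()))
	for i in range(due):
		if _receive_buffer.size() > 0:
			_playback.push_frame(Vector2(_receive_buffer[0], _receive_buffer[0]))
			_receive_buffer.remove(0)
		else:
			_playback.push_frame(Vector2.ZERO)
	# Advance by the amount actually pushed to avoid drifting over time.
	_last_push_usec += int(due * 1000000.0 / _mix_rate)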
However, with these modifications, you're still tied to the game framerate. You will necessarily have additional delay if your game skips frames for any reason, and that delay will end up in your playback or capture buffers unless you have a means to clear it out.
Finally, this brings me to what I would ultimately recommend doing: use a thread to drive the AudioStreamGeneratorPlayback and the AudioEffectCapture buffers.
This thread can be decoupled from the main thread's processing loop, and therefore allows processing at a fixed interval.
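As a rough sketch of such a thread (the bus name, the 10 ms interval, and the variable names are assumptions for illustration, and the shutdown handling is simplified):

var _capture
var _thread := Thread.new()
var _running := true

func _ready():
	# Assumes the AudioEffectCapture is the first effect on a bus named "Record".
	_capture = AudioServer.get_bus_effect(AudioServer.get_bus_index("Record"), 0)
	_thread.start(self, "_audio_loop")

func _audio_loop(_userdata):
	while _running:
		var frames = int(min(_capture.get_frames_available(), _playback.get_frames_available()))
		if frames > 0:
			_playback.push_buffer(_capture.get_buffer(frames))
		# Fixed ~10 ms interval, independent of the game's frame rate.
		OS.delay_msec(10)

func _exit_tree():
	_running = false
	_thread.wait_to_finish()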
With this approach, use get_playback_position() ((BUG: This is not exported to GDScript)) or get_skips() to determine when playback has started, and get_frames_available() on the AudioEffectCapture to determine when capture has started.

Thank you so much for this explanation!
@lyuma We should probably amend the documentation and/or class reference for this 🙂
Godot version: 3.2.4-rc4
OS/device including version: Windows 10
Issue description: I'm currently developing a new version of my godot-voip plugin which uses the new AudioEffectCapture for real-time voice input (see: https://github.com/casbrugman/godot-voip/pull/7). To take advantage of this I switched from a regular AudioStreamPlayer to an AudioStreamGeneratorPlayback to play the voice input in real time. However, this introduced a significant amount of latency, even when used locally. This does not seem like expected behavior, and I'm not sure what causes it or what can be done about it.
Steps to reproduce: Push audio frames into an AudioStreamGeneratorPlayback.
Minimal reproduction project: godot-voip-e119e3bcadf98e37f0de2e3e3d1bfa4bba59dd7e.zip