godotengine / godot-proposals

Godot Improvement Proposals (GIPs)

Overhaul audio spatialization to allow for greater flexibility (occlusion, HRTF, …) #3182

Closed: ellenhp closed this issue 3 years ago

ellenhp commented 3 years ago

Describe the project you are working on

Godot audio

Describe the problem or limitation you are having in your project

I've been trying to build an audio occlusion system for Godot. I don't think it necessarily belongs in core, given that one of Godot's huge advantages is being lean and mean. But currently there's no way to integrate tightly with the audio system other than being in core, so this proposal aims to solve that issue.

Describe the feature / enhancement and how it helps to overcome the problem or limitation

This is a big change and I'd love feedback on ways to make it less big while still accomplishing the same goals.

The core issue is that SPCAP is the only way audio gets spatialized, and there's no way to change that, because there's no way to directly feed samples into the audio system from a script or plugin other than audio stream generators, which are not performant and have high latency. For people who want more than SPCAP can provide, like occlusion, geometric audio, or HRTFs, there aren't any options other than custom engine builds. There is at least one project that has resorted to shimming in Google's Resonance Audio to get HRTF support working. These shims are hard to maintain because they rely on patches that override the behavior of the audio player nodes.

If we decide we need the ability to control spatialization from scripts/plugins, I think we need a way for a script to register a spatialization algorithm with the AudioServer. I propose a new Spatializer class that can be subclassed in script-world to provide this functionality. I'll go into more detail about what it does later, but its job is generally just answering the question "given a listener transform and a sound source position and emission angle, tell me how to mix this audio to the various output channels".
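To make that concrete, here is a minimal sketch of what the script-facing surface could look like. This is purely hypothetical: Spatializer and everything on it is the API proposed in this issue, not anything that exists in Godot today.

```gdscript
# Hypothetical sketch of the proposed API; none of this exists in Godot today.
class_name Spatializer
extends Resource

# Answers "how should this source be mixed?" for one source.
# Implementations configure and update a filter chain (see below) rather
# than touching audio samples directly.
func _spatialize(listener: Transform3D, source_position: Vector3,
        emission_angle: float) -> void:
    pass  # overridden by subclasses
```

A plugin would then hand an instance to the audio system, e.g. via a hypothetical `AudioServer.set_spatializer(MySpatializer.new())`.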

That's great, but now it's up to each Spatializer to support stereo, 3.1, 5.1 and 7.1. Worse, if I build a spatializer that uses information about the 3D world to do occlusion, and someone else builds a spatializer that supports HRTF-based binaural audio, they can't be combined. The problem here is that our internal representation for sound is in its final form, so if we're mixing for a 7.1 system, the entire audio system needs to know about that.

That's where the more controversial component of this proposal comes in: using spherical harmonics (ambisonics) to represent directional sound fields internally. The Spatializer takes information about a sound source and listener position and determines what the sound field from that source will sound like at the listener's position; then a new Decoder class decodes that ambisonic soundfield into whatever channel format the audio driver is using.
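For readers unfamiliar with ambisonics, here is roughly what first-order encoding looks like. This is illustrative math only, assuming AmbiX conventions (ACN channel ordering, SN3D normalization) and the usual ambisonic axis convention of X forward, Y left, Z up; the proposal doesn't pin down the internal format.

```gdscript
# Encode one mono sample arriving from unit direction `dir` (listener space,
# ambisonic axes: X forward, Y left, Z up) into first-order ambisonics.
# ACN ordering is W, Y, Z, X; with a unit direction vector, the sin/cos
# terms for azimuth/elevation reduce to the vector's components.
func encode_first_order(sample: float, dir: Vector3) -> PackedFloat32Array:
    return PackedFloat32Array([
        sample,          # W: omnidirectional part
        sample * dir.y,  # Y: left/right
        sample * dir.z,  # Z: up/down
        sample * dir.x,  # X: front/back
    ])
```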

Decoders for stereo, 3.1, 5.1 and 7.1 will ship with the engine, and HRTF decoders can be added via plugins.
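The matching decode for plain stereo is almost as small: a pair of virtual cardioid microphones pointed left and right. This is a minimal sketch under the same assumptions as the encode above; shipped decoders would presumably be more careful about normalization and speaker layout.

```gdscript
# Decode one first-order ambisonic frame (W, Y, Z, X as above) to stereo
# using left- and right-facing virtual cardioids. Returns (left, right).
func decode_stereo(b: PackedFloat32Array) -> Vector2:
    var w := b[0]
    var y := b[1]  # the left/right axis component
    return Vector2(0.5 * (w + y), 0.5 * (w - y))
```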

Describe how your proposal will work, with code, pseudo-code, mock-ups, and/or diagrams

There are two big components to this that I'll explain in detail here, and I'll hand-wave most of the glue and integration work.

Spatializers take a sound source position, a listener position, and probably a good amount of metadata about the sound source, and figure out how to mix sound from that source's AudioStreamPlayback into an ambisonic soundfield at the listener's position. They can do this by casting rays to determine occlusion and room size, estimating early reflections, whatever. It's all up to the plugin author, and the sky's the limit, because this all happens outside of the audio thread and in script-world. The way they work is by creating an audio effect chain (one chain for each ambisonic channel). The stock SPCAP spatializer would, for example, create a high-shelf effect. Once the effect chain is created, the Spatializer is free to change parameters every frame. This mirrors how SPCAP works.

Critically, spatializers never deal with audio frames themselves. All they do is compose filter chains. Each filter will be implemented in core and should be very performant.
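To illustrate the filter-chain model, here's what a toy occlusion spatializer could look like. Again, everything except the physics query is hypothetical: `Spatializer`, `add_effect`, and `set_parameter` are the proposed surface, and how the spatializer gets hold of the physics space is glossed over.

```gdscript
# Hypothetical sketch; Spatializer, add_effect and set_parameter are proposed API.
class_name OcclusionSpatializer
extends Spatializer

var world: World3D  # assumed to be injected at registration time; glossed over
var _shelf  # handle to the high-shelf effect in each ambisonic channel's chain

func _setup_chain() -> void:
    # Proposed: called once per source to compose the per-channel effect chain.
    _shelf = add_effect("high_shelf")

func _spatialize(listener: Transform3D, source_position: Vector3,
        emission_angle: float) -> void:
    # Proposed: called every frame, outside the audio thread, so raycasts are fine.
    var ray := PhysicsRayQueryParameters3D.create(source_position, listener.origin)
    var occluded := not world.direct_space_state.intersect_ray(ray).is_empty()
    _shelf.set_parameter("gain_db", -12.0 if occluded else 0.0)
```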

Decoders operate under the same sort of principle, but instead of taking a pair of transforms and determining the soundfield at the listener's ear, they take the soundfield at the listener's ear and determine what things should sound like in a given output format. Decoders for standard formats like stereo, 3.1, 5.1 and 7.1 are all fairly simple, but HRTF decoders are a bit more complicated, since they generally involve applying an FIR filter to each ambisonic channel/stereo channel pair. This is how Google's Resonance Audio is able to spatialize sound for many sources at once even on low-end devices: HRTFs are applied only at the end, and all mixing is done in the ambisonic/spherical harmonic representation of the sound.
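The structure of that last step, very roughly: one impulse response per (ambisonic channel, ear) pair, convolve, and sum. The sketch below uses a naive time-domain convolution for clarity (a real decoder would use partitioned FFT convolution on the audio thread), and all names here are illustrative rather than proposed API.

```gdscript
# Mix a block of ambisonic channels down to binaural stereo: convolve each
# channel with its per-ear FIR and accumulate. Assumes left_firs[c] and
# right_firs[c] have the same length for each channel c.
func hrtf_decode(channels: Array[PackedFloat32Array],
        left_firs: Array[PackedFloat32Array],
        right_firs: Array[PackedFloat32Array]) -> Array[PackedFloat32Array]:
    var n := channels[0].size()
    var left := PackedFloat32Array()
    var right := PackedFloat32Array()
    left.resize(n)   # zero-filled
    right.resize(n)
    for c in channels.size():
        var fir_len := left_firs[c].size()
        for i in n:
            var acc_l := 0.0
            var acc_r := 0.0
            for k in fir_len:
                if i - k >= 0:
                    acc_l += channels[c][i - k] * left_firs[c][k]
                    acc_r += channels[c][i - k] * right_firs[c][k]
            left[i] += acc_l
            right[i] += acc_r
    var out: Array[PackedFloat32Array] = [left, right]
    return out
```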

If this enhancement will not be used often, can it be worked around with a few lines of script?

No workarounds for this are possible.

Is there a reason why this should be core and not an add-on in the asset library?

The goal here is to support add-ons. That said, this might just be something that gets punted from core because anyone who needs this kind of audio can probably use their own custom builds. I'd obviously prefer we have a discussion about whether something like this could be included in 4.x eventually because I think audio is important.

I'd also love to have a discussion about less invasive ways of accomplishing audio occlusion, since I think that's the most important piece here for 4.x. Geometric reverb (estimating room impulse responses), HRTF audio, etc., are all nice to have, but I think lots of people need sound to not go through walls in their games; that's pretty common. Perhaps it's common enough to include in core directly, and we can bypass all this nonsense required to control the audio system from script?

Calinou commented 3 years ago

See also https://github.com/godotengine/godot-proposals/issues/2748.

> I'd also love to have a discussion about less invasive ways of accomplishing audio occlusion, too, since I think that's the most important piece here for 4.x. Geometric reverb (estimating room impulse responses), HRTF audio, etc, are all nice to have but I think lots of people need sound to not go through walls for their games. That's pretty common I think.

I'm not sure how common sound occlusion is in published games. My experience tells me it's not that common, even in AAA games. Still, it would be interesting to see how well this can be simulated via a script. For instance, we could smoothly decrease the volume and/or radius of an AudioStreamPlayer3D or apply an effect if a raycast between the listener and the AudioStreamPlayer3D detects a collision. Different surface types would have to be handled using the PhysicsMaterial system if needed, and specific surfaces could be excluded from the occlusion check using physics layers and masks.
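That script-only workaround is easy to prototype. A minimal sketch, assuming Godot 4's physics API and ignoring surface types and physics masks for brevity (the dB values and fade speed are arbitrary):

```gdscript
extends AudioStreamPlayer3D

# Minimal occlusion sketch (Godot 4 API assumed). The active camera stands in
# for the listener; surface types and physics masks are ignored for brevity.

@export var occluded_db := -18.0  # extra attenuation when line of sight is blocked
@export var fade_speed := 8.0     # how quickly the volume interpolates

var _base_db := 0.0

func _ready() -> void:
    _base_db = volume_db

func _physics_process(delta: float) -> void:
    var camera := get_viewport().get_camera_3d()
    if camera == null:
        return
    var ray := PhysicsRayQueryParameters3D.create(global_position, camera.global_position)
    var hit := get_world_3d().direct_space_state.intersect_ray(ray)
    var target_db := _base_db + (occluded_db if hit else 0.0)
    volume_db = lerpf(volume_db, target_db, fade_speed * delta)
```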

HRTF audio is also controversial among gamers, even in competitive games where it's supposed to help you locate sounds. Therefore, I wouldn't hold my breath for having HRTF support in core either. It's the motion blur of audio, if you prefer. The feature is technically superior, but lots of people will dismiss it regardless 🙂

ellenhp commented 3 years ago

> HRTF audio is also controversial among gamers, even in competitive games where it's supposed to help you locate sounds. So I wouldn't hold my breath for it happening in core either. It's the motion blur of audio, if you prefer 🙂

No HRTF in core. Hooks for HRTF, maybe, but I mostly use Godot for Flash game-quality HTML5 exports, so file size is paramount. :)

> I'm not sure how common sound occlusion is in published games. My experience tells me it's not that common, even in AAA games. Still, it would be interesting to see how well this can be simulated via a script. For instance, we could smoothly decrease the volume and/or radius of an AudioStreamPlayer3D or apply an effect if a raycast between the listener and the AudioStreamPlayer3D detects a collision. Different surface types would have to be handled using the PhysicsMaterial system if needed, and specific surfaces could be excluded from the occlusion check using physics layers and masks.

That's interesting. I'm not much of a gamer, so I don't know how common audio occlusion is either.

Perhaps another approach would be adding a new audio player node from GDNative that can send some of its audio to a "Muffled" bus depending on how occluded it is. A raycast through the physics world is simple enough. Once https://github.com/godotengine/godot/pull/51296 is in, it should be possible to have GDNative nodes that partially divert their audio, assuming we follow up by exposing the playback registration functions to script.
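For reference, the bus half of that idea is expressible with today's API, just not partially: a stock player can only switch buses wholesale. A rough sketch (bus name and cutoff are arbitrary):

```gdscript
# Build a "Muffled" bus with a low-pass effect, then flip an occluded player
# onto it. Stock nodes can only switch buses wholesale; partial diversion of
# audio is what the linked PR would enable.
func _ready() -> void:
    AudioServer.add_bus()
    var idx := AudioServer.bus_count - 1
    AudioServer.set_bus_name(idx, "Muffled")
    var lpf := AudioEffectLowPassFilter.new()
    lpf.cutoff_hz = 600.0
    AudioServer.add_bus_effect(idx, lpf)
    AudioServer.set_bus_send(idx, "Master")

func _on_occlusion_changed(occluded: bool) -> void:
    $AudioStreamPlayer3D.bus = "Muffled" if occluded else "Master"
```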

ellenhp commented 3 years ago

After thinking about it for a bit longer, I do think this won't be necessary once audio stream playback objects can be sent to the audio server directly, bypassing all the audio player nodes. That, combined with the new native extension system, will allow custom audio player nodes to be created that can do whatever the user requires. It's inconvenient to have to switch all of the audio players over to a different node type, but game development is a labor-intensive process, and that isn't even close to being the most tedious part of creating a game with fancy audio. That seems like a decent enough way to control the audio system from script, though. Closing! I am glad I brought this up, because it's good to have discussions. :)