kcat / openal-soft

OpenAL Soft is a software implementation of the OpenAL 3D audio API.

[Feature Request] Low latency output on Windows via ASIO/WASAPI exclusive #682

Open ThreeDeeJay opened 2 years ago

ThreeDeeJay commented 2 years ago

High audio latency (100ms+) is something that has plagued apps/games on Windows for a long time, yet people rarely notice it, let alone measure it, with a few exceptions like Matt Gore, the HeSuVi developer and Battle(non)sense.

So I've been wondering if ASIO could be implemented in OpenAL Soft directly, since Crystal Mixer already did something like that, but AFAIK it's only capable of virtualizing the multichannel audio mix.

Alternatively, someone even modified OpenAL Soft to use WASAPI in exclusive mode. I've tested it and confirmed it does make a difference (though I've yet to measure it), so I forked it here. Perhaps a flag to switch to exclusive mode in the main branch would be more feasible, since it only seems to require a couple of line edits. It would probably need some improvement, though: it shouldn't be restricted to sample-type=int16, period_size should automatically be set to the lowest value the sound card supports, and there should be minimal/no mixahead or any other bottlenecks, to ensure the lowest possible latency.

Either option should hopefully allow ultra-low latency in the thousands of games that are at least potentially supported by OpenAL Soft, using the sound card's native ASIO driver or ASIO4ALL. I think audio would also be bit-perfect (at least with WASAPI exclusive) since it bypasses the Windows mixer. So perhaps eventually including both would give people options based on their needs.

kcat commented 2 years ago

I'm curious how much of the latency is a result of using shared mode or non-ASIO output. "Button to audio" latency with random games doesn't say much, since it's also including input latency (the time from physically pressing a button to the OS detecting the input, then to the process detecting the input), and logic/frame latency (the time from the process getting input to processing a new logic frame, and from a logic frame to updating audio state, which can be at different rates), and only then getting to the audio latency.

OpenAL Soft itself will add about 50ms on average, given the default 20ms period size and 60ms buffer. Certain post-processors may add a couple more milliseconds (output limiter, UHJ encoder, etc., which will be reported as "Fixed device latency: ..." in the trace log).

According to this page, starting with Windows 10 the default audio engine latency is 1.3ms, plus a 10ms default period size which will get written to the buffer for the hardware. So, adding that all up, there should be about 51.3ms to 71.3ms if there's no other hidden latency anywhere. By changing OpenAL Soft's period size and period count properties, it could be reduced to a period size of 10ms and a 20ms buffer, which would make OpenAL Soft average 15ms, bringing the latency from OpenAL to output to about 21.3ms to 31.3ms, although with a higher risk of underruns.
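The arithmetic in the two paragraphs above can be reproduced with a small model. This is a minimal sketch, assuming OpenAL Soft refills its buffer one whole period at a time (so the queued audio ranges from the buffer size minus one period up to the full buffer), and treating the 1.3ms engine latency and 10ms Windows period from the cited page as fixed additive costs. The function names are illustrative only.

```python
def openal_latency_range(period_ms, buffer_ms):
    """(min, max, average) queued audio in ms when the buffer is topped up
    one whole period at a time (an assumed simplification)."""
    low = buffer_ms - period_ms   # just before a refill
    high = buffer_ms              # just after a refill
    return low, high, (low + high) / 2


def total_latency_range(period_ms, buffer_ms, engine_ms=1.3, os_period_ms=10.0):
    """Add the (assumed fixed) Windows engine latency and default period."""
    low, high, _ = openal_latency_range(period_ms, buffer_ms)
    extra = engine_ms + os_period_ms
    return round(low + extra, 1), round(high + extra, 1)


# Defaults discussed above: 20ms period, 3 periods (60ms buffer).
print(openal_latency_range(20, 60))   # (40, 60, 50.0)
print(total_latency_range(20, 60))    # (51.3, 71.3)
# Reduced settings: 10ms period, 2 periods (20ms buffer).
print(openal_latency_range(10, 20))   # (10, 20, 15.0)
print(total_latency_range(10, 20))    # (21.3, 31.3)
```

Pre-Windows 10 systems would add the extra 11ms (float) or 5ms (integer) mentioned below on top of `extra`.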

Before Windows 10, there's an additional 11ms for floating-point sample streams and 5ms for integer sample streams. APOs may add further latency, but there's no information about whether any are normally used.

Alternatively, someone even modified OpenAL Soft to use WASAPI in exclusive mode, which I've tested and confirmed it does make a difference (tho I'm yet to measure it), so I forked it here https://github.com/ThreeDeeJay/openal-soft-WASAPI-exclusive/commit/9cd722fc9a80181cc9c86db9a0ec86728dafb7a3

Well, one apparent difference is it passes a bad period size to IAudioClient::Initialize (it passes the same size for the buffer and period size, using the buffer size, when the buffer size should be at least twice the period size), sets incorrect values for the OpenAL device's buffer and period size (sets the period size using the buffer size), and doesn't properly pace updates (whenever the mixer thread wakes up, it processes however many samples WASAPI says are available, regardless of whether that has reached the period size yet). It also seems to get the minimum period size before initialization and the buffer size after initialization, but does nothing with them. It's impossible to tell what the device is going to do with regards to buffering/latency.
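As a rough illustration of the pacing point, the sketch below mixes only whole periods and otherwise goes back to waiting. It is a simplified Python model rather than WASAPI code: `plan_mix` is a hypothetical helper, and `padding_frames` stands in for the count of frames still queued, which WASAPI reports through `IAudioClient::GetCurrentPadding`.

```python
def plan_mix(buffer_frames, padding_frames, period_frames):
    """Frames to mix on this wake-up, in whole periods only.

    writable = free space in the buffer; if it is less than one period,
    nothing is mixed and the thread goes back to waiting.
    """
    if not 0 <= padding_frames <= buffer_frames:
        raise ValueError("padding must fit inside the buffer")
    writable = buffer_frames - padding_frames
    return (writable // period_frames) * period_frames


# 1440-frame buffer, 480-frame periods (30ms / 10ms at 48kHz).
print(plan_mix(1440, 1000, 480))  # 0    (only 440 frames free: wait)
print(plan_mix(1440, 900, 480))   # 480  (one full period free)
print(plan_mix(1440, 0, 480))     # 1440 (buffer fully drained)
```

Processing "however many samples are available" instead would correspond to returning `writable` directly, which produces irregular, sub-period updates.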

mirh commented 2 years ago

I have always appreciated ASIO, if only because I had a Xonar sound card, and even for my poor Realtek I had found a *native* driver anyway (also, I think they had made some multiclient driver?). But is there really much of a point in 2022 over a "normal" API like IAudioClient3 in exclusive mode? https://github.com/mumble-voip/mumble/issues/1604

I mean, putting aside that I don't think games are meaningfully hampered by this. Academically speaking, is it worth at least 1ms? Or is it just a relic of another epoch when the windows mixer was called KMixer?

p.s. As far as WDM-KS workarounds go, I believe FlexASIO was the current champ.

mirh commented 2 years ago

Inb4 this is as good as exclusive https://github.com/miniant-git/REAL

EDIT: follow up is here

kcat commented 2 years ago

Inb4 this is as good as exclusive https://github.com/miniant-git/REAL

Not sure that would help too much. That simply forces the audio server/service to use a shorter update period, but the app's buffer size is left unchanged. Unless the app calculates a buffer size based on the device's period size, that would only cause more frequent updates for the same buffer size.

And actually for such cases, that would cause slightly higher overall latency since the buffer won't drain as much before doing another update. If the buffer is 40ms total, for example, the default 10ms period size would mean the buffer would have 30ms filled by the time an update occurs, meaning latency as low as 30ms for anything triggered just before the update; whereas if the period size is forced to 2ms (or whatever it sets), the same buffer will have 38ms filled when an update occurs, meaning latency closer to 38ms for anything triggered just before the update. So with the default period size, latency can vary between 30-40ms, whereas with a "low latency" 2ms period size, latency can vary between 38-40ms, a notably higher average and minimum bound.
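The 30-40ms versus 38-40ms comparison can be checked by splitting each sound's latency into the wait until the next update plus the audio already queued at that update. A minimal sketch, assuming the app keeps its buffer full and refills it once per period, using whole milliseconds only; `latency_range` is a hypothetical helper.

```python
def latency_range(buffer_ms, period_ms):
    """Min/max latency (ms) for a sound triggered at any point between updates.

    A sound triggered `offset` ms after an update waits (period - offset) ms
    for the next update, then plays after the (buffer - period) ms of audio
    still queued at that moment.
    """
    queued_at_update = buffer_ms - period_ms
    latencies = [(period_ms - offset) + queued_at_update
                 for offset in range(period_ms + 1)]
    return min(latencies), max(latencies)


print(latency_range(40, 10))  # default 10ms period -> (30, 40)
print(latency_range(40, 2))   # forced 2ms period  -> (38, 40)
```

Shrinking the period only lowers latency if the app also shrinks its buffer; with the buffer fixed at 40ms, the minimum bound rises.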

In the case of OpenAL Soft, it uses a multiple of the period size to stay close to its internal 20ms update size (or whatever period_size is set to), with a total buffer that's 3 times the size (or whatever periods is set to). So latency and update granularity should remain somewhat consistent regardless of what that does. It will just be woken up more often to check if there's enough writable space to do a full update, wasting CPU time. That would allow you to set a smaller period size since it won't be limited to a multiple of the 10ms default, instead a multiple of whatever that sets, but it won't do anything on its own.
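To make the "multiple of the period size" behavior concrete, here is a sketch under stated assumptions: the target is 960 frames (20ms at an assumed 48kHz), the update size is rounded to the nearest whole multiple of the device period, and the mixer is assumed to wake once per device period. The exact rounding rule OpenAL Soft uses is not given in the thread, so `choose_update_size` is hypothetical.

```python
def choose_update_size(device_period, target_update=960, periods=3):
    """Round the target update size to a whole multiple of the device period.

    Assumed rule for illustration: nearest multiple (Python's round() breaks
    .5 ties to even), never less than one device period.
    Returns (update_size, buffer_size, wakeups_per_update); the first two
    are in sample frames.
    """
    multiple = max(1, round(target_update / device_period))
    update = multiple * device_period
    return update, update * periods, multiple


print(choose_update_size(480))  # 10ms device period -> (960, 2880, 2)
print(choose_update_size(96))   # e.g. a 2ms period  -> (960, 2880, 10)
print(choose_update_size(144))  # 3ms device period  -> (1008, 3024, 7)
```

With a 2ms device period the update and buffer sizes stay the same, but the mixer thread is woken five times as often per update, which is the wasted CPU time described above.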

ThreeDeeJay commented 2 years ago

Never had luck getting below 10ms with REAL. I should point out that I didn't revert to the Microsoft drivers (which is optional anyway), because I wouldn't want to lose 7.1/5.1 on both my internal and USB sound cards.

Enokilis commented 1 year ago

Well, one apparent difference is it passes a bad period size to IAudioClient::Initialize (it passes the same size for the buffer and period size, using the buffer size, when the buffer size should be at least twice the period size), sets incorrect values for the OpenAL device's buffer and period size (sets the period size using the buffer size), and doesn't properly pace updates (whenever the mixer thread wakes up, it processes however many samples WASAPI says are available regardless if it's at the period size yet). It also seems to get the minimum period size before initialization and the buffer size after initialization, but does nothing with them. It's impossible to tell what the device is going to do with regards to buffering/latency.

I'm the one who made the modifications a long time ago. It was just a quick hack as a proof of concept, and wasn't really meant to be shared, no pun intended.

While not a controlled experiment, I used Wireshark with USBPcap to measure the delta between a mouse click and a response in the audio stream. The advantage of this approach is that the DAC's own latency is factored out, but it assumes Wireshark is precise enough to be useful. Using the lowest period size I could in shared mode, it tended towards 30 milliseconds and up, while in exclusive mode, it was typically around 20 milliseconds. As mentioned before, this is very app-dependent, and some OpenAL game could easily create a delay approaching three digits of milliseconds, so exclusive mode is hardly a panacea.

ThreeDeeJay commented 2 months ago

Following kcat's suggestions here, I was able to compile OpenAL Soft with ASIO output via PortAudio: OpenALSoft+PortAudio+ASIO.zip

However, I'm not sure how to force set buffer size to 64 samples (lowest my sound card can handle in native ASIO apps) for the lowest possible latency because even after setting period_size=64 it keeps resetting to much higher values and the slider in ASIO4ALL gets ignored, so perhaps there's something that I'm missing? 🤔 alsoft_error.txt

On a side note, adding DSOAL+RightMark3DSound.zip (specifically dsound.dll) breaks EFXShow for some reason, and RightMark3DSound crashes too with this build.

kcat commented 2 months ago

However, I'm not sure how to force set buffer size to 64 samples (lowest my sound card can handle in native ASIO apps) for the lowest possible latency because even after setting period_size=64 it keeps resetting to much higher values and the slider in ASIO4ALL gets ignored, so perhaps there's something that I'm missing? 🤔

Currently, the PortAudio backend opens and configures the output stream during alcOpenDevice, whereas the actual ALCdevice configuration isn't handled until alcCreateContext (and alcResetDeviceSOFT, etc.), so the PortAudio stream gets configured with the default properties. One way to fix this would be to recreate the PortAudio stream in PortPlayback::reset, but that risks failing if the device doesn't like being reopened immediately after closing, or if there's trouble getting it working with a usable format, which would leave the device unusable.
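A rough sketch of the reopen-in-reset idea, including the failure case kcat mentions. Only `PortPlayback::reset`, `alcOpenDevice` and `alcCreateContext` come from the thread; `ReopeningBackend`, `StreamConfig` and the open/close callables are hypothetical stand-ins, not PortAudio's actual API. The design choice shown is to try the requested settings and, if that reopen fails, reopen with the previous settings so the device stays usable when possible.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class StreamConfig:
    sample_rate: int
    frames_per_buffer: int


class ReopeningBackend:
    """Hypothetical backend that reopens its stream when the device is reset."""

    def __init__(self, open_stream: Callable[[StreamConfig], object],
                 close_stream: Callable[[object], None],
                 default: StreamConfig):
        self._open = open_stream
        self._close = close_stream
        # Opened up front with default properties (the current behavior).
        self.config = default
        self.stream = open_stream(default)

    def reset(self, wanted: StreamConfig) -> bool:
        """Reopen with the requested config, falling back to the previous one.

        Returns True if the new config is in use. If even the fallback reopen
        fails, the OSError propagates and the device is left unusable.
        """
        if wanted == self.config:
            return True
        self._close(self.stream)
        try:
            self.stream = self._open(wanted)
        except OSError:
            self.stream = self._open(self.config)
            return False
        self.config = wanted
        return True


# Usage with a fake device that rejects buffers smaller than 128 frames.
def fake_open(cfg: StreamConfig) -> StreamConfig:
    if cfg.frames_per_buffer < 128:
        raise OSError("unsupported buffer size")
    return cfg


backend = ReopeningBackend(fake_open, lambda stream: None, StreamConfig(48000, 960))
ok_128 = backend.reset(StreamConfig(48000, 128))  # True
ok_64 = backend.reset(StreamConfig(48000, 64))    # False, stays at 128
```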

On a side note, adding DSOAL+RightMark3DSound.zip (specifically dsound.dll) breaks EFXShow for some reason, and RightMark3DSound crashes too with this build.

Probably a dependency loop. 0xC0000142 is STATUS_DLL_INIT_FAILED, and since PortAudio can use DSound, having OpenAL Soft load PortAudio, which loads DSound/DSOAL, which loads OpenAL Soft, creates a circular loop which causes the DLL to fail initialization.
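The loop kcat describes is a cycle in the load graph. A minimal depth-first search makes it concrete. The graph is a hypothetical encoding of the chain in the comment (the module labels are illustrative), and the second graph models the idea of building PortAudio without DirectSound support.

```python
def find_cycle(graph, start):
    """Depth-first search returning the first load cycle reachable from start."""
    path, on_path, done = [], set(), set()

    def visit(node):
        if node in on_path:                     # back edge: cycle found
            return path[path.index(node):] + [node]
        if node in done:
            return None
        path.append(node)
        on_path.add(node)
        for dep in graph.get(node, []):
            cycle = visit(dep)
            if cycle:
                return cycle
        path.pop()
        on_path.discard(node)
        done.add(node)
        return None

    return visit(start)


deps = {
    "OpenAL Soft": ["PortAudio"],
    "PortAudio": ["dsound.dll (DSOAL)"],        # PortAudio's DirectSound support
    "dsound.dll (DSOAL)": ["OpenAL Soft"],      # DSOAL loads OpenAL Soft
}
print(find_cycle(deps, "OpenAL Soft"))
# ['OpenAL Soft', 'PortAudio', 'dsound.dll (DSOAL)', 'OpenAL Soft']

no_dsound = {**deps, "PortAudio": []}           # PortAudio built without DSound
print(find_cycle(no_dsound, "OpenAL Soft"))     # None
```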

mirh commented 2 months ago

On a separate note, if you have a realtek you should try native asio instead of asio4all.

ThreeDeeJay commented 2 months ago

@kcat I also tried using the latest commit instead of the latest stable version (from 2021) with no luck, and I found a PR that implements ASIO messages and rebuilt the DLL with it, but a buffer size change made via the ASIO4ALL control panel still gets ignored. For example, I set it to request 64 samples (which works with native ASIO apps, at least after a restart), but according to the log it wouldn't change from 1920.

Probably a dependency loop. 0xC0000142 is STATUS_DLL_INIT_FAILED, and since PortAudio can use DSound, having OpenAL Soft load PortAudio, which loads DSound/DSOAL, which loads OpenAL Soft, creates a circular loop which causes the DLL to fail initialization.

I wonder if disabling DirectSound support in PortAudio would get around that issue. It would be interesting to check whether at least apps/games using DirectSound would get low latency via this ASIO route 🤔

On a separate note, if you have a realtek you should try native asio instead of asio4all.

@mirh Sadly, my motherboard's onboard Realtek ALC1150 drivers don't include an ASIO driver, and I've tried the Dell Realtek drivers, even with this installer, but I ran into the same issue, even at 44100hz which some have reported to work more reliably:

[ALSOFT] (II) Created device 02E32960, "OpenAL Soft on Realtek ASIO"
[ALSOFT] (II) Found option frequency = "44100"
[ALSOFT] (II) Found option period_size = "128"
[ALSOFT] (II) Found option stereo-encoding = "hrtf"
[ALSOFT] (II) ALC_MAX_AUXILIARY_SENDS = 2
[ALSOFT] (II) Pre-reset: Stereo, Float32, *44100hz, 128 / 384 buffer
[ALSOFT] (II) Reported stream latency: 0.002979 sec (143.000000 samples)
[ALSOFT] (WW) Failed to set 44100hz, got 48000hz instead
[ALSOFT] (II) Post-reset: Stereo, Float32, 48000hz, 960 / 1920 buffer
[...]
[ALSOFT] (II) Post-start: Stereo, Float32, 48000hz, 960 / 1920 buffer
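The sizes in these log lines appear to be "update / buffer" in sample frames (the requested period_size=128 shows up as "128 / 384"). Converting them to milliseconds shows what the fallback to 48000hz means in time; `frames_to_ms` is just an illustrative helper. Note that the reported 143-sample stream latency matches 0.002979 sec only at 48000 Hz, not at the requested 44100 Hz.

```python
def frames_to_ms(frames, rate_hz):
    """Convert a length in sample frames to milliseconds."""
    return 1000.0 * frames / rate_hz


# Post-reset sizes from the log: 960 / 1920 frames at 48000 Hz.
print(frames_to_ms(960, 48000))             # 20.0
print(frames_to_ms(1920, 48000))            # 40.0
# Reported stream latency: 143 samples. At 48000 Hz this matches 0.002979 sec;
# at the requested 44100 Hz it would be about 3.243 ms instead.
print(round(frames_to_ms(143, 48000), 3))   # 2.979
print(round(frames_to_ms(143, 44100), 3))   # 3.243
```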

It also suffers from other bugs, like inputs/outputs randomly disappearing and some apps reporting a single, high buffer size such as 888 or 960. Using Creative's generic ASIO drivers on my X-Fi is even worse, so I just use ASIO4ALL, which just works™️ 99% of the time.

ThreeDeeJay commented 2 months ago

I spy with my little eye 👀 https://github.com/kcat/openal-soft/commit/aafaf6c6669da366e1833d23f677ce81e4b7dda8 Good news, now it's reporting much lower buffer size, though not quite the lowest. I specified period_size=64 and periods=2 but ASIO4ALL still refuses to go below 128 for some reason. alsoft_error.txt

[ALSOFT] (II) Pre-reset: Stereo, Float32, 48000hz, 64 / 128 buffer
[ALSOFT] (II) Post-reset: Stereo, Float32, 48000hz, 64 / 172 buffer
[ALSOFT] (II) Post-start: Stereo, Float32, 48000hz, 64 / 172 buffer
[ALSOFT] (II) Pre-reset: Stereo, Float32, 48000hz, 64 / 128 buffer
[ALSOFT] (II) Post-reset: Stereo, Float32, 48000hz, 64 / 176 buffer
[ALSOFT] (II) Post-start: Stereo, Float32, 48000hz, 64 / 176 buffer


mirh commented 2 months ago

Did you try with some older drivers? ALC1150 is at least a decade old (UAD in particular is pretty delicate) Or maybe with the hacked ones on TPU.

kcat commented 2 months ago

Good news, now it's reporting much lower buffer size, though not quite the lowest. I specified period_size=64 and periods=2 but ASIO4ALL still refuses to go below 128 for some reason. alsoft_error.txt

It can't go less than 128 with an update size of 64. Some samples need to be playing while new samples are being generated, which is accomplished with double-buffering, and 64x2 = 128. Though it looks like it's not going lower than 176 (~3.6ms), which is x2.75. That could be a limit of PortAudio, to ensure there's enough time to call for more audio before underrunning, but OpenAL Soft is only asking for 128-sample latency for double-buffering, and is getting back 176.
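The figures kcat quotes can be checked against both logged buffer sizes. A minimal sketch assuming 48000 Hz and a 64-frame update, as in the logs; `describe_buffer` is a hypothetical helper. The 176-frame case exceeds plain double-buffering by exactly 48 frames, which is 1ms at 48kHz.

```python
def describe_buffer(actual, period=64, rate_hz=48000):
    """Compare a reported buffer size (frames) against plain double-buffering."""
    requested = 2 * period                       # 64 x 2 = 128 frames
    ratio = actual / period                      # buffer size in periods
    ms = round(1000 * actual / rate_hz, 2)       # buffer length in milliseconds
    excess = actual - requested                  # frames beyond double-buffering
    return ratio, ms, excess


for actual in (172, 176):  # buffer sizes from the Post-reset lines above
    print(actual, describe_buffer(actual))
# 172 (2.6875, 3.58, 44)
# 176 (2.75, 3.67, 48)
```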

ThreeDeeJay commented 2 months ago

Did you try with some older drivers? ALC1150 is at least a decade old (UAD in particular is pretty delicate) Or maybe with the hacked ones on TPU.

@mirh Any idea if those drivers perform any differently than ASIO4ALL? 🤔 It seems a bit tedious and unsafe if it might also require disabling driver signature enforcement to install modified drivers. I even had to revert the R2.83 drivers released earlier this year because the 7.1 surround configuration was missing, so I went back to R2.82 from like 2017 lol

It can't go less than 128 with an update size of 64. Some samples need to be playing while new samples are being generated, which is accomplished with double-buffering, and 64x2 = 128. Though it looks like it's not going lower than 176 (~3.6ms), which is x2.75. That could be a limit of PortAudio, to ensure there's enough time to call for more audio before underrunning, but OpenAL Soft is only asking for 128-sample latency for double-buffering, and is getting back 176.

@kcat I noticed there may be a pattern here:

So if my guess is right and ActualBuffer = (period_size * 2) + (frequency/1000), then given ActualBuffer = 64 and frequency = 48000, period_size would need to be 8, but that's way below the acceptable values.

Math:

$$
\begin{aligned}
(\texttt{period\_size} \times 2) + \frac{48000}{1000} &= 64 \\
\texttt{period\_size} + \frac{48000/1000}{2} &= \frac{64}{2} = 32 \\
\texttt{period\_size} + 24 &= 32 \\
\texttt{period\_size} &= 32 - 24 = 8
\end{aligned}
$$

Alternatively, if I could set periods=1, I could just set period_size=16 so that, with the extra 48 at 48000Hz, it adds up to 64, but periods=1 isn't acceptable either.
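The guessed relation can be solved for period_size directly. This sketch encodes ThreeDeeJay's unconfirmed guess, generalized (an assumption) from a fixed factor of 2 to the `periods` setting, which matches the periods=1 reasoning above; `needed_period_size` is hypothetical and does not reflect how OpenAL Soft or PortAudio actually size buffers.

```python
def needed_period_size(target, rate_hz, periods=2):
    """Solve the guessed relation: actual = period_size * periods + rate / 1000."""
    extra = rate_hz / 1000          # guessed fixed overhead, in frames
    period = (target - extra) / periods
    if period <= 0 or period != int(period):
        return None                 # no positive whole-frame solution
    return int(period)


print(needed_period_size(64, 48000, periods=2))   # 8
print(needed_period_size(64, 48000, periods=1))   # 16
print(needed_period_size(64, 192000, periods=2))  # None
```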

So would it be feasible to lower those limits to compensate for that extra buffering that's added anyway? I wonder if it'd increase CPU usage significantly tho, at least compared to native ASIO at 64 buffer.

Also worth noting that I'm still able to use 64 samples even at up to 192000hz (max supported in general) in other ASIO apps without a single crackle.

mirh commented 2 months ago

I don't have any idea, other than that audio vendors wouldn't have needed ASIO in the first place if WDM-KS had been enough. Anyhow, whatever, it's just an audio driver. Even R2.79 has its admirers.

mirh commented 1 month ago

There is a shaky report that the new W10 low latency mode may get you 3ms latencies, but it's really freaking annoying how no competent developer seems able to independently get it working and confirm it.

p.s. As for the Realtek ASIO driver, I found mixed opinions: one super positive, one neutral (the old version sucks royally while the new one is good, but pretty much the same as WASAPI), and another negative.

ThreeDeeJay commented 1 month ago

<5ms latency in shared mode sounds too good to be true, but then again so did the graphics equivalent (fullscreen optimizations/flip model or whatever it's called), and it really turned out to be a decent middle ground between the performance and latency of exclusive fullscreen, without its inconveniences like not being able to draw regular windows on top of it and non-seamless alt-tabbing. So I wonder if this would be feasible here as well, to reduce inconvenience and extra setup for the end user 🤔

dechamps commented 1 month ago

<5ms latency in shared mode sounds too good to be true

There's really no reason why that shouldn't be possible, but to me the main caveat is this "low latency shared mode" apparently requires explicit support from the audio driver. I don't know if typical drivers offer such support (hopefully at least the Microsoft USB Audio drivers and Realtek drivers do, otherwise that's a huge chunk of the market left unaddressed). I've never really looked into this particular feature.