mackron / miniaudio

Audio playback and capture library written in C, in a single source file.
https://miniaud.io

Sound Layering and Volume Adjustment Example? #26

Closed KTRosenberg closed 5 years ago

KTRosenberg commented 6 years ago

I am trying to learn to use mini_al to layer looping (streamed) background music with one-shot sound effects (e.g. footsteps), but there don't seem to be examples for layering sounds--only opening a single audio file. Also, it is unclear how one would modify volume. I am sure that all of this is doable, but are there more advanced examples floating around?

mackron commented 6 years ago

mini_al is a low level library that cares only about sending and receiving raw audio to and from devices. What you would need to do is your own mixing and volume control in software before sending the audio to device. Mixing is just a matter of summing the samples of your audio sources. Volume control is just a scale.
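In its simplest form (assuming f32 samples), that's something like this:

// Mixing = summing the sources; volume = scaling. Clip to keep the
// sum within range.
void mix_two_sources_f32(float* out, const float* srcA, const float* srcB,
                         unsigned int sampleCount, float volume)
{
    for (unsigned int i = 0; i < sampleCount; ++i) {
        float s = (srcA[i] + srcB[i]) * volume;
        if (s >  1.0f) s =  1.0f;
        if (s < -1.0f) s = -1.0f;
        out[i] = s;
    }
}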

This kind of stuff is sort of separate from mini_al, which is why I haven't yet done an example.

KTRosenberg commented 6 years ago

Okay, that is fair enough. The basic example shows a function reading frameCount frames. I do not see where I would create multiple audio sources that I could process myself since (from my understanding) whatever sound I load is wholly tied to the device --or have I missed a way of registering multiple audio sources and accessing them for processing?

mackron commented 6 years ago

That's where doing your own mixing comes into it. The simple playback example uses just one decoder, but nothing is stopping you from creating two or any other amount. It's just that you will need to read frameCount frames for each decoder and sum (and clip) the samples of each one before sending it to the device.

I will look at doing an example when I get the chance, but it sounds like mini_al, in its current state, may not be the best tool for you right now. It's intended for people who need only raw low level data transmission and would rather do their own mixing. High level mixing APIs are planned for the future but that's a ways away.

KTRosenberg commented 6 years ago

I see. It looked like I could only pass one decoder at any one time, but that's completely wrong. If I understand correctly, I can create my own argument struct and cast it in the callback.

mackron commented 6 years ago

That's correct. When you initialize the device you can specify the pUserData parameter which will be passed to the callback.

mackron commented 6 years ago

Correction: pUserData will be set to the pUserData variable of the device object.
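For example (a rough sketch; the struct and field names are just illustrative):

// Bundle your audio sources into a struct and hang it off the device.
typedef struct {
    mal_decoder* decoders[2];
} callback_args;

// At init time:    device.pUserData = &args;
// In the callback: callback_args* args = (callback_args*)pDevice->pUserData;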

KTRosenberg commented 6 years ago

Great. I think that things are clearer. In the meantime I might still play with something higher level, but I get the sense that I am close to figuring out what I need.

This is approximately what I think I need to do, but I'm just guessing:

-create a temporary buffer per audio source (two for simplicity)
-create an args struct to pass as userdata and cast in the callback

Here is where I begin to be a little unsure:

-create a temporary buffer for each decoder source (probably statically allocated) that is at least sizeof(sample in audio source) * frameCount (is that right?)

then do:

mal_uint32 count0 = (mal_uint32)mal_decoder_read(args->devs[0], frameCount, (void*)audio_buff_0);
mal_uint32 count1 = (mal_uint32)mal_decoder_read(args->devs[1], frameCount, (void*)audio_buff_1);

-use an algorithm on the individual buffers to apply effects such as volume control or reverb
-sum contents of each temporary buffer into one temporary buffer
-copy the final temporary buffer contents to the samples void*.

-alternative: create one temporary buffer, for each audio source, process, copy into the void* arg buffer

Am I on the right track?

Thanks again.

mackron commented 6 years ago

Yep, you're definitely on the right track. Just make sure you mix in the right format (float, int16, etc) and turn the volume down when you first test in case you mess up :)
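Putting it all together, a loose sketch of the callback might look like the following (the signature and field names are approximations from memory based on the simple playback example - double-check them against the header):

// Assumes the device and both decoders are configured for f32.
mal_uint32 on_send(mal_device* pDevice, mal_uint32 frameCount, void* pSamples)
{
    callback_args* args = (callback_args*)pDevice->pUserData; // struct from the earlier sketch

    float bufA[4096];
    float bufB[4096]; // assumes frameCount * channels <= 4096

    mal_decoder_read(args->decoders[0], frameCount, bufA);
    mal_decoder_read(args->decoders[1], frameCount, bufB); // a real version would zero-fill short reads

    float* out = (float*)pSamples;
    mal_uint32 sampleCount = frameCount * pDevice->channels;
    for (mal_uint32 i = 0; i < sampleCount; ++i) {
        float s = bufA[i] + bufB[i];   // mix = sum
        if (s >  1.0f) s =  1.0f;      // clip
        if (s < -1.0f) s = -1.0f;
        out[i] = s;
    }

    return frameCount;
}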

KTRosenberg commented 6 years ago

Okay, last question then: how do I typically get the format I need to use for the final buffer? Is there a field attached to the device struct?

Also, to get the format of my individual audio sources, do I switch/case on a decoder's outputFormat field, or is it something else?

It sounds like mixing different formats could get tricky. I would have to convert into the final format, which, if I'm not mistaken, is one of the main features of this library: conversions. Hopefully I can pass in a sub-buffer somewhere to do the conversions for me.

Also, yes: low volume debugging. :)

mackron commented 6 years ago

mini_al will do all format conversions for you. Pick a format that works best for you (I usually use f32). Set that as the format when you initialize the device. Set the output format for each of your decoders to the same format you specified when initializing the device so everything is consistent. When mixing you can assume everything is in that format you originally specified.
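Something like this, roughly (a sketch - mal_decoder_config_init and the config field names are from memory, and "music.flac" is just a placeholder, so double-check against the header):

// The decoder outputs f32/stereo/44100 regardless of what's in the file;
// mini_al converts internally.
mal_decoder_config decoderConfig = mal_decoder_config_init(mal_format_f32, 2, 44100);
mal_decoder decoder;
mal_decoder_init_file("music.flac", &decoderConfig, &decoder);

// Configure the device with the same format so the callback can assume
// everything is f32:
//     config.format     = mal_format_f32;
//     config.channels   = 2;
//     config.sampleRate = 44100;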

KTRosenberg commented 6 years ago

Thanks again for your patience. Just to follow-up, I ran into a mini stumbling block before when using the following to decrease the volume: (I set the format to float32)

    float factor = 0.5f;

    for (size_t i = 0; i < frames_read; ++i) {
        ((float*)p_samples)[i] *= factor;
    }

I tested with a 16-bit 44.1 kHz 2-channel stereo file, but you said that mini_al does the conversions for me so the 16-bit part shouldn't matter.

Multiplying frames_read by 2 resolved the problem. I think it's related to the number of channels, though I thought that each sample mapped to one of the stereo channels. hmm

In any case, I think that I have basic volume adjustment working now. Would it be helpful to provide a small example based on my own code for use in the repository? (Of course you would verify first).

mackron commented 6 years ago

Sample data is interleaved so your loop needs to take into account your channel count. Your loop needs to be changed to look something like the following:

float factor = 0.5f;

// i walks frames; c walks the interleaved channels within each frame.
for (size_t i = 0; i < frames_read; ++i) {
    for (size_t c = 0; c < channel_count; ++c) {
        ((float*)p_samples)[i*channel_count + c] *= factor;
    }
}
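Equivalently, since interleaved data is just frames_read * channel_count samples laid out back to back, a single flat loop works too:

for (size_t i = 0; i < frames_read * channel_count; ++i) {
    ((float*)p_samples)[i] *= factor;
}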

Going forward, are you able to use email instead for these kinds of questions? I just like to keep the GitHub stuff specific to development rather than user support.

RandyGaul commented 6 years ago

If you do user support here on github, then other users can search and find useful information. It's also a good signal to others that the repo is active :)

KTRosenberg commented 6 years ago

In any case, everything is working now. I do really like this API and wish there were more examples because it deserves more exposure.

I just applied a really basic algorithmic reverb in the callback and it was very simple to do.

However, I realize that to change settings in my user data (read in the callback, written by the main thread), I'd need to apply more thread locking, is that right? That's the only drawback I can think of now.

mackron commented 6 years ago

It depends on your situation, but a lock-free ring/circular buffer is a common technique. It's not super difficult, but I wouldn't call it trivial either, so you will want to do a bit of research.

Writing this comment has got me thinking, though - I really should write a lock-free ring buffer helper API in mini_al...

KTRosenberg commented 6 years ago

Right, that probably is a good idea. For now I'm using an API called the "concurrency toolkit," which has a single-consumer, single-producer FIFO ringbuffer. I'm just going to pass messages to and from the audio thread. I assume it's safest to handle initialization of decoders in the main thread, so that adds a little complexity (or maybe it's safe to load files in the audio thread). Anyway, I'm wondering: would it be helpful to provide a stripped down example of volume control similar to your basic example, now that I have that working?

mackron commented 6 years ago

It's possible to initialize a decoder in the audio thread but I highly discourage it. You should try to keep the timing of the audio thread as stable and efficient as possible.

Don't go to the trouble of writing an example (thanks for your offer!) because I will be doing a mixing example (with volume control) when I get mini_al's mixing APIs in. Also, every example adds a maintenance cost, and they're intended to show people how to use mini_al, not how to do general audio programming (there's other sources of information around the internet for that).

BareRose commented 6 years ago

Was going to post a separate issue for this but it's relevant to mixing so I'm posting it here:

The documentation in mini_al.h's comments says:

Sample data is always little-endian

Is this accurate? If so, why little-endian over system-endian? Assuming it is accurate, doesn't this significantly complicate mixing as endian conversion becomes necessary (at least for big-endian systems) when operations need to be performed on sample data (such as adding them together for mixing)? Or am I missing something here?

mackron commented 6 years ago

Yes, now that you mention it, I think system-endian is technically correct. I've just always assumed LE since that's all I've ever explicitly supported in mini_al. mini_al has never been tested on BE architectures, by the way.

BareRose commented 6 years ago

I did some further digging through your code, and it looks like mini_al only uses regular operations on its values, which don't care about endianness. Endianness only becomes relevant when you deal with binary data from an external source (network, file, etc...). It seems to me that mini_al itself is endian-proof (endian-agnostic, even).

However, your wav loader is a different story. Perhaps I'm misreading the intent of your code, but the various drwav__bytes_to_XX functions look incorrect to me. There's an endianness check in them that isn't actually necessary (counter-intuitive, I know, but that's endianness for you) as bitwise operations always operate on big-endian representations - a fact that can be abused to write endian-proof code without the need for endian detection. Basically just remove the if/else parts from those functions and you should be fine, for example:

if (drwav__is_little_endian()) { //remove
    return (data[0] << 0) | (data[1] << 8); //keep just this line
} else { //remove
    return (data[1] << 0) | (data[0] << 8); //remove
} //remove

With this issue fixed (assuming there aren't similar problems with the loaders for other formats) mini_al should work perfectly fine on big-endian (though it'll be hard to actually test this since big-endian machines are hard to find; Windows doesn't even run on them). To avoid future confusion, the aforementioned comment should be changed to say "system-endian" or simply removed.

mackron commented 6 years ago

RIFF/WAV is always little-endian, thus the byte ordering needs to be shuffled when reading into a big-endian encoded variable. At least, that was what I was thinking when I wrote that. I will study this further when I can.

BareRose commented 6 years ago

Precisely, and the "keep just this" line fully accomplishes that, which is why the conditional is unnecessary (and outright harmful, as it produces incorrect results on BE). Studying the issue further is a good idea either way, though.

mackron commented 6 years ago

I've been pondering this and it just isn't working in my head (maybe I'm just being stupid - quite possible). Let's continue using drwav__bytes_to_u16() as our example just for simplicity. The input to this function is just a byte[2] in little-endian order. What I'm failing to see is how the same byte[2] -> uint16 reconstruction would work the same way on both endiannesses. I understand that you can visualize C/C++ bit shifting as a big-endian operation, but that still doesn't make it work in my head. The whole point of the different endiannesses is that the bytes are stored at different locations within the integer. It's just not adding up in my head how each byte would be shifted into the same locations for both LE and BE and it would just work between the two.

The memory layout of a 16-bit integer looks like the following, right?

Example: uint16 = 0xABCD

Little-Endian
---------------
|  0   |  1   |    <-- Memory location
---------------
| 0xCD | 0xAB |
---------------

Big-Endian
---------------
|  0   |  1   |
---------------
| 0xAB | 0xCD |
---------------

Assuming the above is correct, wouldn't I need to shift the different bytes into different locations within the integer, depending on endian-ness? I feel like I need to see a running example of this... Do you know of a way I could experiment with this on a qemu virtual machine or something?

(I'm not suggesting you're wrong, by the way - just trying to understand everything properly.)

BareRose commented 6 years ago

(I'm not suggesting you're wrong, by the way - just trying to understand everything properly.)

No worries, I'm happy to try and explain this. Took me a while to get it right when I first had the misfortune of dealing with endianness myself.

Your idea of the memory layout is correct.

Bitshifting uses what is essentially big-endian, though strictly speaking it doesn't have any endianness since it usually happens inside registers or similar rather than in memory. Another way to think about it is that bitshifting only sees 16 bits, not 2 bytes, so byte-order is irrelevant since there are no bytes.

In the byte[2] -> uint16 example, you're putting together the 8 bits of the two bytes into a 16-bit number with data[1] making up the 8 most significant bits and data[0] the 8 least significant bits. When the resulting 16-bit number is then stored in memory it'll be stored in the native endianness of the system. In a sense, the effectively big-endian 16-bit number from the bitwise operation is implicitly converted into whatever the system's native endianness is when it gets put into memory because that's how endianness works.

I'm not aware of any way to emulate big-endian. Desktops are virtually all little-endian so you'd have to find some other piece of hardware to test this on.

mackron commented 6 years ago

Yeah I understand, and I should have known better - it's obvious now. What makes it easier for me is to think about the input data as uint16 instead of bytes:

uint16 data16_0 = data[0]; // low byte of the little-endian input
uint16 data16_1 = data[1]; // high byte

return (data16_0 << 0) | (data16_1 << 8);

Clearly the above would work the same across endian-ness. I will get that fixed up soon. Thanks @BareRose!

KTRosenberg commented 6 years ago

Not to go too off-track, is there a way to get this working for more than just primitive types when, for example, serializing/deserializing arbitrary structs to and from binary? I gather that I would have to know the component types ahead of time (char, pointer, int32, padding etc.) and do or-ed bitshifts on different bytes. Is that right?

mackron commented 6 years ago

That is correct. You need to endian-swap each individual member of the structure.
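For illustration (a sketch with a hypothetical struct and helper names), reading a little-endian header endian-proof might look like this:

#include <stdint.h>

// Hypothetical on-disk header; pointers and padding are never serialized.
typedef struct {
    uint16_t id;
    uint32_t length;
} file_header;

static uint16_t read_u16le(const uint8_t* p) {
    return (uint16_t)(p[0] | (p[1] << 8)); // endian-proof, as discussed above
}

static uint32_t read_u32le(const uint8_t* p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static void read_header(file_header* h, const uint8_t* p) {
    h->id     = read_u16le(p + 0); // reconstruct each member individually
    h->length = read_u32le(p + 2); // offsets assume a packed on-disk layout
}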

BareRose commented 6 years ago

Glad to help! Using uint16 and friends for arguments in your endian-swapping methods is probably a good idea.

Really looking forward to mini_al getting mixer functionality even if it's only basic. For now I'll have to use sts_mixer I guess. BTW, making the mixer a separate library might be a good idea, both to avoid bloating mini_al and to allow people the flexibility to use their own backends or mixers respectively. You could call the mixer library "mini_am" for "mini audio mixer".

Clownacy commented 6 years ago

I know beggars can't be choosers, but I'd love to see a standalone mixer library.

mackron commented 6 years ago

I'm mildly surprised at how many people request mixing support. However it's very unlikely that I'll be making it a separate library simply because I would rather just keep it all in one place.

Regarding bloat, keep in mind that you can disable mini_al components with #define MAL_NO_DEVICE_IO and #define MAL_NO_DECODING if you don't need that stuff (and I will do something similar with the mixing API).

mackron commented 6 years ago

@BareRose In case you're interested, I've redefined formats to be system-endian (07410da499fb3eb5a5112d87df1fb486ee82fcbb). Will be in the next release.

BareRose commented 6 years ago

Is there a particular reason you need format constants with BE/LE suffixes? Just calling them MAL_SND_PCM_FORMAT_S16 and such (without the suffix) should be enough since endianness is implicitly system-endian anyway. Or is there some backend that cares about BE vs LE?

I'm mildly surprised at how many people request mixing support.

Probably because of the multithreading. Seems like the multithreading involved in a good mixer API would quickly get complicated, especially if you need an API for stopping currently playing sounds (or moving emitters/listeners while 3D sounds are playing). Maybe I'm just being dumb at multithreading, though.

I would rather just keep it all in one place.

You can do that and still make the mixer a separate library by having it share the same repo. I.e. you'd just have a second header file that usually gets included alongside mini_al but also works on its own.

mackron commented 6 years ago

Is there a particular reason you need format constants with BE/LE suffixes? Just calling them MAL_SND_PCM_FORMAT_S16 and such (without the suffix) should be enough since endianness is implicitly system-endian anyway.

I actually never realized those were available with ALSA. Will use them instead I guess. Thanks! I'm expecting the same applies with a few other backends...

You can do that and still make the mixer a separate library by having it share the same repo.

I'm doing a similar thing with my dr_libs repository, and it's actually really annoying (at least with my workflow) because it complicates versioning and branch management. Also, needing to duplicate little things like mal_min(), mal_uint32 etc. is just annoying - it's just easier if it's all in one place.

By the way, do you guys have some rough ideas on what the mixing API should look like? The most important requirements (off the top of my head) are that it must be simple and should be lock-free.

BareRose commented 6 years ago

Any sane backend is going to have system-endian formats because those are what you want if you're doing any kind of processing at all (such as mixing). Note that not all backends are necessarily sane.

BTW, how would you handle clamping in your mixer API? Just simple hard clamps or something more sophisticated?

Maybe I'll get around to writing my own mixer after all, having looked into it more it really doesn't seem that hard (beyond multithready stuff, for which a bit more practice certainly wouldn't hurt).

mackron commented 6 years ago

I would keep it simple and just do a regular clip. I think with SSE you can do it without branching.

The trick with mixing and multithreading is making sure you don't lock in the audio thread. No mutexes allowed! :)
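For the clip itself, something like this with plain SSE min/max is what I have in mind (a sketch; assumes 16-byte aligned data and a multiple-of-4 sample count):

#include <stddef.h>
#include <xmmintrin.h>

// Branchless clip of f32 samples to [-1, 1] - no compares, no branches.
void clip_f32_sse(float* p, size_t sampleCount)
{
    const __m128 lo = _mm_set1_ps(-1.0f);
    const __m128 hi = _mm_set1_ps( 1.0f);
    for (size_t i = 0; i < sampleCount; i += 4) {
        __m128 s = _mm_load_ps(p + i);
        _mm_store_ps(p + i, _mm_min_ps(_mm_max_ps(s, lo), hi));
    }
}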

KTRosenberg commented 6 years ago

It might be helpful to include some sort of ready implementation of a lock-free / wait-free message queue type so the main thread and audio thread can communicate back and forth. I'm currently using the concurrency toolkit for now. Do you think it would be useful to have a library-specific data structure for that, or does it make more sense to use whatever else we'd like from other vendors?

I'm in favor of anything that helps streamline the process of sending custom commands back and forth--also some basic master volume, panning, fading, etc. for streamed audio + in-memory audio.

BareRose commented 6 years ago

Does C even have portable SSE? Or are you talking about going the inline assembly route? Either way, branch-free clamping sounds very attractive.

If you want to stick to mini_al being a purely low-level toolset for people to build their own higher-level stuff with, you could simply provide all the primitives/utility functions (SSE clamping function, lock-free messaging, stuff KTRosenberg mentioned, etc...) needed to build a custom mixer instead of a full mixer API, then have a simple example that shows how these primitives can be used to make a custom mixer.

Even if you do include a full mixer API you could still expose the lower-level stuff it uses for people to use when they need a custom mixer for something domain-specific.

mackron commented 6 years ago

Yes, all major C compilers have (mostly) portable SSE/AVX/NEON intrinsics. Though some compilers have some very frustrating differences with how you enable support for it at compile time... mini_al already has the beginnings of some SSE2, AVX2 and NEON optimizations in there (including a branchless clamp buried in there somewhere if I remember correctly).

Eventually I want mini_al to be both low level and high level, but for now I'm focusing on getting the low level stuff working well. I intend on exposing every layer of the API, but I don't know exactly what it'll look like just yet (and it's looking like it'll be a long way out - patience!)

BareRose commented 6 years ago

Ended up needing SSE2 intrinsics for the rANS coder I made as a side project. They're surprisingly easy to use once you get used to them. I found this reference very helpful. It doesn't have NEON - I assume that's AMD rather than Intel?

Learning more about SSE has made me really want to make my own mixer now, plus I really need to practice my multithreading anyway. What particular multithreading primitives/patterns would you recommend for a mixer? And do you have any articles/resources you would recommend for these? I've worked with pthreads, semaphores, and simple atomics before, so I'm not a total noob.

You mentioned a lock-free ringbuffer before. I assume you'd implement some kind of command pattern with these where calls to say, a playSound function would post a message saying "play this sound with this gain and pan" to the ringbuffer/queue-thing, then the audio thread consumes these messages in its callbacks and SIMD's together all the different samples and such?

I assume with "branch-free clamping with SSE" you're referring to _mm_adds_epi16? Instinctively 16-bit feels like it's not enough, but I'm not an audio programmer so what do I know. I'd probably want to stick to only the most commonly supported instruction sets, among which I haven't seen any saturated add for more than 16-bit integers. Virtually every modern processor supports SSE2, but SSE3 and up less so. Not sure about AVX, especially with all the sub-variants of AVX-512.
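For example, a saturating s16 mix would be something like this (a sketch; assumes 16-byte aligned pointers and a multiple-of-8 sample count):

#include <emmintrin.h> // SSE2
#include <stddef.h>
#include <stdint.h>

// _mm_adds_epi16 clamps to [-32768, 32767] in hardware, so no explicit
// clipping step is needed after the add.
void mix_s16_sse2(int16_t* dst, const int16_t* src, size_t sampleCount)
{
    for (size_t i = 0; i < sampleCount; i += 8) {
        __m128i a = _mm_load_si128((const __m128i*)(dst + i));
        __m128i b = _mm_load_si128((const __m128i*)(src + i));
        _mm_store_si128((__m128i*)(dst + i), _mm_adds_epi16(a, b));
    }
}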

RandyGaul commented 6 years ago

I’ve got a cross platform mixer in SSE2 using ring buffers on my cute headers repo. You don’t need much synchronization at all. You will probably want to use ring buffers not because they can be implemented lockless (for single producer single consumer), but instead because underlying APIs will expose ring buffer based APIs. You should take a look — sounds exactly like what you want to implement.

mackron commented 6 years ago

The branchless tricks mentioned here are more what I was talking about: https://fgiesen.wordpress.com/2016/04/03/sse-mind-the-gap/

NEON is for ARM. AMD uses SSE/AVX.

+1 for @RandyGaul's comment about ring buffers. I will be exposing a ring buffer API in mini_al at some point.

BareRose commented 6 years ago

I saw cute_sound when I originally searched for a mixer, totally forgot it had SSE2. Main thing I don't like about it is that it seems to have its backends baked in, I'd much prefer a pure mixer so I can use mini_al for backend stuff. In particular having to ship SDL with my linux builds would kinda defeat the purpose of using a lightweight sound library. I suppose I could strip it down and turn it into a backend-agnostic pure mixer, unless you already have something like that that I somehow missed.

Those branchless tricks rely heavily on SSE4.x right? How widely supported are those? They've been around for ages so I suppose they're fine.

mackron commented 6 years ago

Those branchless tricks rely heavily on SSE4.x right?

Not really. In the Branchless “select” (cond ? a : b) section, he has examples that work on all versions of SSE, and then examples of SSE4.1+ which do the same thing but are more efficient.

Keep in mind that this branchless stuff may not actually make a huge amount of difference in practice. It's just something to experiment with that may speed things up a little bit. Indeed, depending on your requirements, you might be able to avoid clipping altogether.

You mentioned a lock-free ringbuffer before. I assume you'd implement some kind of command pattern with these where calls to say, a playSound function would post a message saying "play this sound with this gain and pan" to the ringbuffer/queue-thing, then the audio thread consumes these messages in its callbacks and SIMD's together all the different samples and such?

In the context of audio, when we speak of a ring buffer, we're not talking about a message queue - it's a buffer containing raw audio data. The idea is to have a ring buffer containing your audio data sitting in between your mixer and your device:

Mixer -> [Ring Buffer] -> Device

The mixer writes audio data into the ring buffer at an offset called the "write pointer", and the device reads audio data from the ring buffer at an offset called the "read pointer". The write pointer is always just ahead of the read pointer, far enough ahead that the mixer has enough time to do its work before the read pointer has a chance to catch up. When the mixer writes to the ring buffer it increases the offset of the write pointer. Likewise, when the device reads from the ring buffer it increases the offset of the read pointer. When the pointers reach the end of the buffer, they loop back to the start (hence the name "ring" buffer). With this design, only a single entity is writing to the ring buffer (the mixer) and a single entity is reading from it (the device). Since reading never overlaps with writing (remember, the write pointer is always ahead of the read pointer), and only a single entity is reading, and only a single entity is writing, you can get a completely lock-free data delivery system to/from the device. (Note that in the scenario I described above I've done it from the perspective of playback. When capturing, you'd just reverse your perspective.)
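A minimal single-producer/single-consumer sketch of the idea (this is not mini_al's API; C11 atomics and a power-of-two capacity assumed):

#include <stdatomic.h>
#include <stddef.h>

#define RB_CAPACITY 8192 // in samples; power of two so wrapping is a cheap mask

typedef struct {
    float buffer[RB_CAPACITY];
    atomic_size_t readPos;  // advanced only by the device (reader)
    atomic_size_t writePos; // advanced only by the mixer (writer)
} ring_buffer;

// Writer side. Positions increase monotonically; (pos & (RB_CAPACITY - 1))
// gives the actual index. Returns the number of samples written.
size_t rb_write(ring_buffer* rb, const float* src, size_t count)
{
    size_t r = atomic_load_explicit(&rb->readPos, memory_order_acquire);
    size_t w = atomic_load_explicit(&rb->writePos, memory_order_relaxed);
    size_t space = RB_CAPACITY - (w - r);
    if (count > space) count = space;
    for (size_t i = 0; i < count; ++i) {
        rb->buffer[(w + i) & (RB_CAPACITY - 1)] = src[i];
    }
    atomic_store_explicit(&rb->writePos, w + count, memory_order_release);
    return count;
}

// The reader side (rb_read) mirrors this: acquire-load writePos, copy out
// up to (w - r) samples, then release-store the new readPos.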

With mini_al, the device's audio thread periodically fires a callback to give your program a chance to deliver data to the device (or receive it in the case of capture). This is where you would read from the ring buffer. Where you do your mixing work depends on your requirements and architecture - you could do it in the same thread as the callback, but then you'd need to ensure you never lock (you should avoid locking in the audio thread). You could do mixing in a separate thread, but you still need to ensure you do it quickly enough that the read pointer in your ring buffer never catches up. This is where it gets complicated, where a good quality mixer becomes valuable, and the space I'm hoping to fill with mini_al at some point. This then brings me to your next question:

What particular multithreading primitives/patterns would you recommend for a mixer?

Assess your requirements first, then decide. There's different types of mixers, from simple summing/layering, to 3D positioned voices in games, to hierarchical submixing with special effects, etc. I'm still deciding myself which of these spaces I want to prioritize with mini_al and what the API will look like so I don't really have a recommendation.

Hopefully I explained all of that clearly enough... :)

RandyGaul commented 6 years ago

The mixing in cute_sound is independent of any backends. It’s pulling in bytes from all live sounds as needed, and funnels bytes back to a ring buffer after mixing.

Should be a good reference for making a pure mixer.

BareRose commented 6 years ago

With the message queue, I meant one for communication between the main thread and the mixer thread (where the callback happens). Thanks for the explanation anyway.

Assess your requirements first, then decide.

All I need (for now) is basic summing/layering with gain/pan. Basic gain/pan can do "fake" 3D sound by calculating the correct pan/gain based on listener and emitter positions at playback start, which should be close enough for short sounds, especially if listener and emitters don't move much.

Should be a good reference for making a pure mixer.

I suppose having cute_sound for reference couldn't hurt.

BareRose commented 6 years ago

Quick question/suggestion that came up while working on my mixer: Are the buffers passed to the send/receive callbacks always 16 byte aligned? If this is already the case you could document that as a guarantee so people building SIMD-accelerated mixers atop mini_al can safely write/read directly to/from these buffers with aligned SIMD instructions.

mackron commented 6 years ago

No, unfortunately they are not always guaranteed to be aligned because some backends provide pointers to their internal buffers themselves which mini_al passes to the callbacks directly.

BareRose commented 6 years ago

Ah, that's unfortunate. What would you recommend for writing/reading to/from the buffers instead? Unaligned SIMD or memcpy/memmove from aligned memory? One method would be to find the first aligned address in the buffer, do aligned SIMD from there, then memmove only if the buffer was actually misaligned. The question is whether memmove is even faster than unaligned SIMD; it very well might be when there are sufficiently many layers (and thus many unaligned stores).

Also, when you say "don't lock in the audio thread" you mean the thread calling the send/receive callbacks, right? Based on your explanation of the ring buffer between mini_al's worker thread and the backend device, it should be fine to lock the thread so long as there's still data in the ring buffer; cute_sound locks its thread and that seems to work fine.

mackron commented 6 years ago

I usually just use memcpy() for simple data movement. If you need to do arithmetic then I would do a SIMD optimized implementation when the buffers are aligned and then a scalar implementation for unaligned. You can also do unaligned loads with SSE and AVX, but I've never used them (yet) and I don't know how much overhead they have.

The no-locking thing I'm talking about is in the thread that fires the callback. It's just general guidance - it's not the end of the world if you do short locks that are in your control, but it's bad if you do things like file IO and memory allocations where you never know how long the operating system will wait on its internal locks. Indeed, the work I did in raylib uses a lock in the callback thread and nobody's complained yet, but it's not ideal and I'm going to try and see if I can replace that with a lock-free solution soon.
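Something like this is the kind of dispatch I mean (a sketch):

#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>

// SIMD when the pointer is 16-byte aligned, scalar for the unaligned
// case and for the remainder.
void scale_f32(float* p, size_t sampleCount, float factor)
{
    size_t i = 0;
    if (((uintptr_t)p & 15) == 0) {
        __m128 f = _mm_set1_ps(factor);
        for (; i + 4 <= sampleCount; i += 4) {
            _mm_store_ps(p + i, _mm_mul_ps(_mm_load_ps(p + i), f));
        }
    }
    for (; i < sampleCount; ++i) {
        p[i] *= factor;
    }
}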

BareRose commented 6 years ago

I've never used them (yet) and I don't know how much overhead they have.

Based on my research, unaligned instructions are only consistently slow on fairly old systems; medium-old systems tend to have a fast path so that the unaligned instructions only incur overhead if they're actually given an unaligned address, and newer systems have this fast path along with fairly low overhead even when it isn't taken.

I'm going to try and see if I can replace that with a lock-free solution soon.

One possible lock-free design I came up with is to just use stdatomic and a bunch of CAS loops, but only in outside threads (meaning threads besides the mixer thread). This way the mixer thread can never end up "stuck" on a CAS loop and is wait-free, whereas the outside threads are mostly lock-free - except when there's sufficient contention for the CAS loops to degrade into what might as well be locks. Having to rely on atomics does make certain things more complicated, of course.
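Roughly like this (a sketch with hypothetical names):

#include <stdatomic.h>

#define VOICE_COUNT 32

atomic_uint activeVoices; // bit i set = voice i in use

// Called from outside threads: claim a free voice slot with a CAS loop.
// The mixer thread never loops like this, so it stays wait-free.
int claim_voice(void)
{
    unsigned cur = atomic_load(&activeVoices);
    for (;;) {
        int slot = -1;
        for (int i = 0; i < VOICE_COUNT; ++i) {
            if (!(cur & (1u << i))) { slot = i; break; } // first free bit
        }
        if (slot < 0) return -1; // all voices busy
        if (atomic_compare_exchange_weak(&activeVoices, &cur, cur | (1u << slot)))
            return slot;
        // CAS failed: cur now holds the latest value; loop and retry.
    }
}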