(de-)allocation in render thread

b-ma commented 1 year ago

Follow up on #353, plus few ideas:

[ ] Implement Drop to drop AudioBuffer and Vec cleanly in renderers (cf. Otto's comment)
- [ ] AudioBufferSourceRenderer::buffer
- [ ] ConvolverRenderer::buffer
- [ ] WaveshaperRenderer::curve
- [ ] BiquadFilter::{x1, x2, y1, y2} -> consider using ArrayVec instead?
- [ ] DelayRenderer::ring_buffer
- [ ] DynamicsCompressorRenderer::ring_buffer
- [ ] IirFilterRenderer::{norm_coefs, states} -> consider using ArrayVec instead, we can probably clamp eveything using MAX_CHANNELS and the max number of coefs (i.e. 20) ?
[x] Use ArrayVec instead of Vec in AudioParam (this is already a dependency so that's cool, and except for resize this is mostly a drop in) #363
[ ] Implement optimizations proposed there, especially the second one as most of the time params are not modulating (or are not modulated by) an incoming signal (ok, this one is a bit out of context... :)
[ ] Install https://github.com/Windfisch/rust-assert-no-alloc to make sure our render thread is clean

b-ma commented 12 months ago

Hey,

Following the discusion here: https://github.com/orottier/web-audio-api-rs/pull/353#issuecomment-1705553704 , I had a look on how we could reuse the garbage collector thread to drop AudioBuffers and co. outside render thread and here are my first impressions:

If we want to be able to give some sender toward the garbage collector to the AudioNodes, it seem to mean that somehow the audio context should be able to deliver it, so that we grab a clone of the sender when we create the renderers.
But with the current implement of the garbage collector, it does not seems possible to implement this pattern because llq is single producer single consumer channel, while we would need a multiple consumer single consumer channel.

What do you think? Should I try to proceed this way and replace the llq channel with a crossbeam-channel?

(aside, maybe we could even use raw std::mpsc instead of crossbeam, cf. https://github.com/rust-lang/rust/pull/93563)

orottier commented 12 months ago

Thanks for sharing your thoughts.

The reason we are using llq is because it offers a channel of unbounded size, without (de)allocating in the render thread. This is not possible with crossbeam. The catch is that with llq we need to ship a pre-allocated node beforehand.

We could try to expose a method in the RenderScope to get a hold of the sender. I will need to time to sketch this out though. It's not trivial.

And maybe the llq message type should be enum, instead of Box<dyn Any>.


enum Garbage {
     Message(Box<dyn Any>),
     AudioBuffer(AudioBuffer),
     // etc
}

b-ma commented 12 months ago

Ok, I see, probably it's best if I let you define the general architecture then

b-ma commented 10 months ago

I have an idea about deallocation which I have the impression could solve several problems with rather low overhead. I didn't write any code yet as maybe I missed something, so I prefer having your opinion before digging into it.

If we take for example the AudioBufferSourceNode and AudioBuffer (pseudo code):

pub fn set_buffer(&mut self, audio_buffer: AudioBuffer) {
    if self.buffer.is_some() {
        panic!("InvalidStateError - cannot assign buffer twice");
    }

    let audio_buffer = Arc::new(audio_buffer);

    self.buffer = Some(audio_buffer);
    // send a clone to the garbage collector and another one to audio thread
    self.registration.garbage_collector.send(Arc::clone(&audio_buffer));
    self.registration.post_message(Arc::clone(&audio_buffer));
}

Then in the garbage collector we would just maintain a list of Arc<AudioBuffer> and check at every loop if Arc::strong_count(&audio_buffer) == 1 to drop them safely, because we would know both the node and renderer have been dropped.

That would mostly imply to modify quite a bit how the garbage collector is created, i.e. in main thread rather than in audio thread. But then I have the impression it could be quite straightforward and could allow to handle several of our issues in a unified way.

What do you think?

cybersoulK commented 10 months ago

i am currently having a lot of loud pops, after about a few hundred source nodes connect and disconnect. is this issue related? (something that it's not getting cleaned up?)

i do clone the AudioBuffer, for each source node.

pop.mp3.zip (internal recording of the bullets sounds, but instead of pop sounds, there are many abrupt absence of audio)

orottier commented 10 months ago

That's possible, but I would reckon unlikely. An AudioBuffer consists of reference counted channel data (shared between control and render thread) so only a single deallocation would take place even for hundreds of nodes added/removed if they play the same sound. My best bet would be the quadratic behaviour of the node graph sorting algo, see also #129 I can try to make a reproducible example locally and do some profiling to be sure.

(By the way thanks for all your reports. As you notice our library still has some rough edges. All the feedback is welcome)

orottier commented 10 months ago

@cybersoulK I just ran a stress test locally and I have no issues running a few hundred AudioBufferSourceNodes in parallel. Perhaps you could open a separate issue with more details. What does the audio graph look like? Are these nodes panned, and with which settings? Do you use a single panner per node per buffer source or do you share the panner? Does the browser handle it well? What does AudioRenderCapacity report?

cybersoulK commented 10 months ago

I will test and report back tomorrow with all of the remaining info.

For now:

Web equivalent does well

One buffer per source node. (Cloned)

Source -> gain -> panner -> destination

(This panner is not shared with anything else)

Cpal backend

Tested on MacOs sonoma m1 pro. Will test windows tomorrow

orottier commented 10 months ago

I have an idea about deallocation which I have the impression could solve several problems with rather low overhead.

Thanks for the new suggestion. I had not considered such a thing (similar to the java GC). Could be interesting!

A few things to note though:

This only solves our GC issue for immutable objects on the heap (such as an AudioBuffer). Mutable objects (such as the audio processor, or the ring buffer of the delay node) still need another solution.
There may be performance overhead because of the additional level of indirection (stack is typically faster)
How do we limit the amount of CPU the garbage collector thread uses? If we scan continuously it may take too much effort. We could scan every N seconds, or use a CondVar to actively notify when the render thread drops.

Mostly due to item 1 I'm not directly convinced this will make our life easier, but feel free to experiment! In the meantime I will try to put a bit more effort in #368

orottier / web-audio-api-rs

(de-)allocation in render thread #359