Graph::order_nodes bad performance for large number of nodes

orottier commented 2 years ago

The example/benchmark.rs "Granular synthesis" program has terrible performance. This program adds 1500 buffer source nodes with increasing start time and a gain envelope.

I ran some profiling, which shows time slices

58% Graph::order_nodes
29% AudioNode processing (17% audio params, 7% gain, 4% buffer source)
12% Graph processing overhead

Probably some terrible quadratic complexity is happening in the order_nodes function. (or maybe, we are ordering the nodes over and over again?) Next up: audio param profiling

(open image in new tab to get mouse hover info) flamegraph

b-ma commented 2 years ago

Hey, I did some (very) dirty logging and here are some impressions I have

(or maybe, we are ordering the nodes over and over again?)

I might be mistaking, but it seems to me that the ordering is done according the removal of the nodes from the graph after they finished their processing (which is de facto quite often in the granular bench), maybe a bit more than expected (I counted 1827 where I was waiting for 1500) but that could be due to the fact that related AudioBufferSource and GainNode are not always dropped at the exact same time. In any case it does not appear as if it was done on each tick or something.

But, I discovered something that could be a real problem if I'm not mistaking. I just dirty logged the number of nodes in Graph.order_nodes(), something like:

println!("{:?}", duration);
println!("{:?}", self.nodes.keys().len());
println!("{:?}", self.ordered.len());

which this gives me at the very beginning:

18.803119ms
7506
7506

while the number of nodes in the graph should be > num nodes 3002 (~ 1500 * (src + gain))

After some struggling, it just appeared to me that the AudioParams seems to be stored in Graph::nodes along with the AudioNodes, in which case these numbers seems more logical because GainNode has 1 param (gain) and AudioBufferSourceNode has 2 params (detune and playbackRate). So maybe we are just increasing the complexity the graph ordering by a factor of 2.5 for nothing here. Do you confirm this impression or do you think it's not that relevant regarding the implementation of the algorithm, I can't really figure out?

orottier commented 2 years ago

Thanks for digging in!

Indeed, AudioParams are treated just like AudioNodes, and are therefore part of the audio graph. Params share the same characteristics as nodes: they feed into other nodes, have internal processing state, emit 'samples', and other nodes can feed into them[1]. It is required for them to be part of the graph ordering.

18ms for sorting 7500 items is a long time! Definitely some quadratic complexity going on there, we should investigate.

But I now realize the main issue is probably that we re-sort the graph every time a node (and its params) is dropped. This is not necessary, because the topological order will never change[2] when a node disappears. Hence this code

        // If there were any nodes decomissioned, clear current graph order
        if nodes_dropped {
            self.ordered.clear();
        }

Should be changed to just selectively remove the dropped nodes from the ordered list (which is O(N), so fine).

[1] https://webaudio.github.io/web-audio-api/#dom-audionode-connect-destinationparam-output [2] I need to verify this, because a dropped node could remove a cycle and then the sorting does change. This is really an edge case though and probably not possible

orottier commented 2 years ago

I also realize we definitely need to look into 'inactive' audioparams. As you can see from the flame chart, AudioParam 'rendering' takes a lot of time. If we know an AudioParam only starts tweening at time T, can we short-circuit the results for t < T and not fill any buffers and stuff?

So in short, let's fix:

dropping a node should not cause a full sort
quadratic complexity of the sorting algo
optimize inactive audio params

My guess is that 1) already fixes the bad benchmark result

b-ma commented 2 years ago

ok, for AudioParams that was kind of my impression but better to be sure.

few remarks:

dropping a node should not cause a full sort

Yup indeed, that will make a huge difference in the bench because the graph will be sorted only once in that case. The bias would thus be become, compared to an online version, that nodes are only inserted once. But in any case this is really a good thing to do.

quadratic complexity of the sorting algo

I don't think this explain everything, but at first sight a simple thing to do would be to store marked in the nodes themselves, so we do need to go through an entire list of nodes to check that each time, i.e. replace

// Do not visit nodes multiple times
if marked.contains(&node_id) {
      return;
}

with something like:

// Do not visit nodes multiple times
if node.marked == true {
      return;
}

optimize inactive audio params

I don't think we can per se do that because, that would mean the AudioParam would have to know if its node is active or not for this render quantum, while the param must be rendered before its node. But I think we however have simple room for improvement relying on silent buffers:

a. bypass RenderQuantum::add if input is silent, i.e. if the param is not modulated by a signal (which is most of the case and 7,66% in the flamegraph)

Much more related to GainNode optimizations (but gains should be as cheap as possible IMO as you may have many many of them in complex routing):

b. return a silent buffer if param_intrisic_value is 0 and no automation is scheduled, so the gain node just return silence too
c. possibly introduce some "unit buffer" in the same fashion as the silence buffer (i.e. [1.; 128]) so the gain knows it can just forward it's input without modification

I will have a look on a. which is really simple to do, I think and try to advance on inactive audio nodes, for GainNode it really simple to do and improve a lot the bench too

orottier / web-audio-api-rs

Graph::order_nodes bad performance for large number of nodes #129