AnalyzerNode API - Githubissues

b-ma commented 1 year ago

Hey,

Trying to wrap the Analyzer I just ran on the fact that get_float_frequency_data and get_float_time_domain_data are both waiting for a Vec while it seems to me that everywhere else the spec declare a Float32Array we used a &mut [f32], cf. AudioBuffer::copy_from_channel for example.

Is there any reason for that ? I don't really see why we couldn't use the same API here

orottier commented 1 year ago

The reason is that I think the compiler will not allow you to ship the reference to another thread. I might be wrong though, I could have another look. Using unsafe, it will definitely be possible. Cast to a pointer, ship it to the thread and await a signal that the pointer was written to.

All in all I'm not sure if the current system is the best way to do it. It is now implemented as a ping-pong from the node to the processor. Whenever you request time/freq data, a message is sent to the render thread. The render thread fills the buffer in the next render quantum and ships it back. It does avoid any allocations and needless copies, so it is very efficient from the perspective of the render thread. However, you cannot make more than 1 call per render quantum (otherwise it will block the control thread), and also the control thread might block for up to a millisecond to wait for the next render quantum

b-ma commented 1 year ago

Yup I see,

I just had a look to the chromium source code and all analysis seems to be performed in the control thread (which really makes sens), see the DCHECK(IsMainThread()); check here

Maybe we could kind of reverse the logic by doing something like this?

render thread downmix input
render thread somehow makes a copy of the downmix and send it to the control thread
control thread takes care of buffering, etc.
control thread performs analysis on demand with latest received values

This way I guess we could avoid both the unsafe and Mutex problems

I'll try to have a shot on this, see where it goes :)

edit: ... even if I'm realizing maybe this is not doable without memory allocation or the unsafe trick you used edit2: ...arg... really not that simple... edit3: ...maybe some kind of lock free queue would work (I think this actually what Chromium does, from what I understand considering my really poor C++ skills...)

orottier commented 1 year ago

Your plan sounds alright, let's try to work it out later

b-ma commented 1 year ago

I managed to make a small prototype of a kind of lock free ring buffer that I think is thread safe. This is quite low-level and finally rely on unsafe too, but I think it could work and the AnalyserNode would be mostly free (only some memcopy) from an audio thread perspective:

use std::ptr;
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};

const RING_BUFFER_SIZE: usize = 65536; // MAX_FFT_SIZE * 2
const QUANTUM_SIZE: usize = 128;

struct Analyser {
    // keeping this around seems to prevent memory to be released while we use the ptr
    buffer: Arc<[f32; RING_BUFFER_SIZE]>,
    buffer_ptr: *mut f32,
    index: AtomicUsize,
}

// @todo (?) - impl drop to drop the pointer manually

impl Analyser {
    pub fn new() -> Self {
        let mut buffer = [0.; RING_BUFFER_SIZE];
        let buffer_ptr = buffer.as_mut_ptr();

        Self {
            buffer: Arc::new(buffer),
            buffer_ptr: buffer_ptr,
            index: AtomicUsize::new(0),
        }
    }

    // this runs in the audio thread
    pub fn add_input(&self, src: &[f32]) {
        let mut index = self.index.load(Ordering::SeqCst);
        let len = src.len();

        // push src data in ring bufer
        if index + len > RING_BUFFER_SIZE {
            // in our test conditions we can't be there yet
        } else {
            // we have enough room to copy src in one shot
            unsafe {
                let src_ptr = src.as_ptr();
                let dst_ptr = self.buffer_ptr.add(index);

                ptr::copy_nonoverlapping(src_ptr, dst_ptr, len);
            }
        }

        index += len;

        if index >= RING_BUFFER_SIZE {
            index -= RING_BUFFER_SIZE;
        }

        self.index.store(index, Ordering::SeqCst);
    }

    // if we read only below index in control thread we are sure the memory is clean
}

fn main() {
    let analyser = Analyser::new();

    for _ in 0..2 {
        for i in 0..(RING_BUFFER_SIZE / QUANTUM_SIZE) {
            let data = [i as f32; QUANTUM_SIZE];

            analyser.add_input(&data);

            println!("{:?}", analyser.index.load(Ordering::SeqCst));
        }
    }
}

What do you think, should we try to continue this way or do you see something wrong I didn't catch ?

b-ma commented 1 year ago

Hum, actually it doesn't seem to work well, copied values are garbage. That's weird it was working well when doing exactly the same thing without the struct / Arc thing, I will investigate

b-ma commented 1 year ago

I think I maybe found the problem (inspired from https://github.com/utaal/spsc-bip-buffer/blob/master/src/lib.rs#L89), using a Box seems to be the solution so that the pointer stays clean:

    pub fn new() -> Self {
        // inspired from https://github.com/utaal/spsc-bip-buffer/blob/master/src/lib.rs#L89
        // allocated in the stack but done in the control thread
        let mut buffer = Box::new([0.; RING_BUFFER_SIZE]);
        let buffer_ptr = buffer.as_mut_ptr();

        Self {
            buffer,
            buffer_ptr,
            index: AtomicUsize::new(0),
        }
    }

Quite a funny thing

b-ma commented 1 year ago

Ok, ended up with that: https://gist.github.com/b-ma/a0909191089037b9cbebc2f7bd1c8117, which I think should work quite well in our case. From the unit tests I have made, I really don't see what could go wrong as the logic behind is finally quite simple, but maybe I miss something.

Did it in some dummy lib project to really focus on the problem, but from that point I really think adapting the Analyser should be quite straightforward

orottier commented 1 year ago

Hey @b-ma I am very sorry to ruin your party. But reading and writing to the same memory location concurrently is undefined behaviour.. :(

This is what miri has to say:

running 6 tests
test tests::test_add_input_aligned ... error: Undefined Behavior: attempting a write access using <183428> at alloc72886[0x0], but that tag does not exist in the borrow stack for this location
   --> src/lib.rs:82:17
    |
82  |                 ptr::copy_nonoverlapping(src_ptr, dst_ptr, len);
    |                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |                 |
    |                 attempting a write access using <183428> at alloc72886[0x0], but that tag does not exist in the borrow stack for this location
    |                 this error occurs as part of an access at alloc72886[0x0..0x200]
    |
    = help: this indicates a potential bug in the program: it performed an invalid operation, but the Stacked Borrows rules it violated are still experimental
    = help: see https://github.com/rust-lang/unsafe-code-guidelines/blob/master/wip/stacked-borrows.md for further information
help: <183428> was created by a SharedReadWrite retag at offsets [0x0..0x1000]

I'm sharing your gut feeling that writing to a static location of f32 values should be safe on 64bit architectures, but the compiler says no. Compiler developers call this: there is no such thing as a benign data race. You must use some form of synchronization. An Arc<[AtomicF32]> with Relaxed memory ordering could do and should have reasonable performance characteristics.

But still then, I am not entirely convinced your example will work. There is no reliable way to read the full buffer without risking an intermediate write, and this will result in a garbage FFT.

You could maybe check if there are any crates that attempt to solve this

orottier commented 1 year ago

Please note however, I think you are diving in a very cool domain here. We should explore further. The red line I want to draw is that I want either: a solution in safe rust, or use a crate written by domain experts. We can use your current implementation as a performance baseline.

A safe implementation I can imagine is this:

use an Arc<[AtomicF32]> for the buffer. Make the buffer at least one render quantum size bigger than the max FFT size.
use an Arc<AtomicUsize> for the writer position
The audio threads writes to the buffer. It first writes the chunk, then updates the position value
The control threads will read the buffer like this:
1. read the writer position value.
2. read the required number of buffer values before this position
3. read the writer position again. If it has changed, go back to ii.

b-ma commented 1 year ago

Huhu, no problem I can understand your concerns

But (yes there is a but), I'm still convinced it works (or at least it can / should, whatever idiot compilers say :) ) because there is no possible way you are reading the memory location that you are currently writing even if both processes do it concurrently:

The buffer is max_fft_size * 2
The write index (which is Atomic) is updated only once the memory has been fully updated
When a read occurs (even concurrently) the data that is read is always before the index (which is still the old one if a write is on-going). And the circular buffer is far big enough (as the the maximum amount you can read is fft_size), so that you cannot try to read the memory that is currently written even going backward.

So, from a strictly logical point of view, I really don't see where there could be any problem with corrupted data (except if copy_nonoverlapping writes somewhere we didn't ask to...)

For information, I inspired from these post and code:

https://ferrous-systems.com/blog/lock-free-ring-buffer/ (interesting read regardless our current discussion)
https://github.com/utaal/spsc-bip-buffer (based on the previous post but looks too general purpose for our case)
Chromium source code (almost everything important is located in the realtime_analyser.cc file)

(what is miri (seems like an even more annoying clippy) ? I just updated my rust, I still have no error on compiling and the tests in the gist are still passing on my side)

Just seen your new post, so I continue:

I understand and agree with your concerns that I'm playing with weird stuff that I'm not fully understand here, and that more hardcore low-level expertise would be welcome :)

On the algorithm you describe, almost everything is there (except obviously the first point). The only small other differences I can see are:

Point 2 where the buffer I use is actually far bigger: I just took the Chromium value of max_fft_size * 2 here, but I agree max_fft_size + render_quantum_size should be enough. (... and I just realize that I commented this value to take a smaller one, but it was only for debugging / understanding purposes, cf. // const RING_BUFFER_SIZE: usize = 65536; // MAX_FFT_SIZE * 2)
Point 4.iii which in my opinion does not matter as I really don't see how to do precise things with the AnalyserNode. From my point of view this node is really meant to do visualization / auditing stuff (live spectrogramme, vu meter), so if you are one quantum "late" it's really no big deal (having in mind that screen fps is generally ~16ms). To do precise real-time analysis without having holes or overlap in the audio signal data (e.g. to control other processes in strict real-time), the only way I see is to use the ScriptProcessorNode (yup I know... but it is still pretty handy for that kind of stuff...) or plain full featured AudioWorklet (which is generally too complex to setup and implement, at least in prototyping phase)

In any case, I'm perfectly ok to continue on the "safe" Arc<[AtomicF32]> path for now, as I really think the general design of the node would greatly improve. Then, it will be very easy to iterate later on the buffering strategy to improve performances, as it is very well circumscribed.

(I will ask later to more low-level (C, C++) colleagues what they think about the unsafe way just to have more idea on what could go wrong with the implementation I proposed.)

b-ma commented 1 year ago

Hey, I asked a colleague (who already did this kind of stuff in C/C++) about the strategy I proposed for the lock free ring buffer and few things to add to the discussion:

From what I explained, he didn't see anything wrong with the strategy
I was actually wrong saying that there is no possible way to have corrupted data, there is actually one: if the read from the buffer is very (very) long we might end up reading data that is currently written by the audio thread. But as we have MAX_FFT_SIZE = 32768; additional room for the buffering (which represent almost 0.75s of audio) the possibility that it happens in real life should be very very low.
He acknowledged that there is no possible way to make such thing completely safe, we have to assume the risk that the data that we read can be polluted. A possible way to reduce further the risk would be to increase further the RING_BUFFER_SIZE (e.g. MAX_FFT_SIZE 3 or MAX_FFT_SIZE 4).
He was rather sceptical with the Arc<[AtomicF32]> strategy, because to make sure that the data that is read in the control thread are consistent, we just constantly spoil the audio thread. Or to make it closer to reality it was more something like "I don't care that the vu meter or any fancy display is wrong during a frame or two, I care that audio is always very fast..." :)

Maybe, we could ask p Adenot for its insight too

orottier commented 1 year ago

Thanks for sharing your new insights. Interesting stuff.

Let me get very straigth: the 'safe' version with Arc<[AtomicF32]> can still suffer from a corrupt buffer. Indeed, when the FFT thread is somehow stalled for a second, even the 'safe' version will happily trash the buffer you are currently reading, which will blip the FFT display.

What the safe version does guard against is undefined behaviour. Which is a thing we should avoid at all cost. Also, you cannot statistically make undefined behaviour go away. Unlikely undef is still undef. Also, rust targets 16-bit architectures so we should probably not make any assumptions about 'probably safe'.

Let's measure performance with the unsafe version though! I hope relaxed memory access is very performant!

b-ma commented 1 year ago

ok, you get the point: 1. benches are important to know what we are talking about and 2. reading this is important to know what we are talking about :)

orottier commented 1 year ago

The safe version is merged now. Also I added a small render thread CI benchmark for the AnalyserNode which will aid further measurements.

Some more reading for our interest:

Which is what I used for my standpoint "don't go into the UB territory" :)

b-ma commented 1 year ago

The safe version is merged now. Also I added a small render thread CI benchmark for the AnalyserNode which will aid further measurements.

Nice!

orottier / web-audio-api-rs

AnalyzerNode API #253