orottier / web-audio-api-rs

A Rust implementation of the Web Audio API, for use in non-browser contexts
https://docs.rs/web-audio-api/
MIT License

AnalyzerNode API #253

Open b-ma opened 1 year ago

b-ma commented 1 year ago

Hey,

While trying to wrap the Analyser, I just ran into the fact that get_float_frequency_data and get_float_time_domain_data both expect a Vec, while it seems to me that everywhere else the spec declares a Float32Array we use a &mut [f32], cf. AudioBuffer::copy_from_channel for example.

Is there any reason for that? I don't really see why we couldn't use the same API here.
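For illustration, the kind of signatures I have in mind, mirroring AudioBuffer::copy_from_channel (just a sketch, the bodies are obviously placeholders):

// sketch of the API shape only, the node internals are left out
pub struct AnalyserNode;

impl AnalyserNode {
    // the caller provides the destination slice, as in AudioBuffer::copy_from_channel
    pub fn get_float_time_domain_data(&self, dst: &mut [f32]) {
        dst.fill(0.); // would copy the most recent time domain samples into dst
    }

    pub fn get_float_frequency_data(&self, dst: &mut [f32]) {
        dst.fill(0.); // would copy the most recent frequency bin values (in dB) into dst
    }
}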

orottier commented 1 year ago

The reason is that I think the compiler will not allow you to ship the reference to another thread. I might be wrong though; I could have another look. Using unsafe it would definitely be possible: cast to a pointer, ship it to the thread, and await a signal that the pointer has been written to.

All in all I'm not sure if the current system is the best way to do it. It is currently implemented as a ping-pong between the node and the processor: whenever you request time/freq data, a message is sent to the render thread; the render thread fills the buffer in the next render quantum and ships it back. This avoids any allocations and needless copies, so it is very efficient from the perspective of the render thread. However, you cannot make more than one call per render quantum (otherwise it will block the control thread), and the control thread might block for up to a millisecond while waiting for the next render quantum.
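Roughly, the pattern is something like this (sketch only, using std::sync::mpsc as a stand-in for the actual message channel used in the crate):

use std::sync::mpsc;
use std::thread;

fn main() {
    // control -> render: the pre-allocated buffer to be filled
    let (to_render, render_rx) = mpsc::channel::<Vec<f32>>();
    // render -> control: the same buffer, now filled
    let (to_control, control_rx) = mpsc::channel::<Vec<f32>>();

    // "render thread": fills the buffer during the next render quantum, no allocation here
    thread::spawn(move || {
        if let Ok(mut buf) = render_rx.recv() {
            for (i, sample) in buf.iter_mut().enumerate() {
                *sample = i as f32; // pretend these are the current time domain samples
            }
            let _ = to_control.send(buf);
        }
    });

    // "control thread": requests data and blocks until the filled buffer comes back
    to_render.send(vec![0.; 128]).unwrap();
    let filled = control_rx.recv().unwrap();
    assert_eq!(filled.len(), 128);
}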

b-ma commented 1 year ago

Yup I see,

I just had a look at the Chromium source code and all the analysis seems to be performed in the control thread (which really makes sense), see the DCHECK(IsMainThread()); check here

Maybe we could kind of reverse the logic by doing something like this?

  1. the render thread downmixes the input
  2. the render thread somehow makes a copy of the downmix and sends it to the control thread
  3. the control thread takes care of buffering, etc.
  4. the control thread performs the analysis on demand with the latest received values

This way I guess we could avoid both the unsafe and Mutex problems

I'll try to have a shot at this, see where it goes :)

edit: ...even if I'm realizing this is maybe not doable without memory allocation or the unsafe trick you used
edit2: ...arg... really not that simple...
edit3: ...maybe some kind of lock free queue would work (I think this is actually what Chromium does, from what I understand given my really poor C++ skills...)
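Very roughly, something like this for steps 2-4 (sketch only, with std's sync_channel standing in for a proper lock free queue):

use std::sync::mpsc::{sync_channel, TrySendError};

const QUANTUM_SIZE: usize = 128;

fn main() {
    // bounded channel so the render thread side never blocks (capacity is arbitrary here)
    let (tx, rx) = sync_channel::<[f32; QUANTUM_SIZE]>(16);

    // render thread side: copy the downmixed quantum and try to ship it, drop it if the queue is full
    let downmix = [0.; QUANTUM_SIZE];
    if let Err(TrySendError::Full(_)) = tx.try_send(downmix) {
        // the control thread is late, skip this quantum
    }

    // control thread side: drain whatever arrived into its own ring buffer, then analyse on demand
    let mut ring = vec![0.; 32768];
    let mut write_index = 0;
    while let Ok(block) = rx.try_recv() {
        for sample in block {
            ring[write_index] = sample;
            write_index = (write_index + 1) % ring.len();
        }
    }
}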

orottier commented 1 year ago

Your plan sounds alright, let's try to work it out later

b-ma commented 1 year ago

I managed to make a small prototype of a kind of lock free ring buffer that I think is thread safe. It is quite low-level and in the end relies on unsafe too, but I think it could work, and the AnalyserNode would be mostly free (only some memcpy) from an audio thread perspective:

use std::ptr;
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};

const RING_BUFFER_SIZE: usize = 65536; // MAX_FFT_SIZE * 2
const QUANTUM_SIZE: usize = 128;

struct Analyser {
    // keeping this around seems to prevent the memory from being released while we use the ptr
    buffer: Arc<[f32; RING_BUFFER_SIZE]>,
    buffer_ptr: *mut f32,
    index: AtomicUsize,
}

// @todo (?) - impl drop to drop the pointer manually

impl Analyser {
    pub fn new() -> Self {
        let mut buffer = [0.; RING_BUFFER_SIZE];
        let buffer_ptr = buffer.as_mut_ptr();

        Self {
            buffer: Arc::new(buffer),
            buffer_ptr: buffer_ptr,
            index: AtomicUsize::new(0),
        }
    }

    // this runs in the audio thread
    pub fn add_input(&self, src: &[f32]) {
        let mut index = self.index.load(Ordering::SeqCst);
        let len = src.len();

        // push src data into the ring buffer
        if index + len > RING_BUFFER_SIZE {
            // in our test conditions we can't get here yet
        } else {
            // we have enough room to copy src in one shot
            unsafe {
                let src_ptr = src.as_ptr();
                let dst_ptr = self.buffer_ptr.add(index);

                ptr::copy_nonoverlapping(src_ptr, dst_ptr, len);
            }
        }

        index += len;

        if index >= RING_BUFFER_SIZE {
            index -= RING_BUFFER_SIZE;
        }

        self.index.store(index, Ordering::SeqCst);
    }

    // if we only read below index in the control thread, we are sure the memory is clean
}

fn main() {
    let analyser = Analyser::new();

    for _ in 0..2 {
        for i in 0..(RING_BUFFER_SIZE / QUANTUM_SIZE) {
            let data = [i as f32; QUANTUM_SIZE];

            analyser.add_input(&data);

            println!("{:?}", analyser.index.load(Ordering::SeqCst));
        }
    }
}

What do you think, should we try to continue this way, or do you see something wrong that I didn't catch?

b-ma commented 1 year ago

Hum, actually it doesn't seem to work well: the copied values are garbage. That's weird, it was working fine when doing exactly the same thing without the struct / Arc wrapping. I will investigate.

b-ma commented 1 year ago

I think I may have found the problem (inspired by https://github.com/utaal/spsc-bip-buffer/blob/master/src/lib.rs#L89): using a Box seems to be the solution, so that the pointer stays valid:

    pub fn new() -> Self {
        // inspired from https://github.com/utaal/spsc-bip-buffer/blob/master/src/lib.rs#L89
        // heap allocated, but the allocation happens in the control thread (not the audio thread)
        let mut buffer = Box::new([0.; RING_BUFFER_SIZE]);
        let buffer_ptr = buffer.as_mut_ptr();

        Self {
            buffer,
            buffer_ptr,
            index: AtomicUsize::new(0),
        }
    }

Quite a funny thing
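If I understand correctly what went wrong: in the first version I took the pointer to the local array before it was moved into Arc::new, so after the move the pointer was dangling. A minimal illustration (sketch):

use std::sync::Arc;

fn main() {
    // first version: the pointer is taken on the stack-local array...
    let mut stack_array = [0.0f32; 4];
    let dangling = stack_array.as_mut_ptr();
    // ...then the array is moved (copied) into the Arc's heap allocation,
    // so `dangling` keeps pointing at the old stack location
    let _arc = Arc::new(stack_array);
    let _ = dangling;

    // Box version: the array lives on the heap, and as_mut_ptr() points into
    // that allocation, which stays put as long as the Box is kept around
    let mut boxed = Box::new([0.0f32; 4]);
    let stable = boxed.as_mut_ptr();
    let _ = (boxed, stable);
}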

b-ma commented 1 year ago

Ok, I ended up with this: https://gist.github.com/b-ma/a0909191089037b9cbebc2f7bd1c8117, which I think should work quite well in our case. From the unit tests I have written, I really don't see what could go wrong, as the underlying logic is in the end quite simple, but maybe I'm missing something.

I did it in a dummy lib project to really focus on the problem, but from this point I think adapting the Analyser should be quite straightforward.

orottier commented 1 year ago

Hey @b-ma, I am very sorry to ruin your party, but reading and writing the same memory location concurrently is undefined behaviour... :(

This is what miri has to say:

running 6 tests
test tests::test_add_input_aligned ... error: Undefined Behavior: attempting a write access using <183428> at alloc72886[0x0], but that tag does not exist in the borrow stack for this location
   --> src/lib.rs:82:17
    |
82  |                 ptr::copy_nonoverlapping(src_ptr, dst_ptr, len);
    |                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |                 |
    |                 attempting a write access using <183428> at alloc72886[0x0], but that tag does not exist in the borrow stack for this location
    |                 this error occurs as part of an access at alloc72886[0x0..0x200]
    |
    = help: this indicates a potential bug in the program: it performed an invalid operation, but the Stacked Borrows rules it violated are still experimental
    = help: see https://github.com/rust-lang/unsafe-code-guidelines/blob/master/wip/stacked-borrows.md for further information
help: <183428> was created by a SharedReadWrite retag at offsets [0x0..0x1000]

I share your gut feeling that writing f32 values to a fixed location should be safe on 64-bit architectures, but the compiler says no. Compiler developers put it like this: there is no such thing as a benign data race. You must use some form of synchronization. An Arc<[AtomicF32]> with Relaxed memory ordering could do the job and should have reasonable performance characteristics.

But even then, I am not entirely convinced your example will work: there is no reliable way to read the full buffer without risking an intermediate write, and that would result in a garbage FFT.

You could maybe check if there are any crates that attempt to solve this

orottier commented 1 year ago

Please note however, I think you are diving into a very cool domain here. We should explore further. The red line I want to draw is this: either a solution in safe Rust, or a crate written by domain experts. We can use your current implementation as a performance baseline.

A safe implementation I can imagine is this:
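Roughly (sketch only, details to be worked out; std has no AtomicF32, so this uses AtomicU32 plus f32::to_bits / from_bits, with Relaxed ordering for the samples):

use std::sync::atomic::{AtomicU32, AtomicUsize, Ordering};
use std::sync::Arc;

const RING_BUFFER_SIZE: usize = 32768;

struct SafeAnalyserBuffer {
    // f32 samples stored as raw bit patterns, so no unsafe is needed anywhere
    buffer: Arc<[AtomicU32]>,
    index: AtomicUsize,
}

impl SafeAnalyserBuffer {
    fn new() -> Self {
        let buffer: Arc<[AtomicU32]> =
            (0..RING_BUFFER_SIZE).map(|_| AtomicU32::new(0)).collect();
        Self { buffer, index: AtomicUsize::new(0) }
    }

    // audio thread: write one render quantum, then publish the new index
    fn add_input(&self, src: &[f32]) {
        let start = self.index.load(Ordering::Relaxed);
        for (i, &sample) in src.iter().enumerate() {
            let pos = (start + i) % RING_BUFFER_SIZE;
            self.buffer[pos].store(sample.to_bits(), Ordering::Relaxed);
        }
        self.index
            .store((start + src.len()) % RING_BUFFER_SIZE, Ordering::Release);
    }

    // control thread: copy out the most recent dst.len() samples, below the published index
    fn read_latest(&self, dst: &mut [f32]) {
        let end = self.index.load(Ordering::Acquire);
        let start = (end + RING_BUFFER_SIZE - dst.len()) % RING_BUFFER_SIZE;
        for (i, value) in dst.iter_mut().enumerate() {
            let pos = (start + i) % RING_BUFFER_SIZE;
            *value = f32::from_bits(self.buffer[pos].load(Ordering::Relaxed));
        }
    }
}

fn main() {
    let analyser = SafeAnalyserBuffer::new();
    analyser.add_input(&[1.; 128]);
    let mut out = vec![0.; 128];
    analyser.read_latest(&mut out);
    assert_eq!(out[0], 1.);
}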

b-ma commented 1 year ago

Huhu, no problem I can understand your concerns

But (yes, there is a but), I'm still convinced it works (or at least it can / should, whatever idiot compilers say :) ), because there is no possible way to end up reading the memory location that is currently being written, even if both threads access the buffer concurrently: the audio thread only writes at or after the current index and publishes the new index afterwards, while the control thread only reads below the published index.

So, from a strictly logical point of view, I really don't see where there could be any problem with corrupted data (except if copy_nonoverlapping writes somewhere we didn't ask to...)

For information, I took inspiration from these posts and code:

(What is miri? It seems like an even more annoying clippy... I just updated my Rust, I still get no error when compiling, and the tests in the gist are still passing on my side.)


Just saw your new post, so I'll continue:

I understand and agree with your concerns: I'm playing with weird stuff that I don't fully understand here, and more hardcore low-level expertise would be welcome :)

Regarding the algorithm you describe, almost everything is there (except obviously the first point). The only other small differences I can see are:

In any case, I'm perfectly ok with continuing on the "safe" Arc<[AtomicF32]> path for now, as I really think the general design of the node would greatly improve. It will then be very easy to iterate on the buffering strategy later to improve performance, as it is very well circumscribed.

(I will later ask colleagues who are more used to low-level work (C, C++) what they think about the unsafe approach, just to get a better idea of what could go wrong with the implementation I proposed.)

b-ma commented 1 year ago

Hey, I asked a colleague (who has already done this kind of thing in C/C++) about the strategy I proposed for the lock free ring buffer, and here are a few things to add to the discussion:

Maybe we could ask Paul Adenot for his insight too.

orottier commented 1 year ago

Thanks for sharing your new insights. Interesting stuff.

Let me be very clear: the 'safe' version with Arc<[AtomicF32]> can still suffer from a corrupt buffer. Indeed, if the FFT thread is somehow stalled for a second, even the 'safe' version will happily trash the buffer you are currently reading, which will blip the FFT display.

What the safe version does guard against is undefined behaviour, which is a thing we should avoid at all cost. You cannot statistically make undefined behaviour go away: unlikely UB is still UB. Also, Rust targets 16-bit architectures, so we should probably not make any assumptions about 'probably safe'.

Let's measure performance with the unsafe version though! I hope relaxed memory access is very performant!

b-ma commented 1 year ago

ok, you get the point: 1. benches are important to know what we are talking about and 2. reading this is important to know what we are talking about :)

orottier commented 1 year ago

The safe version is merged now. I also added a small render thread CI benchmark for the AnalyserNode, which will aid further measurements.

Some more reading for our interest:

Which is what I used to back my standpoint of "don't go into UB territory" :)

b-ma commented 1 year ago

The safe version is merged now. I also added a small render thread CI benchmark for the AnalyserNode, which will aid further measurements.

Nice!