serenity-rs / songbird

An async Rust library for the Discord voice API
ISC License
391 stars 111 forks source link

Potential bug when writing to `opus_slot` in `mix_symph_indiv` for Opus Passthrough. #230

Closed Pranav-Badrinathan closed 7 months ago

Pranav-Badrinathan commented 7 months ago

Songbird version: (version) Songbird 0.4.1

Rust version (rustc -V): (version) rustc 1.75.0 (82e1608df 2023-12-21)

Serenity/Twilight version: (version) Serenity 0.12.1

Output of ffmpeg -version, yt-dlp --version (if relevant): ...

Description: If this is not a bug, I apologise for the inconvenience. I tried my best to find something wrong in my code but could not find anything.

My use case for this crate is to act as a bridge between streamed sound data and discord. After help at #223, I implemented a custom ReadOnlySource implementation called OpusStream, as FelixMcFelix outlined here. This allows me to play the track once, and send in music frames via an mpsc channel whenever I receive the data.

The implementation works for the first few seconds, and the sound is very nice and smooth. However, it always panics at the exact specific moment in the song each time.

The error it outputs is: Bounds check performed, and failure will block passthrough.: Error { kind: WriteZero, message: "failed to write whole buffer" }

I tried finding it found the error occurs in mix_symph_indiv, when attempting to write the buffer with the audio data to the opus_slot slice. https://github.com/serenity-rs/songbird/blob/8f60cf573e1f49ba82d104273a26bee960fc1435/src/driver/tasks/mixer/mix_logic.rs#L71-L88

After looking into the error, I found that the write_all implementation for a slice returns this particular error if the size of the buffer written is not equal to the size of the buffer provided to be written. In other words, if the buffer is only partially written.

fn write_all(&mut self, data: &[u8]) -> io::Result<()> {
    if self.write(data)? == data.len() {
        Ok(())
    } else {
        Err(io::const_io_error!(ErrorKind::WriteZero, "failed to write whole buffer"))
    }
}

[self.write]() here chooses the smallest between self.len() and data.len(), and returns this to write_all.

fn write(&mut self, data: &[u8]) -> io::Result<usize> {
    let amt = cmp::min(data.len(), self.len());
    let (a, b) = mem::take(self).split_at_mut(amt);
    a.copy_from_slice(&data[..amt]);
    *self = b;
    Ok(amt)
}

In my project, I have made sure to provide the correct frame sizes according to the DCA spec and I am making sure to pass in the data in 20ms frames, so the playback properly works for the first few seconds. So I am led to believe that the size of opus_slot could be smaller than my passed in data slice's size (which approximates to 250 each time), which causes this error.

This opus_slot, as far as I can comes from here: https://github.com/serenity-rs/songbird/blob/087e5f2292b2c82249f579e1286a1c985654222d/src/driver/tasks/mixer/mod.rs#L749-L750

Steps to reproduce: Here is the relevant section in my code. I do not know what else I can provide here, but please do message if you need anything.

Pranav-Badrinathan commented 7 months ago

I just did some testing using different pieces of music to see if whether that would have any impact (i.e different sizes for each 20ms frame, unnatural spikes in packet size that could be causing this, etc.), and did not find any. I tried a few songs, and they all were around 250 bytes in size for 20ms, and the error was happening very consistently after the same amount of time for all the songs.

FelixMcFelix commented 7 months ago

I think I can guarantee we always have more than 250B; the packet buffer slots are all at least 1KiB. Can you insert a print or breakpoint here and find out what the slice lengths actually are believed to be?

FelixMcFelix commented 7 months ago

https://github.com/serenity-rs/songbird/blob/8f60cf573e1f49ba82d104273a26bee960fc1435/src/constants.rs#L59

We (should) have 1460B per call per packet total. You have a bit less once you've paid for RTP headers + encryption tag storage, but this is all less than ~24B.

Pranav-Badrinathan commented 7 months ago

I think I can guarantee we always have more than 250B; the packet buffer slots are all at least 1KiB. Can you insert a print or breakpoint here and find out what the slice lengths actually are believed to be?

As in the slot length? I can do that. I'll get back with that info in a bit

We (should) have 1460B per call per packet total. You have a bit less once you've paid for RTP headers + encryption tag storage, but this is all less than ~24B.

This makes far more sense. Perhaps the vec is not being reserved with_capacity or with something that reserves that memory? Seems unlikely, but I have not found where it sets the slot size in code so can't comment on that.

FelixMcFelix commented 7 months ago

I think I can guarantee we always have more than 250B; the packet buffer slots are all at least 1KiB. Can you insert a print or breakpoint here and find out what the slice lengths actually are believed to be?

As in the slot length? I can do that. I'll get back with that info in a bit

I think ideally we want to see both the slot length and what songbird believes the length of your packet is. :)

Pranav-Badrinathan commented 7 months ago
Frame size info... 248
Actual data... 248
SLOT LEN: 1432, BUF LEN: 14089
thread '<unnamed>' panicked at C:\Users\prana.LAPTOP-MODAKBGP\.cargo\registry\src\index.crates.io-6f17d22bba15001f\songbird-0.4.1\src\driver\tasks\mixer\mix_logic.rs:88:30: Bounds check performed, and failure will block passthrough.: Error { kind: WriteZero, message: "failed to write whole buffer" }

Alright, it is not the slot. I added a print statement and recompiled songbird, and it is definitely the buffer. For whatever reason, the buffer is huge in size here. The "Frame size info ..." here refers to the first 2 bytes in a DCA frame that describe the size to expect. So on my Read implementation, it has detected 248. The "Actual data ..." is the length of the data returned to symphonia, again detected 248. It's bizarre that the buf has such a massive length here... I'll look into this and update.

Pranav-Badrinathan commented 7 months ago

Okay, I found something interesting. The problem is not with Songbird, as the error above shows. It might be with symphonia... It seems that Symphonia's Read implementation for MediaSourceStream does not fully read the buffer. For a buffer size of 248, there is one iteration where it only reads 182. This leaves 66 in the buffer, which it tries to reuse for the subsequent few reads, which eventually causes an error and thread panic.

The error above happens if I do NOT restart the bot but try again by starting another track and playing the music. Potentially similar to the first panic, perhaps due to reuse of the half-consumed buffer, the unused buffers start piling up internally till it finally takes the new buffer's entire collected data together? I am unsure if songbird is the cause here, so I'll leave this issue open. I'll keep this thread updated with anything I find.

Pranav-Badrinathan commented 7 months ago

Ok, finally managed to trip the system stepping through the debugger. Here is what I found:

  1. There is a bug in the opus passthrough in songbird. buf_size_fatal is not set and used properly: https://github.com/serenity-rs/songbird/blob/8f60cf573e1f49ba82d104273a26bee960fc1435/src/driver/tasks/mixer/mix_logic.rs#L80-L85 As you can see, buf_size_fatal is true only when the buffer is less than or equal in size to the slot. This should be reversed, it should be fatal if the buffer size is greater than or equal to the slot size. This is sent into record_and_check_passthrough_strike_final (heck of a name lol). https://github.com/serenity-rs/songbird/blob/8f60cf573e1f49ba82d104273a26bee960fc1435/src/driver/tasks/mixer/state.rs#L70-L77 Here, passthrough is blocked if fatal is true or if the packet has accrued too many strikes. Notice it expects fatal to be true if passthrough needs to be blocked, and buf_size fatal is false if passthrough needs to be blocked.

  2. Symphonia reads opus frames only partially if it reaches the end of its internal ringbuffer. Its ringbuffer has a max size of 65535 (u16 max), after which it wraps around and starts reusing from index 0. The point I made in the comment above, about Symphonia not reading the full 248 byte buffer is because it was discarding anything that went over it's ringbuffer limit. This throws the entire thing out of sync as the code thinks this frame has been fully read and moves on to process the next DCA frame length. It takes these 2 bytes, however, from the previously half read frame. This number can be anything, there is no guarantee, and is returned to songbird as the frame length passed in by my read implementation. This corrupted (?) size is used by the DCA format reader to request the next frame, and the size is usually far too large to fit in a packet.

I think using mp4 or PCM or any other format that does not require frame times to be reported like DCA, symphonia's system would work just fine. Songbird probably do this for most of its other supported formats, and this ringbuffer only needs to handle decoded PCM, that is then encoded using audiopus, hence why this issue has not been brought up (I couldn't find one in issues yet, at least.)


On a side note something in here could perhaps explain why I was experiencing consistent stuttering in #223.

FelixMcFelix commented 7 months ago

As you can see, buf_size_fatal is true only when the buffer is less than or equal in size to the slot. This should be reversed, it should be fatal if the buffer size is greater than or equal to the slot size. This is sent into record_and_check_passthrough_strike_final (heck of a name lol).

Good catch, thanks!

  1. Symphonia reads opus frames only partially if it reaches the end of its internal ringbuffer.

Symphonia is allowed to do this, and you need to be able to handle straddled reads -- practically speaking, an impl Read should even be able to handle a caller who decides to read your contents in 1-byte chunks. It looks to me like you're not quite doing that and are assuming you get a sufficiently large buffer:

https://github.com/Pranav-Badrinathan/minstrel-bot/blob/98574a93b76d502c1c10094850d100e2d174da7c/src/bot.rs#L200-L202

You need to actually add size to the current position, and only remove the current frame once you know it has been fully read. The sketch code in #223 includes this.

    // else
    //   fill buf from current_frame
    //   update pos, chunk_pos
    //   unset current_frame if done (chunk_pos == current_frame.len() + 2) <----- Here

Subsequent reads then use (chunk_pos - 2) as an index to know where to resume reading.

Pranav-Badrinathan commented 7 months ago

You need to actually add size to the current position, and only remove the current frame once you know it has been fully read. The sketch code in https://github.com/serenity-rs/songbird/discussions/223 includes this.

This was what I was attempting to figure out after I realized what I needed to do, didn't quite get it yesterday. Thanks for the guidance. I'll close this issue now that any actual bugs/problems have been fixed. Thanks!