Threading and overrun/underrun

JasonBrownDeveloper commented 6 months ago

This PR isn't meant to be taken as is. It's to generate a conversation about architecture. Currently the MCU and LCD main loops are both wholly constrained by the work thread mutex. So there is no threading benefit as neither can run while the other has the mutex. I haven't deep dived the LCD loop yet, but in this PR, I've done a more classical approach to updating visuals on a 60 fps time scale.

Also the read/write pointers aren't managed and the current setup only works if you generate exactly as many samples as you need for the callback each time. That doesn't work on my machine and probably a lot of others. This PR reports the over/under runs but the output is very crackly as there's not enough samples to fill the time.

jcmoyer commented 6 months ago

Currently the MCU and LCD main loops are both wholly constrained by the work thread mutex. So there is no threading benefit as neither can run while the other has the mutex.

My understanding of the current implementation is that the MCU tries to fill the sample buffer until it's full. After that, there's no work to do so it unlocks and yields so the main thread can render the LCD. I don't think you can remove the mutex because the MCU writes LCD state, and if it's not synchronized you might observe invalid state while rendering.

Also the read/write pointers aren't managed and the current setup only works if you generate exactly as many samples as you need for the callback each time. That doesn't work on my machine and probably a lot of others.

I think there are two bugs in the current code.

First, it doesn't distinguish between an empty buffer and a full buffer. In a proper ringbuffer the write head must always be ahead of the read head. When the program starts, the buffer is empty (sample_write_ptr == sample_read_ptr). The first pointer that can move is the sample_read_ptr because it is unconditionally advanced in the audio callback. The work thread cannot reach the code that starts advancing the write pointer until this condition becomes false. So from the very start, the ringbuffer invariant is broken because the read head moves past the write head.

I assume the author meant for the MCU to unlock here when the buffer is full because it doesn't make sense to stop running the emulator when the buffer is empty - the audio thread is starved for samples. I tried to fix this in #68 by instead checking if the ringbuffer is full using sample_write_ptr + K == sample_read_ptr where K is some amount (1 or 2 audio frames) that accounts for oversampling. The work thread yields when the buffer is full, and the audio thread doesn't advance the read pointer unless there is data to read. This works out because the emulator fills the buffer faster than the audio thread can drain it.

The second bug is that sample_write_ptr and sample_read_ptr aren't atomically updated. I think it's possible for both the work thread and audio thread to observe outdated values - how this manifests I'm not yet sure because even without atomic ops it sounds fine on my end. My guess is that the effective audio buffer size is slightly smaller than it should be, and it's periodically corrected at least on x86 by sheer luck (the optimizer didn't remove stores/loads, and x86 happens to have guarantees that stores will eventually be observable).

JasonBrownDeveloper commented 6 months ago

My understanding of the current implementation is that the MCU tries to fill the sample buffer until it's full. After that, there's no work to do so it unlocks and yields so the main thread can render the LCD. I don't think you can remove the mutex because the MCU writes LCD state, and if it's not synchronized you might observe invalid state while rendering.

That’s also how I inferred the intended behavior should be. But again that makes one thread dependent on the other. Instead of releasing the mutex in the one place where the buffer is full, you could instead just call the LCD update function there. It the desire is to have separate logical threads to model the hardware, I would suggest maybe a co-thread instead, a la bsnes.

the audio thread is starved for samples. I tried to fix this in https://github.com/nukeykt/Nuked-SC55/pull/68 by instead checking if the ringbuffer is full using sample_write_ptr + K == sample_read_ptr where K is some amount (1 or 2 audio frames) that accounts for oversampling. The work thread yields when the buffer is full, and the audio thread doesn't advance the read pointer unless there is data to read.

For whatever reason on my machine nuked doesn’t generate near enough samples per callback to fill the time, an underrun. So I observe that the read pointer marches forward at a steady pace, sometimes lapping the write pointer in the buffer. Because the write pointer is so far behind the read pointer those samples aren’t heard until the read pointer wraps around to vaild data again. The experience is a chugging in sound output as I get a stretch of good data then 0ed data. With my patch I’m only feeding the callback as much data as is available so the read pointer can never pass the write pointer. The experience is now a crackling sound as the samples are being played as they're available.

nukeykt / Nuked-SC55

Threading and overrun/underrun #72