Add run-ahead setting to reduce input lag

tedsteen commented 6 months ago

I was digging into input latency with my project and I found out that when I set the input in tetanes and then clock a frame the resulting frame does not reflect the input. It takes one frame before the input is reflected.

I realised this when I was running the emulator at like 1% speed and pressed a button only to see the effect 2 frames later.

I tried this in mesen and other emulators and found that an input set before a frame is immediately reflected in the rendered output.

Haven't dug into why yet, but wanted to make you aware.

I basically do this:

*self.control_deck.joypad_mut(Player::One) = Joypad::signature((my_state).into());
self.control_deck.clock_frame()

lukexor commented 6 months ago

Looking into this a bit more - this is actually normal without doing anything extra. Mesen and other emulators have run-ahead features that help reduce input lag by advancing the frame clock ahead. I've not yet implemented run-ahead at all so I'll change this ticket to a feature request.

Also - for your example, you shouldn't be using 'Joypad::signature' at all for input. It's there to support FourScore https://www.nesdev.org/wiki/Four_player_adapters#Four_Score

Instead it should be:

self.control_deck.joypad_mut(Player::One).set_button(JoypadBtn::Down, true);

tedsteen commented 6 months ago

> Looking into this a bit more - this is actually normal without doing anything extra. Mesen and other emulators have run-ahead features that help reduce input lag by advancing the frame clock ahead. I've not yet implemented run-ahead at all so I'll change this ticket to a feature request.
>
> Also - for your example, you shouldn't be using 'Joypad::signature' at all for input. It's there to support FourScore https://www.nesdev.org/wiki/Four_player_adapters#Four_Score
>
> Instead it should be:
> self.control_deck.joypad_mut(Player::One).set_button(JoypadBtn::Down, true);

ok, that explains it then!

Yeah, I saw the API but my joypad state is already a byte so it would be a bit messy to map that to a bunch of set_button calls.

lukexor commented 6 months ago

> Yeah, I saw the API but my joypad state is already a byte so it would be a bit messy to map that to a bunch of set_button calls.

I see! Well in that case I think I'll rename that method to from_bytes
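For anyone who does want to go through per-button setters from a packed byte, here is a minimal sketch of the mapping. The `Button` and `Pad` types are hypothetical stand-ins, not the tetanes-core `JoypadBtn`/`Joypad` types, and the bit layout (bit 0 = A through bit 7 = Right, the order the standard NES controller reports buttons) is an assumption about how the byte is packed.

```rust
/// Hypothetical stand-in for a joypad button enum (not the tetanes-core type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Button {
    A,
    B,
    Select,
    Start,
    Up,
    Down,
    Left,
    Right,
}

/// Buttons in standard NES controller report order.
/// Assumption: bit 0 of the state byte is A and bit 7 is Right.
pub const BUTTON_ORDER: [Button; 8] = [
    Button::A,
    Button::B,
    Button::Select,
    Button::Start,
    Button::Up,
    Button::Down,
    Button::Left,
    Button::Right,
];

/// Hypothetical stand-in for a joypad exposing a per-button setter.
#[derive(Debug, Default)]
pub struct Pad {
    pressed: [bool; 8],
}

impl Pad {
    pub fn set_button(&mut self, button: Button, pressed: bool) {
        self.pressed[button as usize] = pressed;
    }

    pub fn is_pressed(&self, button: Button) -> bool {
        self.pressed[button as usize]
    }
}

/// Applies a packed state byte to the pad with one `set_button` call per bit.
pub fn apply_state_byte(pad: &mut Pad, state: u8) {
    for (bit, button) in BUTTON_ORDER.iter().enumerate() {
        pad.set_button(*button, state & (1 << bit) != 0);
    }
}
```

The loop keeps the call site to a single line, at the cost of eight setter calls per frame; assigning the byte directly avoids that, which is the trade-off discussed above.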

tedsteen commented 6 months ago

A nice article on the subject https://bsnes.org/articles/input-run-ahead/

lukexor commented 6 months ago

Thanks! I'll check it out. Looking at how Mesen does it, I think I have all the pieces I need to do it as well so hopefully shouldn't be too hard.

lukexor commented 6 months ago

So I tried this out a bit. Unfortunately, debug builds aren't fast enough to do 2x frames for a run-ahead of 1, let alone 2 or 3, so I have some optimizations to make to account for the extra frame time. I also need to make it work correctly with other features like rewind/recording/playback and speed changes.

lukexor commented 6 months ago

I've pushed up a version that works well in TetaNES with release mode with 1 run ahead frame. I'm still working out how to enable users of tetanes-core to easily use run ahead without having to manually implement it, but it's a bit tricky due to the state rewinding.

My current idea is a clock_frame_ahead method on ControlDeck similar to clock_frame which takes a FnOnce closure - providing callers with the generated frame buffer and audio samples for that frame. An alternative would be like a clock_frame_ahead_into method that takes frame: &mut [u8] and samples: &mut [f32] parameters to copy data into. Of course there's also returning owned Vecs but allocations should be avoided.

The reason is that once the run-ahead frame clock is finished, the future frame data gets discarded and the state is rewound back to the frame where input was initially processed, so the output has to be handed to the caller before that happens.
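The save/run-ahead/rewind cycle can be sketched with a toy model. Everything here is an assumption for illustration: `ToyDeck` is not `ControlDeck`, its one-frame input lag is hard-coded, the "picture" is just a byte, and the closure signature is only one possible shape for a `clock_frame_ahead`-style method.

```rust
/// Toy emulator with one frame of built-in input lag: the picture returned by
/// `clock_frame` reflects the input latched during the previous frame.
/// Hypothetical model, not the tetanes-core `ControlDeck`.
#[derive(Debug, Clone, Default)]
pub struct ToyDeck {
    pub frame_number: u32,
    latched_input: u8,
}

impl ToyDeck {
    /// Clocks one frame and returns its "picture" (the input it displays).
    pub fn clock_frame(&mut self, input: u8) -> u8 {
        let picture = self.latched_input;
        self.latched_input = input;
        self.frame_number += 1;
        picture
    }

    /// Clocks one real frame, peeks `run_ahead` frames into the future,
    /// passes the future picture to `on_frame`, then rewinds to the real frame.
    pub fn clock_frame_ahead<R>(
        &mut self,
        input: u8,
        run_ahead: u32,
        on_frame: impl FnOnce(u8) -> R,
    ) -> R {
        // The real frame: this is the state we keep.
        let mut picture = self.clock_frame(input);
        // Save state so the speculative frames can be undone.
        let saved = self.clone();
        for _ in 0..run_ahead {
            picture = self.clock_frame(input);
        }
        // The caller must consume the output before the state is rewound.
        let result = on_frame(picture);
        *self = saved;
        result
    }
}
```

With `run_ahead` set to 1, the picture handed to the closure already reflects the current input, while `frame_number` only advances by one per call. The closure exists because the future output is discarded as soon as the state is restored.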

Separately I'm also exploring predictive component catch-up techniques to reduce CPU pressure. Right now all components are clocked every CPU cycle with cycle accuracy enabled. Disabling it clocks every CPU instruction instead, which can result in a 4-8% boost in performance, though some games won't work correctly at all.

tedsteen commented 6 months ago

The clock_frame_ahead_into that takes buffers to copy data into sounds like a reasonable API to me.

An opportunity for optimisation that crossed my mind might be to internally do a similar thing where the buffers could be wrapped in an Option<...> and if there are no buffers you don't do the actual rendering which might save you some cycles when doing run-aheads?

I don't know if, for example, this function has any side-effects on the emulation, but if it doesn't you could skip that call entirely if there's no buffer provided (headless run-ahead clocking).

You could do the same thing with the APU and skip the mixing and downsampling etc as well. IIRC the downsampling is a bit compute heavy.

Exposing that all the way out to the end user would get rid of the internal buffers as well. The user would provide their own buffers and you would fill them up.

... Or simply set a "headless" flag in the APU and PPU before doing a headless run.

Edit: I did a quick test and was able to do emulation.run_ahead=4 without missing frames when skipping the audio mixing in headless mode.
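A minimal sketch of the optional-buffer idea, assuming a toy PPU whose only state is a frame counter. `ToyPpu` and its fill-with-a-shade "rendering" are hypothetical; the point is only that emulation state advances either way while the output work is skipped when no buffer is supplied.

```rust
/// Hypothetical toy PPU: not the tetanes-core PPU.
pub struct ToyPpu {
    pub frame_count: u64,
}

impl ToyPpu {
    /// Clocks one frame. Emulation state always advances; pixels are only
    /// written when the caller supplies a buffer (headless otherwise).
    pub fn clock_frame(&mut self, frame: Option<&mut [u8]>) {
        self.frame_count += 1;
        if let Some(buffer) = frame {
            // Stand-in for real rendering: fill with a per-frame shade.
            let shade = (self.frame_count % 256) as u8;
            buffer.fill(shade);
        }
    }
}
```

This only works if the skipped rendering path has no side effects that the emulation depends on, which is exactly the open question raised above.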

lukexor commented 6 months ago

Great! Thanks for the feedback. Yeah, I had initially tried passing an is_run_ahead flag down to components and didn't notice any huge performance wins, but I didn't actually benchmark it so I'll give it another go around.

lukexor commented 6 months ago

Surprisingly, last time I ran numbers, audio processing was a minuscule amount of the CPU overhead. What really eats up the ~7ms avg frame time is render_pixel because it's running over 5 million times per frame (3x the clock rate for every dot). Cutting that out should definitely help but when I tried it I had some visual issues so I'll need to explore more.

tedsteen commented 6 months ago

Another idea: look into SIMD, for example using https://github.com/arduano/simdeez

I've heard some sunshine stories around emulators and SIMD :)

Perhaps there should be a generic optimization issue? There are other ideas, for example const generics for memory of known size, using #[inline] in strategic places, etc.

lukexor commented 6 months ago

Thanks! I have SIMD on my list of things to explore. Re: const generics, I don't think it'll help much. Any memory that is small enough for the stack is already a static array. I do need to come back through with #[inline] though. I removed a ton of them because it's the sort of optimization you only make when you profile and know it helps, and I had several uses of #[inline(always)], which can actually result in worse performance. However, that was before I split off tetanes-core, and Rust generally won't inline non-generic functions across crate boundaries without #[inline] (LTO aside), so I need to re-add them on some methods.
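As a small illustration of the kind of method that tends to benefit, here is a sketch of a hot accessor on a public type in a library crate. The `Bus` type is hypothetical and not taken from tetanes-core; the 2 KiB RAM with address mirroring is standard NES behaviour and is used only to give the accessor something to do.

```rust
/// Hypothetical memory bus in a library crate (not the tetanes-core type).
pub struct Bus {
    ram: [u8; 2048],
}

impl Bus {
    pub fn new() -> Self {
        Self { ram: [0; 2048] }
    }

    /// Small, frequently called accessor. `#[inline]` makes the body
    /// available to downstream crates as an inlining candidate; it is a
    /// hint, unlike `#[inline(always)]`, so the compiler can still decline.
    #[inline]
    pub fn read(&self, addr: u16) -> u8 {
        // The 2 KiB of RAM is mirrored, so mask the address into range.
        self.ram[(addr & 0x07FF) as usize]
    }

    #[inline]
    pub fn write(&mut self, addr: u16, value: u8) {
        self.ram[(addr & 0x07FF) as usize] = value;
    }
}
```

Whether the attribute actually helps still needs profiling per method, as noted above.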

I'm getting ready to push up the new run ahead methods. Unfortunately, I can't skip mixing audio because of how the filters process and accumulate samples. Also, the PPU frame rate is not directly tied to the audio output rate, so ignoring samples based on a frame boundary isn't accurate and so conditionally outputting the audio results in buzzing.
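To illustrate why a stateful filter makes skipping samples audible, here is a minimal sketch using a one-pole low-pass filter. The filter type and coefficient are assumptions chosen for illustration; the actual tetanes-core filter chain is not shown in this thread.

```rust
/// Hypothetical one-pole low-pass filter: each output depends on every
/// sample that came before it through `state`.
pub struct LowPass {
    alpha: f32,
    state: f32,
}

impl LowPass {
    pub fn new(alpha: f32) -> Self {
        Self { alpha, state: 0.0 }
    }

    /// Moves the internal state a fraction `alpha` toward the new sample.
    pub fn process(&mut self, sample: f32) -> f32 {
        self.state += self.alpha * (sample - self.state);
        self.state
    }
}

/// Feeds all `samples` through a fresh filter and returns the final output
/// (0.0 for empty input).
pub fn last_output(alpha: f32, samples: &[f32]) -> f32 {
    let mut filter = LowPass::new(alpha);
    let mut out = 0.0;
    for &sample in samples {
        out = filter.process(sample);
    }
    out
}
```

Feeding four samples of 1.0 with `alpha = 0.5` ends at a different value than feeding only the last one, because the filter that skipped the first three never accumulated them. Dropping samples during run-ahead frames would introduce that kind of jump in the output, which is consistent with the buzzing described above.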

I will look into providing headless options though if the caller does not care about output. For the frame buffer, however, even a headless run would likely want the frame data (I imagine). Just clocking the emulation without any outputs seems fairly useless unless you're only trying to get to a certain frame with known inputs. The use case I'm thinking of is like an AI learning algorithm - it would need the frame output bytes to parse even if there's no rendering of the bytes going on.

During some initial testing I've found that disabling frame pixel rendering results in a 6-10% increase in performance while disabling audio mixing results in a 35-60% increase which was huge and led me to dive into improving it. I had decided to opt into a moving average filter instead of an identity filter that rusticnes used. Removing that was a 14% increase and I shaved off another 5% by removing some additional filter layers that, as far as I can tell, have no audible effect.

I've also been thinking about getting rid of the internal buffers, but there are too many features and benefits relying on them - for example - the zapper gun requires the pixel data for the currently rendering frame to get the pixel brightness where the mouse is aiming.

The other benefit of having internal buffers is better cache locality and batching of operations. Having to switch between fetching local PPU/APU memory and whatever outside buffer memory is provided could cause frequent cache misses. Not to mention that TetaNES has different buffer requirements than the emulation does. For example, the frame buffer used by TetaNES comes from a buffer pool of re-usable Vec allocations because the emulation is running on a different thread from the renderer and, due to the multi-sync, I'm not always copying the frame each time it's clocked, but only when there's an available buffer based on VSync. For the audio I'm maintaining a separate buffer that holds more than a single frame of audio to buffer and prevent underruns, so only having an API where the emulation copies data into a passed buffer is much more limiting.

With all that explained, here's what I've got planned:

- #[inline] several methods that may see common use outside of tetanes-core
- Remove some unnecessary APU filters to increase performance. If you notice any degradation in quality, lmk but I couldn't tell a difference in several games I tested.
- Add a new headless_mode: HeadlessMode to ControlDeck::Config with an associated set_headless_mode method that allows you to select skipping either video or mixing or both.
- Add a new ControlDeck::frame_buffer_into method if you want to bring your own frame buffer. This is an alternative to ControlDeck::frame_buffer which decodes the internal PPU buffer of Vec<u16> down to an array of u8 RGB pixels. It avoids an extra copy if you've already got a buffer.
- Two new clock frame methods: ControlDeck::clock_frame_output and ControlDeck::clock_frame_into, which provide different ways of getting a combined frame/audio output after clocking a frame and remove the need to manually clear audio samples after clocking.
- Two new clock run ahead frame methods: ControlDeck::clock_frame_ahead and ControlDeck::clock_frame_ahead_into
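As a rough picture of what a bring-your-own-buffer decode like frame_buffer_into might do, here is a sketch that expands u16 pixel values into RGB bytes in a caller-provided slice. The function name, the palette argument, and the index-wrapping behaviour are all assumptions for illustration; the real method's signature and decoding are not shown in this thread.

```rust
/// Decodes `pixels` (palette indices stored as u16) into RGB bytes written
/// to `out`, avoiding any allocation. Hypothetical helper, not the
/// tetanes-core API.
///
/// Panics if `out` is not exactly three bytes per pixel or `palette` is empty.
pub fn decode_rgb_into(pixels: &[u16], palette: &[[u8; 3]], out: &mut [u8]) {
    assert!(!palette.is_empty(), "palette must not be empty");
    assert_eq!(out.len(), pixels.len() * 3, "out must hold 3 bytes per pixel");
    for (pixel, rgb) in pixels.iter().zip(out.chunks_exact_mut(3)) {
        // Assumption: out-of-range indices wrap around the palette.
        let color = palette[*pixel as usize % palette.len()];
        rgb.copy_from_slice(&color);
    }
}
```

A caller that already owns a frame buffer (for example, one borrowed from a buffer pool) can pass it in directly, which is the extra copy the list above says this kind of method avoids.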

tedsteen commented 6 months ago

Awesome :) I've been trying out the latest and it is working nicely!

One note, I don't want to make things more complicated, but I did do this: https://github.com/tedsteen/nes-bundler/pull/113/files#diff-dd9cc2129b9d4d7684f4d1dd9eac835bb46a0a202f49ad85400a39e9199a827fR108-R166

This is because I have my own way of filtering my video (I use my own palette). I don't expect the API to necessarily handle my needs like that but I wanted to make you aware of the use case. I tried to extend the tetanes-core VideoFilter with a custom variant where you could provide your own filter, but I ran into issues with serialization and cloning. So I went back and did a custom run-ahead in my own code.

lukexor commented 6 months ago

Great! I'll explore that use case some more but so you know, I added that palette into the pixelate filter in tetanes-core so you shouldn't need to do it yourself anymore. I like the idea of a custom filter so I'm gonna try some ideas there. I also want to be able to do filters entirely in shaders in the future so I'll need to handle that too someday.

tedsteen commented 6 months ago

> Great! I'll explore that use case some more but so you know, I added that palette into the pixelate filter in tetanes-core so you shouldn't need to do it yourself anymore. I like the idea of a custom filter so I'm gonna try some ideas there. I also want to be able to do filters entirely in shaders in the future so I'll need to handle that too someday.

That's cool but I support providing your own palette https://github.com/tedsteen/nes-bundler/blob/master/config/README.md

This is absolutely not a problem tho. It works great :)

lukexor commented 6 months ago

Cool! I might look into adding support for custom palettes and filters!

lukexor / tetanes

Add run-ahead setting to reduce input lag #216