gschup / ggrs

GGRS is a reimagination of GGPO, enabling P2P rollback networking in Rust. Rollback to the future!

local_frame_advantage bigger than i8::MAX #35

Open cscorley opened 2 years ago

cscorley commented 2 years ago

Describe the bug: The quality-report calculations use i8, which can be too small (local_frame_advantage exceeds i8::MAX) during stress testing.

Expected behavior: No panic; just keep skipping frames.

Screenshots: I was testing a game with clumsy on Windows, using this configuration. [screenshot of clumsy settings]
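For illustration, a minimal sketch (not taken from ggrs; the frame numbers are made up) of how a frame advantage beyond i8::MAX fails to fit into an i8:

```rust
// Minimal sketch (not ggrs code): a frame advantage larger than i8::MAX
// cannot be represented as an i8, so a checked conversion fails and an
// unwrap at that point would panic.
fn main() {
    let local_frame: i32 = 200; // hypothetical local frame during a stall
    let remote_frame: i32 = 0;  // remote peer still stuck on its first frame
    let advantage = local_frame - remote_frame; // 200 > i8::MAX (127)

    let as_i8 = i8::try_from(advantage);
    println!("{:?}", as_i8); // Err(TryFromIntError(()))
}
```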

gschup commented 2 years ago

Thanks for your report!

While this would be easy to fix by going from i8 to i32 or comparable, I am very curious how this issue can arise. Usually, when advancing the gamestate without remote input, a PredictionThreshold error occurs to stop the user from advancing too far (the error is thrown here).

This should mean that local_frame_advantage should never be higher than the max prediction threshold, which by default is 8. In your example, you left that threshold at 8.
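For reference, a minimal sketch of the kind of guard described here, not ggrs's actual implementation: the session refuses to advance once it is predicting too far past the last confirmed frame.

```rust
// Sketch of a prediction-threshold guard (not ggrs's actual code).
#[derive(Debug)]
enum SessionError {
    PredictionThreshold,
}

struct Session {
    current_frame: i32,
    last_confirmed_frame: i32,
    max_prediction: usize, // 8 in this example, matching the default
}

impl Session {
    fn advance_frame(&mut self) -> Result<(), SessionError> {
        let frames_ahead = self.current_frame - self.last_confirmed_frame;
        if frames_ahead >= self.max_prediction as i32 {
            // The caller must stop simulating until remote input arrives,
            // so local_frame_advantage should normally stay small.
            return Err(SessionError::PredictionThreshold);
        }
        self.current_frame += 1;
        Ok(())
    }
}

fn main() {
    let mut session = Session {
        current_frame: 0,
        last_confirmed_frame: 0, // no remote input confirmed yet
        max_prediction: 8,
    };
    // Without confirmed remote input, the session halts after 8 frames.
    while session.advance_frame().is_ok() {}
    println!("halted at frame {}", session.current_frame);
}
```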

At this point I can only hypothesize, but maybe I made a mistake somewhere when casting the numerical types from unsigned to signed, and there is a silent overflow due to going negative? I will check it out later this week, probably on or shortly before the weekend.
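For illustration, a minimal sketch of the kind of silent wrap hypothesized here (the frame values are made up and this is not taken from ggrs):

```rust
// Subtracting unsigned frame counters and then casting to a signed type can
// silently yield a negative value. Plain `-` on unsigned integers panics in
// debug builds but wraps in release; wrapping_sub makes the wrap explicit.
fn main() {
    let local_frame: u32 = 5;
    let remote_frame: u32 = 10; // the remote peer is ahead of us

    let diff = local_frame.wrapping_sub(remote_frame); // 4294967291
    let as_i32 = diff as i32; // -5
    let as_i8 = diff as i8;   // -5 here, but garbage for larger differences

    println!("diff = {diff}, as i32 = {as_i32}, as i8 = {as_i8}");
}
```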

PS: Sounds great! 😄

gschup commented 2 years ago

Another thought: since the clumsy settings aren't too extreme (I have tried worse, both with real connections and in clumsy), could it be that one of the clients cannot keep up with the framerate? This could especially be the case with rapier and heavy physics...

cscorley commented 2 years ago

You may be right about it being related to the actual framerate and not the network. I have also tested this by placing a 3-second sleep (a crude way to approximate 180 dropped frames) in one of the client systems, and it seems to just "break through" the PredictionThreshold (after 111 of them) and allow one client to continue rendering, all before the other even completes its first frame. (I do not have any framerate limiting in Bevy itself, just the FPS limit set by bevy_ggrs.)
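For reference, a hedged sketch of that kind of artificial stall, assuming a Bevy 0.7-era API (add_system, Local); this is an illustration, not the reporter's actual setup:

```rust
// Sketch of an artificial stall in a Bevy client (Bevy 0.7-era API assumed).
// Sleeping inside a system blocks the whole schedule for roughly 180 frames
// at 60 FPS, so this client falls far behind its peer.
use bevy::prelude::*;
use std::{thread, time::Duration};

fn stall_once(mut done: Local<bool>) {
    if !*done {
        thread::sleep(Duration::from_secs(3));
        *done = true;
    }
}

fn main() {
    App::new()
        .add_plugins(DefaultPlugins)
        .add_system(stall_once)
        .run();
}
```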

In one test, I was able to get it to produce some "fun errors" on the clients (but I don't know which one was actually sleeping; I forgot to log that info):

client 1

thread 'main' panicked at 'view entity should exist: QueryDoesNotMatch(3v0)', C:\Users\cscorley\scoop\persist\rustup\.cargo\registry\src\github.com-1ecc6299db9ec823\bevy_core_pipeline-0.7.0\src\main_pass_2d.rs:45:14

client 2

thread 'main' panicked at 'assertion failed: frame_to_load != NULL_FRAME && frame_to_load < self.current_frame &&\n    frame_to_load >= self.current_frame - self.max_prediction as i32', C:\Users\cscorley\scoop\persist\rustup\.cargo\git\checkouts\ggrs-151f8bd9eb75dc5d\9e4a20a\src\sync_layer.rs:138:9
MaxCWhitehead commented 3 months ago

I have reproduced a couple of these while stepping through a ggrs test in a debugger (so there are larger delays in wall clock time). This does not play nicely with the code in ggrs that uses wall clock time and can cause issues.

I most commonly hit thread 'main' panicked at 'assertion failed: frame_to_load != NULL_FRAME && frame_to_load < self.current_frame &&\n frame_to_load >= self.current_frame - self.max_prediction as i32'. I believe this is due to a rollback being triggered for disconnect_frame 0 (the disconnect frame is chosen as first_incorrect) when current_frame is also 0.
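A small worked example of why that assertion fails when a rollback targets frame 0 while current_frame is still 0 (the values are chosen for illustration and mirror the assert text, not ggrs internals):

```rust
// Running this panics, mirroring the assert reported above: when both frames
// are 0, `frame_to_load < current_frame` is false, so the conjunction fails.
fn main() {
    const NULL_FRAME: i32 = -1;
    let frame_to_load: i32 = 0; // disconnect frame chosen as first_incorrect
    let current_frame: i32 = 0; // session has not advanced past frame 0 yet
    let max_prediction: i32 = 8;

    assert!(
        frame_to_load != NULL_FRAME
            && frame_to_load < current_frame
            && frame_to_load >= current_frame - max_prediction
    );
}
```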

When debugging, I commented out the time check that triggers network interrupt events, and after that I reproduced the assert in the original post. The remote frame in the local frame advantage computation is based on an estimate using ping + framerate; it does not use the true remote frame, so when debugging or under very poor network conditions it is possible for this value to get very large and trip the assert.
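A rough sketch of the kind of estimate described, not ggrs's actual formula; the function name and numbers are made up for illustration:

```rust
// Guess the remote frame from the last received frame plus how many frames
// should have elapsed over half the round-trip time (illustrative only).
fn estimated_remote_frame(last_received_frame: i32, ping_ms: i64, fps: i64) -> i32 {
    let frames_in_flight = (ping_ms / 2) * fps / 1000;
    last_received_frame + frames_in_flight as i32
}

fn main() {
    // Under a debugger (or terrible network conditions), wall-clock-based ping
    // measurements can balloon, so the estimate drifts far from reality and the
    // resulting frame advantage no longer fits the small integer types.
    let estimate = estimated_remote_frame(10, 30_000, 60); // a 30 s "ping"
    let local_frame = 12;
    let local_frame_advantage = local_frame - estimate;
    println!("estimate = {estimate}, local_frame_advantage = {local_frame_advantage}");
}
```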

(Not sure whether it is sound for this failure to occur considering I commented out the disconnect code, but just sharing notes on that.)

MaxCWhitehead commented 3 months ago

> You may be right about it being related to the actual framerate and not the network. I have also tested this by placing a 3-second sleep (a crude way to approximate 180 dropped frames) in one of the client systems, and it seems to just "break through" the PredictionThreshold (after 111 of them) and allow one client to continue rendering, all before the other even completes its first frame. (I do not have any framerate limiting in Bevy itself, just the FPS limit set by bevy_ggrs.)

I would recommend confirming that disconnect events are handled; I have a suspicion that may be the cause. Once a client is disconnected, GGRS will report this event and continue to advance.
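A minimal sketch of that recommendation; the Session and Event types below are placeholders rather than ggrs's real API (consult the ggrs docs for the actual session and event types):

```rust
// Drain session events every frame and react to disconnects. Without this,
// a disconnect goes unnoticed and the local client keeps advancing, which
// matches the "break through" behavior described above.
enum Event {
    Disconnected { player_handle: usize },
    Other,
}

struct Session {
    pending: Vec<Event>,
}

impl Session {
    fn events(&mut self) -> impl Iterator<Item = Event> + '_ {
        self.pending.drain(..)
    }
}

fn main() {
    let mut session = Session {
        pending: vec![Event::Disconnected { player_handle: 1 }, Event::Other],
    };

    for event in session.events() {
        match event {
            Event::Disconnected { player_handle } => {
                println!("player {player_handle} disconnected; stop predicting their input");
            }
            Event::Other => {}
        }
    }
}
```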