space-wizards / RobustToolbox

Robust multiplayer game engine, used by Space Station 14
https://spacestation14.io
Other
524 stars 395 forks source link

During ApplyGameState(...) client re-handles events already handled by server #4822

Open LordCarve opened 7 months ago

LordCarve commented 7 months ago

The problem

When applying a Server state to the Client (during Tick() -> ApplyState() -> ApplyGameState(GameState curState, GameState? nextState)) side effects of the changes being applied are triggered resulting in effective double-triggering. These side effects may result in client-side changes to the state, and as a result we end up with the dreaded: Server-side game state for tick X ≠ Client-side game state for tick X This is before prediction!

When such a client-side value change happens it leads to a hard network desync 100% of the time. The values are already ACK-ed and neither server not client will attempt to re-synchronize them. Until of course re-syncing of the component concerned is triggered by other means (dirtied/entity leaves and re-enters pvs/full state is sent/etc).

I'm not sure how wide in scope this problem is, but it concerns at least the SharedPhysicsSystem. I guess every side-effect caused by add/delete entity, add/delete component, add/delete component object are in the sights, which is a lot, so that's bad news.

Replication

  1. Call a method on client which a) modifies component values, and b) triggers component value modification side-effect [Example: ClimbSystem.StopClimb(...); component values modified: fixture collision masks; side-effect caused: DestroyFixture(...) destroys fixture and causes EndCollideEvent to be sent, which once again changes the same component values]
  2. Wait for server to sync. Server processes everything correctly (the above flow happens exactly once and server's component values are the expected ones)
  3. Wait for client to sync with server. [Client, when processing ApplyGameState(...) also removes the fixture (StateFrom has the fixture, StateTo does not) which triggers the side-effects (destruction of fixture causes a call of DestroyFixture(...)). Side effects change component values (from server's expected to server's expected + side-effects)
  4. Desync between Server game state for tick X and Client game state for tick X. [For Server DestroyFixture(...) happened once. For Client DestroyFixture(...) happened twice - first on server, then again on client.]

Ideas on how to fix

The game states sent by the server should be treated as absolute, that is, they already include all side-effects that were meant to happen (in fact they already did happen on the server-side, and the resulting game state always includes everything). Client should never trigger any side effects inside ApplyGameState(...). These values should be set raw and as-is.

Other notes

I can replicate it locally 100% of the time.

Screenshots depicting the desync: before (Before)

after (After)

Related server logs:

1. Registered StopClimb whose caller is TriggerCollisionMaskChange --- climbing state is: True; fixture mask is: 10, set to 30
2. Registered StopClimb whose caller is OnClimbEndCollide --- climbing state is: True; fixture mask is: 30, set to 10

(Here we see the flow happening exactly once on the server. First line is the explicit call, second line is side-effect of DestroyFixture(...) caused by the first line.)

Related client logs:

1. Registered StopClimb whose caller is TriggerCollisionMaskChange --- climbing state is: True; fixture mask is: 10,  set to 30
2. Registered StopClimb whose caller is OnClimbEndCollide --- climbing state is: True; fixture mask is: 30, set to 10
3. Registered StopClimb whose caller is TriggerCollisionMaskChange --- climbing state is: True; fixture mask is: 10,  set to 30
4. Registered StopClimb whose caller is OnClimbEndCollide --- climbing state is: True; fixture mask is: 30, set to 10
...
23. Registered StopClimb whose caller is TriggerCollisionMaskChange --- climbing state is: True; fixture mask is: 10,  set to 30
24. Registered StopClimb whose caller is OnClimbEndCollide --- climbing state is: True; fixture mask is: 30, set to 10
25. Registered StopClimb whose caller is TriggerCollisionMaskChange --- climbing state is: True; fixture mask is: 10,  set to 30
26. Registered StopClimb whose caller is OnClimbEndCollide --- climbing state is: True; fixture mask is: 30, set to 10
27. Registered StopClimb whose caller is OnClimbEndCollide --- climbing state is: True; fixture mask is: 10, set to 30

(Here we see that prediction is correct. Every tick client is doing the "full flow" - explicit call + follow-up side effect. But on the sync tick [line 27.], it receives server state after side-effects (fixture mask 10), but also destroys the fixture which causes side-effect again on the Client-side. And the resulting desync is that Client-side fixture mask is now 30, while on Server-side it's 10).

metalgearsloth commented 7 months ago

Movement is entirely content-side and unrelated to the 2 mentioned issues.

LordCarve commented 7 months ago

My current best guess is it's related to Robust.Client.Physics/PredictSystem.Predict as the deepest I got so far was incorrect prediction of contacts' IsTouching on client side.

Prediction disabled this is not occurring (though this doesn't necessarily mean it's engine-side).

With your permission I'd like to keep this here, but I can move it to content issues if you prefer that.

To clarify - it seems physics related, not movement related.

The way it's related to space-wizards/space-station-14#7539 is that in both cases several concurrent predictions result in client state being desynced (in this issue CollisionMask - in that issue it's input). Somehow in reconciliation client is accepting predicted (or reverted-predicted) state as real state even without server's say so. This seems to be an engine problem (game states are not content), but it's hard to say without getting to the bottom of this.

LordCarve commented 7 months ago

With #4836 resolved I can successfully replicate the problem in a test environment and can continue investigating the problem.

Just an update this is not abandoned at this moment.

LordCarve commented 7 months ago

Oh boy I found out what the problem is, and it's not pretty. Will update the OP shortly.

metalgearsloth commented 7 months ago

Is it just GetState not being cloned on client.

LordCarve commented 7 months ago

Is it just GetState not being cloned on client.

I don't think so. Even if we wiped all dirty components every tick and re-cloned them, both the "destroy during wipe" act and "create during clone" act could cause same kind of side-effects (in fact I believe excessive cloning here would only exacerbate the problem).

ApplyState is causing side effects like this (simplified):

  1. New server state has some fixture deleted
  2. During ApplyState client deletes this fixture
  3. But if this fixture was currently colliding, deleting it causes side effect event OnCollisionEnd to be fired
  4. That event may cause changes to the state - if it does, then Server's state ≠ Client's state. Hard desync

It's most prominent with fixtures, but it could be any side effect during ApplyState:

  1. New server state has some new entity
  2. During ApplyState this new entity is created
  3. But this new entity also happens to trigger an event instantly upon being created. OnSomethingHappened event is fired
  4. If OnSomethingHappened changes values, this leads to desync

The reason that this has gone unreported for so long and nobody really noticed the issue, is that usually server-side logic resolves the event-causing scenario, so as it gets to Client, ApplyState only sets the component values and does not trigger any side-effects. Example: for client-side OnCollisionEnd to trigger, it also has to trigger on server-side (client replicates server state after all). And if OnCollisionEnd triggers server-side, server handles that event and already applies all the component changes which are then sent via netcode. Handling OnCollisionEnd on the server USUALLY makes the fixtures no longer collide. If a no-longer-colliding state is replicated on Client, then OnCollisionEnd is not fired again. This is the usual case in which everything works fine despite the issue being present (nobody notices the problem). The other case is the opposite, where server-side OnCollisionEnd ends up not clearing the precondition required for OnCollisionEnd, and since the condition is met, as Client replicates the state it re-triggers OnCollisionEnd again. Which it shouldn't do! Server state already includes the changes done when handling that event... but it does, and - boom - desync.

LordCarve commented 7 months ago

The easiest way to replicate (theoretical - haven't actually done it) is this:

Define an entity that has an accumulator (int), initially 0. After* being instantiated it also triggers an event that it itself catches, and increases own accumulator by 1.

Instantiate the entity server-side and watch the magic happen:

  1. Entity is created. Accumulator value is 0. Event is sent. Event is handled - accumulator value becomes 1 (0 + 1).
  2. Game state is sent to client. It contains a new entity whose component's accumulator has value 1.
  3. Client replicates this state during ApplyState. New entity is created so it creates it. It also clones accumulator value 1.
  4. But then, after* instantiation, event is fired on Client (oh no!). As event is handled, it increases accumulator value by 1 again. Now accumulator value on client is 2.

Server: 1 vs Client: 2

* - it would take some wiggling to make the event be send after creation, not in-the-middle-of-it, but it should be doable as a proof of concept. In reality such an event would be triggered in the middle of instantiation and thus before applying server state - in that case component values end up being overwritten anyway, regardless of whether an event ends up changing them or not.