mas-bandwidth / netcode

Secure client/server connections over UDP
BSD 3-Clause "New" or "Revised" License
2.44k stars 191 forks source link

Fix encryption mapping expiration race-condition #111

Closed kbirk closed 1 year ago

kbirk commented 1 year ago

Hello,

I was doing some load testing and I discovered a race condition in the netcode server logic related to adding and removing encryption mappings.

The issue can occur when the server receives a packet from a connected client on the exact frame is it set to timeout. In this frame if either netcode_encryption_manager_add_encryption_mapping or netcode_encryption_manager_remove_encryption_mapping is executed for another client, netcode_encryption_manager_entry_expired will return true, which allows a connecting client to "steal" the encryption mapping, or a disconnecting client to clear it out.

It's extremely unlikely to occur, but I've been able to reliably reproduce it by spamming a server with clients that connect, send a bunch of packets with frame sleeps ranging from 10ms to >timeout, and then disconnect (or get timed out).

Here's a walkthrough of how both scenarios can occur:

Scenario 1) netcode_encryption_manager_add_encryption_mapping "stealing" another clients encryption mapping:

Now both the server->client_encryption_index entries for ClientA and ClientB refer to the same encryption_index. So whenever the server calls netcode_encryption_manager_touch for ClientA, the mapping is no longer associated with its address and logs: error: encryption mapping is out of date for client X, preventing it from sending packets.

From this point one of two situations follows, either:

1) ClientA is starved of packets and disconnects from it's end, netcode_server_disconnect_client_internal is executed but netcode_encryption_manager_remove_encryption_mapping fails to remove the mapping, since it can no longer associate ClientA's address to it. Now the server state is "corrected". Or: 2) ClientB disconnects before ClientA disconnects, netcode_server_disconnect_client_internal is executed and netcode_encryption_manager_remove_encryption_mapping removes the mapping and decrements server->encryption_manager.num_encryption_mappings. Which will cause ClientA to trigger the netcode_assert( index < encryption_manager->num_encryption_mappings ); in netcode_encryption_manager_touch.

Scenario 2) netcode_encryption_manager_remove_encryption_mapping prematurely clearing a clients encryption mapping:

Now when clientA attempts to access its encryption_mapping a netcode_assert( index < encryption_manager->num_encryption_mapping ) will trigger.

The Solution:

I'm sure there is a better way to do this, but I figured the easiest solution was to just prevent expired mappings that belong to a connected client from being assigned / cleared until after that client is disconnected. To do this I added a client_index field to the encryption_manager struct: an inverse of the servers client_encryption_index. With the changes made in this PR I was no longer able to reproduce the issue. I would normally be able to reproduce the issue within a few minutes of running my tests, I ran this branch for 2 hours with no incidents.

Also, slightly related, this PR removes the return from netcode_server_check_for_timeouts, which prevented timing out more than one client per frame. I'm not sure if that was intentional or not.

gafferongames commented 1 year ago

Excellent