I was doing some load testing and I discovered a race condition in the netcode server logic related to adding and removing encryption mappings.
The issue can occur when the server receives a packet from a connected client on the exact frame is it set to timeout. In this frame if either netcode_encryption_manager_add_encryption_mapping or netcode_encryption_manager_remove_encryption_mapping is executed for another client, netcode_encryption_manager_entry_expired will return true, which allows a connecting client to "steal" the encryption mapping, or a disconnecting client to clear it out.
It's extremely unlikely to occur, but I've been able to reliably reproduce it by spamming a server with clients that connect, send a bunch of packets with frame sleeps ranging from 10ms to >timeout, and then disconnect (or get timed out).
Here's a walkthrough of how both scenarios can occur:
Scenario 1) netcode_encryption_manager_add_encryption_mapping "stealing" another clients encryption mapping:
ClientA successfully connects to the server, gets it's encryption_index index assigned to server->client_encryption_index.
The server does not receive a packet from ClientA until exactly the frame that it would expire / timeout.
In that frame three things happen:
1) server->time is stepped such that the method netcode_encryption_manager_entry_expired returns true for ClientA's encryption_mapping
2) The server receives a NETCODE_CONNECTION_REQUEST_PACKET from a new ClientB. While executing netcode_server_process_connection_request_packet for ClientB's NETCODE_CONNECTION_REQUEST_PACKET the netcode_encryption_manager_add_encryption_mapping mapping re-assigns ClientA's encryption mapping to ClientB (because it returns true for netcode_encryption_manager_entry_expired)
3) The server then processes ClientA's packet to keep the connection for ClientA alive, bumping its client_last_packet_receive_time preventing netcode_server_check_for_timeouts from disconnecting the client.
Now both the server->client_encryption_index entries for ClientA and ClientB refer to the same encryption_index. So whenever the server calls netcode_encryption_manager_touch for ClientA, the mapping is no longer associated with its address and logs: error: encryption mapping is out of date for client X, preventing it from sending packets.
From this point one of two situations follows, either:
1) ClientA is starved of packets and disconnects from it's end, netcode_server_disconnect_client_internal is executed but netcode_encryption_manager_remove_encryption_mapping fails to remove the mapping, since it can no longer associate ClientA's address to it. Now the server state is "corrected". Or:
2) ClientB disconnects before ClientA disconnects, netcode_server_disconnect_client_internal is executed and netcode_encryption_manager_remove_encryption_mapping removes the mapping and decrements server->encryption_manager.num_encryption_mappings. Which will cause ClientA to trigger the netcode_assert( index < encryption_manager->num_encryption_mappings ); in netcode_encryption_manager_touch.
Scenario 2) netcode_encryption_manager_remove_encryption_mapping prematurely clearing a clients encryption mapping:
ClientA successfully connects to the server, gets it's encryption_index assigned to server->client_encryption_index.
The server does not receive a packet from ClientA until exactly the frame that it would expire / timeout.
In the single frame three things happen:
1) server->time is stepped such that the method netcode_encryption_manager_entry_expired returns true for ClientA's encryption_mapping
2) The server receives a NETCODE_CONNECTION_DISCONNECT_PACKET from some other clientX, whose encryption_index is encryption_manager->num_encryption_mapping - 1. While executing netcode_encryption_manager_remove_encryption_mapping, netcode_encryption_manager_entry_expired returns true for ClientA's mapping and it is cleared out.
3) The server then receives a packet to keep the connection for ClientA alive. While processing the packet for ClientA, the client_last_packet_receive_time is bumped, which prevents netcode_server_check_for_timeouts from disconnecting the client.
Now when clientA attempts to access its encryption_mapping a netcode_assert( index < encryption_manager->num_encryption_mapping ) will trigger.
The Solution:
I'm sure there is a better way to do this, but I figured the easiest solution was to just prevent expired mappings that belong to a connected client from being assigned / cleared until after that client is disconnected. To do this I added a client_index field to the encryption_manager struct: an inverse of the servers client_encryption_index. With the changes made in this PR I was no longer able to reproduce the issue. I would normally be able to reproduce the issue within a few minutes of running my tests, I ran this branch for 2 hours with no incidents.
Also, slightly related, this PR removes the return from netcode_server_check_for_timeouts, which prevented timing out more than one client per frame. I'm not sure if that was intentional or not.
Hello,
I was doing some load testing and I discovered a race condition in the netcode server logic related to adding and removing encryption mappings.
The issue can occur when the server receives a packet from a connected client on the exact frame is it set to timeout. In this frame if either
netcode_encryption_manager_add_encryption_mapping
ornetcode_encryption_manager_remove_encryption_mapping
is executed for another client,netcode_encryption_manager_entry_expired
will return true, which allows a connecting client to "steal" the encryption mapping, or a disconnecting client to clear it out.It's extremely unlikely to occur, but I've been able to reliably reproduce it by spamming a server with clients that connect, send a bunch of packets with frame sleeps ranging from 10ms to >timeout, and then disconnect (or get timed out).
Here's a walkthrough of how both scenarios can occur:
Scenario 1) netcode_encryption_manager_add_encryption_mapping "stealing" another clients encryption mapping:
encryption_index
index assigned toserver->client_encryption_index
.server->time
is stepped such that the methodnetcode_encryption_manager_entry_expired
returns true for ClientA's encryption_mapping 2) The server receives aNETCODE_CONNECTION_REQUEST_PACKET
from a new ClientB. While executingnetcode_server_process_connection_request_packet
for ClientB'sNETCODE_CONNECTION_REQUEST_PACKET
thenetcode_encryption_manager_add_encryption_mapping
mapping re-assigns ClientA's encryption mapping to ClientB (because it returns true fornetcode_encryption_manager_entry_expired
) 3) The server then processes ClientA's packet to keep the connection for ClientA alive, bumping itsclient_last_packet_receive_time
preventingnetcode_server_check_for_timeouts
from disconnecting the client.Now both the
server->client_encryption_index
entries for ClientA and ClientB refer to the sameencryption_index
. So whenever the server callsnetcode_encryption_manager_touch
for ClientA, the mapping is no longer associated with itsaddress
and logs:error: encryption mapping is out of date for client X
, preventing it from sending packets.From this point one of two situations follows, either:
1) ClientA is starved of packets and disconnects from it's end,
netcode_server_disconnect_client_internal
is executed butnetcode_encryption_manager_remove_encryption_mapping
fails to remove the mapping, since it can no longer associate ClientA's address to it. Now the server state is "corrected". Or: 2) ClientB disconnects before ClientA disconnects,netcode_server_disconnect_client_internal
is executed andnetcode_encryption_manager_remove_encryption_mapping
removes the mapping and decrementsserver->encryption_manager.num_encryption_mappings
. Which will cause ClientA to trigger thenetcode_assert( index < encryption_manager->num_encryption_mappings );
innetcode_encryption_manager_touch
.Scenario 2) netcode_encryption_manager_remove_encryption_mapping prematurely clearing a clients encryption mapping:
encryption_index
assigned toserver->client_encryption_index
.server->time
is stepped such that the methodnetcode_encryption_manager_entry_expired
returns true for ClientA's encryption_mapping 2) The server receives aNETCODE_CONNECTION_DISCONNECT_PACKET
from some other clientX, whoseencryption_index
isencryption_manager->num_encryption_mapping - 1
. While executingnetcode_encryption_manager_remove_encryption_mapping
,netcode_encryption_manager_entry_expired
returns true for ClientA's mapping and it is cleared out. 3) The server then receives a packet to keep the connection for ClientA alive. While processing the packet for ClientA, theclient_last_packet_receive_time
is bumped, which preventsnetcode_server_check_for_timeouts
from disconnecting the client.Now when clientA attempts to access its encryption_mapping a
netcode_assert( index < encryption_manager->num_encryption_mapping )
will trigger.The Solution:
I'm sure there is a better way to do this, but I figured the easiest solution was to just prevent expired mappings that belong to a connected client from being assigned / cleared until after that client is disconnected. To do this I added a
client_index
field to theencryption_manager
struct: an inverse of the serversclient_encryption_index
. With the changes made in this PR I was no longer able to reproduce the issue. I would normally be able to reproduce the issue within a few minutes of running my tests, I ran this branch for 2 hours with no incidents.Also, slightly related, this PR removes the
return
fromnetcode_server_check_for_timeouts
, which prevented timing out more than one client per frame. I'm not sure if that was intentional or not.