lidgren / lidgren-network-gen3

Lidgren Network Library
https://groups.google.com/forum/#!forum/lidgren-network-gen3
MIT License

Random Connection Timed Out #81

Closed by HellfireDrew 8 years ago

HellfireDrew commented 8 years ago

We have a Unity PS4 game with a Lidgren networking back-end for both server and client. However, for some reason, clients will get disconnected for seemingly no reason. The only indication of anything going wrong is the StatusChange message saying the connection timed out. We have already tried adjusting the time out lengths on both server and client. Actually, sometimes the time outs seem to occur before the time out period has even passed. It happens regardless of server load; and also, we see the same thing when connecting to our server through gigabit LAN. So, it doesn't seem like there is actually a time out going on... But we are totally at a loss. We don't really know what else to do. Any help in figuring this out would be greatly appreciated.

forestrf commented 8 years ago

It seems that only pings keep an established connection alive, so even if there is other traffic between the machines, losing a single ping will cause a timeout. I made this change to reset the timeout countdown on all traffic, and the problem disappeared: https://github.com/forestrf/lidgren-network-gen3/commit/0f8836b761acacf186d4e8323c71f0e076ff2a80 I hope it helps.
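
To illustrate the idea, here is a minimal, library-agnostic sketch in Python. It is not the Lidgren code or the linked commit: the `ConnectionTimer` class, the `simulate` helper, and the 5-second timeout are all hypothetical, and the scenario assumes every pong is lost while data keeps arriving once per second.

```python
class ConnectionTimer:
    """Hypothetical model of a connection's timeout countdown (not Lidgren's API)."""

    def __init__(self, timeout_seconds, reset_on_any_traffic):
        self.timeout_seconds = timeout_seconds
        self.reset_on_any_traffic = reset_on_any_traffic
        self.deadline = timeout_seconds  # countdown starts at time 0

    def on_receive(self, now, is_pong):
        # A pong always resets the countdown. Other traffic resets it only
        # when reset_on_any_traffic is enabled (the behaviour after the change).
        if is_pong or self.reset_on_any_traffic:
            self.deadline = now + self.timeout_seconds

    def timed_out(self, now):
        return now > self.deadline


def simulate(reset_on_any_traffic):
    """Return True if the connection times out during 10 seconds of data-only traffic."""
    timer = ConnectionTimer(timeout_seconds=5.0,
                            reset_on_any_traffic=reset_on_any_traffic)
    for second in range(1, 11):
        now = float(second)
        if timer.timed_out(now):
            return True
        # Ordinary data arrives every second, but no pong ever does.
        timer.on_receive(now, is_pong=False)
    return False
```

In this model, `simulate(False)` reports a timeout even though data arrives every second, while `simulate(True)` keeps the connection alive.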

HellfireDrew commented 8 years ago

I'm assuming this change would need to be incorporated in both the client and the server?

forestrf commented 8 years ago

If you mean whether it needs to be in all of the compiled clients and servers, then yes. If the problem is the one I described and not another, the timeout can happen either because the server kicks you or because the client disconnects, each thinking that the other is no longer there and acting accordingly.

fversnel commented 8 years ago

Why not make this into a pull request?

forestrf commented 8 years ago

Done. I wasn't sure about it, even though it fixed the problem for me. I kept the changes to a minimum.

fversnel commented 8 years ago

Nice :+1:

HellfireDrew commented 8 years ago

We deployed the fix to our servers, and it doesn't seem to have changed anything, but we haven't updated the clients yet. We won't be able to for another few days. I'll post back when we can get it tested on clients.

HellfireDrew commented 8 years ago

To give a little more information: currently we typically have about 100 clients spread across 8 different servers. On each of those servers, we see a rate of about one connection time out per minute. People also report that when this happens, they sometimes are unable to reconnect right away; sometimes they will have to try for several minutes. I am assuming this is because they are trying to reconnect to the same server, because our servers are part of a round-robin, and there's probably some DNS caching somewhere down the line causing the address to be resolved to the same server. Now, as to why they can't actually connect back to the same server after the connection is lost on both ends, I have no idea.

forestrf commented 8 years ago

You can also try increasing the timeout setting on the server to something bigger; it may be worth a shot. I am looking at the code, and any message of the ordered or sequenced types should send an ACK that resets the timeout countdown when received, so I think there is some problem there.

HellfireDrew commented 8 years ago

We have already increased the timeout length to around 60 seconds, on both client and server. It didn't really seem to do anything.

The messages that are sent to these servers are all ReliableOrdered. Are you saying that the ACK might not be getting through?

forestrf commented 8 years ago

I don't understand how that whole system works, so I don't know what triggers the problem, but the logic seems right.

HellfireDrew commented 8 years ago

Okay, so we fixed the reconnect problem. We were hitting our maximum users on the server, but we didn't have any log messages about it.

The connection time out still eludes us though.

HellfireDrew commented 8 years ago

We were finally able to deploy the client code yesterday. We are still seeing a heavy volume of connection time outs. We might load up Wireshark or something like that to analyze the packets and see if the connections are legitimately timing out. If they are legitimately timing out, then it must be something we're doing wrong on our end. Anything in particular that would be good to look out for?

HellfireDrew commented 8 years ago

It turns out that the remaining connection time outs were caused by a race condition in the version of mono we were running on our master servers, which was manifesting in the latest MongoDB C# driver. We recompiled the MongoDB C# driver source with some workaround code, plus we updated to mono 4.0, and we are now seeing a "normal" amount of connection time outs.

For reference, here is the MongoDB C# driver issue: https://jira.mongodb.org/browse/CSHARP-1144