lidgren / lidgren-network-gen3

Lidgren Network Library
https://groups.google.com/forum/#!forum/lidgren-network-gen3
MIT License
1.19k stars 331 forks source link

ReceivedPong doesn't consider late/not in order pong messages #131

Open aienabled opened 4 years ago

aienabled commented 4 years ago

Hello!

While investigating random timeout connection issues (on perfect connections) I've noticed that ReceivedPong could handle only the latest ping request (m_sentPingNumber), which is certainly not perfect in some cases https://github.com/lidgren/lidgren-network-gen3/blob/c18c8b96329772f49ec43362d4a219c1820870d6/Lidgren.Network/NetConnection.Latency.cs#L90

I've also noticed that NetConnection.ResetTimeout() call in the NetReliableSenderChannel to not working as expected...the client is dropping connection by timeout (set as 8 seconds) in some rare but 100% reproducible (for some clients) cases. I will keep investigating this issue.

Regards!

kenfalco commented 4 years ago

Hi, we are also investigating the connection timeout. Looking at the code, I noticed that the deadline is extended only when it receives the PONG from its PING (client). In my opinion it should also be extended in the following cases:

1) I get a PING from the server 2) I receive any packet from the server

What do you think?

P.S aienabled yes we have extended the deadline also within this if if ((byte) pongNumber! = (byte) m_sentPingNumber)

aienabled commented 4 years ago

@kenfalco

  1. sounds good.
  2. with this change I think we might get some corner cases when the bad connection barely can manage the traffic but doesn't get a timeout. BTW, for reliable data sender it seems to be already implemented here https://github.com/lidgren/lidgren-network-gen3/blob/c18c8b96329772f49ec43362d4a219c1820870d6/Lidgren.Network/NetReliableSenderChannel.cs#L217 (it's resetting the connection timeout when getting ACK under certain circumstances).

BTW, since reporting the issue I have not tried any changes in the ReceivedPong method. I'm considering trying them now but there is one observation that makes me confused. We have few users with ideal connections (ping around 30 ms and no packet loss, as indicated by WinMTR tool) which are getting disconnects consistently—often to all the servers we have around the world! So it doesn't seem to be relevant to the ping result handling as ping packets are sent relatively rarely and should be arriving in the correct order due to a low ping and no packet loss.

kenfalco commented 4 years ago

Hi, I checked and I have an old version of Lidgren (2014) and I didn't have this line of code.

m_connection.ResetTimeout (now);

I have analyzed the differences between my code and new Lidgren and there is this thing that I don't like:

lock (m_connections) { foreach (NetConnection conn in m_connections) { conn.Heartbeat(now, m_frameCounter); if (conn.m_status == NetConnectionStatus.Disconnected) { // // remove connection // m_connections.Remove(conn); m_connectionLookup.Remove(conn.RemoteEndPoint); break; // can't continue iteration here } } }

Why break if you remove a connection? For example, if you find a connection halfway through each time, it may take a long time to process some connections and maybe can time out.

This is the code in my version:

lock (m_connections) { for (int i = m_connections.Count - 1; i >= 0; i--) { var conn = m_connections[i]; conn.Heartbeat(now, m_frameCounter); if (conn.m_status == NetConnectionStatus.Disconnected) { // // remove connection // m_connections.RemoveAt(i); m_connectionLookup.Remove(conn.RemoteEndPoint); } } }

It seems more correct to me.

P.S We are investigating why with 11,000+ users at the same time we have group disconnections of about 1000-2000 users. we don't know if can be a hardware or software problem.

Online

aienabled commented 4 years ago

@kenfalco I've checked the master branch in this repository and there is no break in the connections heartbeat foreach loop https://github.com/lidgren/lidgren-network-gen3/blob/master/Lidgren.Network/NetPeer.Internal.cs#L374

I see, so your issue is different. Not some random user disconnecting but massive disconnects. I think this chart alone (online stats) is not enough to make an educated guess on what could be a culprit. The connections drop could be related to:

  1. performance issue in your application which is causing socket buffers overflow or processing thread freeze
  2. GC sudden spikes taking several seconds to complete
  3. ISP or routing issues—router failure, etc. It could be also related to anti-DDoS feature in the datacenter.

It's best to gather more metrics in your application (+GC events), OS, and hardware (CPU/RAM/network).

kenfalco commented 4 years ago

Yes you are right I was looking at another version of Lidgren.

We have analyzed lidgren's logs and the only message that often appears in the most busy hours is:

Socket exception: System.Net.Sockets.SocketException (0x80004005): The connection has been broken due to keep-alive activity detecting a failure while the operation was in progress at System.Net.Sockets.Socket.ReceiveFrom (Byte [] buffer, Int32 offset, Int32 size, SocketFlags socketFlags, EndPoint & remoteEP) at Lidgren.Network.NetPeer.Heartbeat () in D: \ Workswc \ CardGames \ trunk \ Online Payment Server \ Lidgren.Network \ NetPeer.Internal.cs: line 432

As soon as possible we will update Lidgren to the latest version.

aienabled commented 4 years ago

@kenfalco I've found a mention that it's related to the packet TTL expiration (even in the case of UDP packets):

For a datagram socket, this error indicates that the time to live has expired.

You can try increasing packet TTL—for example m_socket.Ttl = 255; put after this line: https://github.com/lidgren/lidgren-network-gen3/blob/3ab2d8d36867ccd9e10d7b1c61ed4ccc8ce5172f/Lidgren.Network/NetPeer.Internal.cs#L133 will increase it to max possible. I would also recommend temporarily relocating to a new hardware in another datacenter just to ensure the issue is not related to the hardware/network failure.

kenfalco commented 4 years ago

Hi, thanks for the advice i will try it. P.S I tried to download the latest version of Lidgren but it doesn't compile on IOS. First he gave this error:

https://github.com/lidgren/lidgren-network-gen3/issues/114

And I fixed it as they say at the end of the post.

Then he started giving this other ...

NullReferenceException: Object reference not set to an instance of an object. at Lidgren.Network.NetPeer.InitializeNetwork () [0x00000] in <00000000000000000000000000000000>: 0 at Lidgren.Network.NetPeer.Start () [0x00000] in <00000000000000000000000000000000>: 0 at Menu.Update () [0x00000] in <00000000000000000000000000000000>: 0

But is Unity + IOS no longer supported?

aienabled commented 4 years ago

@kenfalco I cannot provide a response as I'm not the library developer and not developing any applications for iOS. My contributor bage is only because I've submitted a pull requested and it was accepted.