Probes disconnect too often

jsdelivr / globalping

A global network of probes to run network tests like ping, traceroute and DNS resolve

https://www.jsdelivr.com/globalping


Probes disconnect too often #124

Closed jimaek closed 1 year ago

jimaek commented 2 years ago

Task to track the issue where probes that are far away from the EU, like those in China and Costa Rica, often disconnect and reconnect. Latency is only 256ms, so our previous timeout of 2 seconds, and now 4 seconds, should have been enough.

Currently we're trying https://github.com/uNetworking/uWebSockets.js

patrykcieszkowski commented 2 years ago

Both the server and the client track the ping/pong timeout. If the client doesn't receive the ping request in time (pingInterval), it closes the connection with a "ping timeout" error and disconnects. The server captures the disconnect and reports "transport close". On the flip side, if the server doesn't receive the pong response in time (pingTimeout), it reports "ping timeout", and the client sees it as a severed connection and reports a "transport close" error.

https://github.com/socketio/socket.io/issues/3191 https://github.com/socketio/socket.io/issues/4333

Either way, the issue is due to a timeout.
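
The pairing of reasons described above can be written down as a small lookup. This is only a sketch: `diagnose`, `Side`, and `Diagnosis` are hypothetical names, not part of socket.io, and it assumes the disconnect really was heartbeat-related (a "transport close" can also have other causes).

```typescript
// Hypothetical helper, not a socket.io API. It only encodes the rule from the
// comment above: the side that logs "ping timeout" is the side whose heartbeat
// timer expired; the other side just sees the connection go away.
type Side = "client" | "server";
type DisconnectReason = "ping timeout" | "transport close";

interface Diagnosis {
  detectedBy: Side;                 // side whose timer fired
  missedHeartbeat: "ping" | "pong"; // packet that did not arrive in time
}

function diagnose(reportedBy: Side, reason: DisconnectReason): Diagnosis {
  const otherSide: Side = reportedBy === "client" ? "server" : "client";
  // "ping timeout" is logged by the detecting side, "transport close" by its peer.
  const detectedBy: Side = reason === "ping timeout" ? reportedBy : otherSide;
  return {
    detectedBy,
    // The client waits for pings, the server waits for pongs.
    missedHeartbeat: detectedBy === "client" ? "ping" : "pong",
  };
}
```

For example, a server log line with "transport close" would map to `diagnose("server", "transport close")`, which points at the client having missed a ping.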

jimaek commented 2 years ago

Small summary:

Some probes are just far away and should still work correctly.
TCP + keep-alive + pings should more or less handle those issues without constant reconnections.
The current 4000ms timeout with 2 retries seems too high. EU to Costa Rica is only 250ms latency.
We need to disconnect a probe only if we are certain it has real internet issues and can't process real tests. That's why a lower timeout is needed.
Assume that in the near future we will get A LOT of low-quality probes. The system should be able to handle them gracefully without reconnecting every 10 seconds.
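
To make the trade-off in the summary above concrete, here is a minimal sketch of "disconnect only after several consecutive missed heartbeats". `MissCounter` and `detectionDelayMs` are hypothetical names and the thresholds are illustrative; this is not how globalping or socket.io implement it.

```typescript
// Hypothetical sketch: drop a probe only after `maxMisses` heartbeats in a row
// were late or lost, so a single slow reply does not cause a reconnect.
class MissCounter {
  private consecutiveMisses = 0;
  private readonly timeoutMs: number; // deadline for one heartbeat reply
  private readonly maxMisses: number; // misses in a row before disconnecting

  constructor(timeoutMs: number, maxMisses: number) {
    this.timeoutMs = timeoutMs;
    this.maxMisses = maxMisses;
  }

  // Record one heartbeat result; replyMs is null when no reply arrived.
  // Returns true when the probe should be disconnected.
  record(replyMs: number | null): boolean {
    const missed = replyMs === null || replyMs > this.timeoutMs;
    this.consecutiveMisses = missed ? this.consecutiveMisses + 1 : 0;
    return this.consecutiveMisses >= this.maxMisses;
  }
}

// Worst-case time before a truly dead probe is dropped.
function detectionDelayMs(timeoutMs: number, maxMisses: number): number {
  return timeoutMs * maxMisses;
}
```

With the values from the thread, `detectionDelayMs(4000, 2)` is 8000ms, which is large next to a 250ms latency. A lower per-heartbeat timeout shortens that delay, while the consecutive-miss rule still tolerates an occasional spike.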

MartinKolarik commented 1 year ago

Not sure how relevant this still is.

jimaek commented 1 year ago

It is, very much so. We just need to set up proper logging first to see all the problems.

alexey-yarmosh commented 1 year ago

So I've managed to reproduce the issue using the https://github.com/tylertreat/comcast tool. Ping timeouts for my local probe occur consistently with GPRS network quality and sometimes with EDGE (https://github.com/tylertreat/comcast#network-condition-profiles).

I've tried switching socket.io transport from 'websocket' to 'polling' and different combinations of that, but nothing changed.

One of the faulty probes is located on a VPS that we own. It disconnects ~10 times an hour and reconnects within a few seconds. The VPS seems slow and unresponsive over SSH. Network speed tests show a max latency of 1000ms, and throughput seems to drop regularly for a few seconds. All of that points to network problems on the server side that we are not able to deal with.

So I believe we should accept that the total number of effective probes will vary all the time. We are adding monitoring to see how many disconnects happen in a given time interval. On top of that, we should add a mechanism to explicitly "ignore" faulty probes. It can be a manual list, a circuit breaker, or something else. Here is the issue for that: https://github.com/jsdelivr/globalping/issues/52
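
One possible shape for the monitoring plus circuit-breaker idea, as a rough sketch only: count disconnects per probe in a sliding time window and ignore probes that exceed a threshold. `FlapDetector`, its methods, and the thresholds are assumptions for illustration, not the design chosen in the linked issue.

```typescript
// Hypothetical sketch: sliding-window disconnect counter per probe.
class FlapDetector {
  private readonly events = new Map<string, number[]>(); // probeId -> disconnect times (ms)
  private readonly windowMs: number;       // length of the observation window
  private readonly maxDisconnects: number; // disconnects in the window that trip the breaker

  constructor(windowMs: number, maxDisconnects: number) {
    this.windowMs = windowMs;
    this.maxDisconnects = maxDisconnects;
  }

  // Keep only the disconnects that are still inside the window at nowMs.
  private recent(probeId: string, nowMs: number): number[] {
    return (this.events.get(probeId) ?? []).filter((t) => nowMs - t < this.windowMs);
  }

  // Record a disconnect and return the count inside the window (the monitoring number).
  recordDisconnect(probeId: string, nowMs: number): number {
    const times = this.recent(probeId, nowMs);
    times.push(nowMs);
    this.events.set(probeId, times);
    return times.length;
  }

  // True while the probe should be skipped for new tests.
  isIgnored(probeId: string, nowMs: number): boolean {
    return this.recent(probeId, nowMs).length >= this.maxDisconnects;
  }
}
```

A probe that disconnects ~10 times an hour would trip `new FlapDetector(60 * 60 * 1000, 5)` within the hour and become eligible again once its old disconnects age out of the window.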