jsdelivr / globalping

A global network of probes to run network tests like ping, traceroute and DNS resolve
https://www.jsdelivr.com/globalping

Probes disconnect too often #124

Closed jimaek closed 1 year ago

jimaek commented 2 years ago

Task to track the issue where probes located far from the EU, such as those in China and Costa Rica, often disconnect and reconnect. Latency is only 256ms, so our previous timeout of 2 seconds, and the current one of 4 seconds, should have been enough.

Currently we're trying https://github.com/uNetworking/uWebSockets.js

patrykcieszkowski commented 2 years ago

Both the server and the client track the ping/pong timeout. If the client doesn't receive the ping request in time (pingInterval), it closes the connection with a 'ping timeout' error and disconnects. The server captures the disconnect and reports 'transport close'. On the flip side, if the server doesn't receive the pong response in time (pingTimeout), it reports 'ping timeout', and the client sees a severed connection and reports a 'transport close' error.

https://github.com/socketio/socket.io/issues/3191 https://github.com/socketio/socket.io/issues/4333

Either way, the issue is due to a timeout.
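
As a rough illustration of the two failure paths described above, here is a minimal TypeScript sketch. It is not socket.io's implementation: `HeartbeatWatchdog` and `closeReasons` are hypothetical names, and the 4000 ms allowance simply reuses the timeout value mentioned earlier in this thread.

```typescript
// Hypothetical sketch, not socket.io code: each side keeps a watchdog that
// expires when the expected heartbeat packet does not arrive in time.
type Side = "client" | "server";

class HeartbeatWatchdog {
  private lastSeenMs: number;
  private readonly allowedGapMs: number;

  constructor(allowedGapMs: number, nowMs: number) {
    this.allowedGapMs = allowedGapMs;
    this.lastSeenMs = nowMs;
  }

  // Call whenever the expected packet (ping or pong) arrives.
  beat(nowMs: number): void {
    this.lastSeenMs = nowMs;
  }

  // True once the gap since the last packet exceeds the allowance.
  isExpired(nowMs: number): boolean {
    return nowMs - this.lastSeenMs > this.allowedGapMs;
  }
}

// Close reasons each side reports, depending on whose watchdog fired first
// (taken from the description in this thread).
function closeReasons(detectedBy: Side): { client: string; server: string } {
  return detectedBy === "client"
    ? { client: "ping timeout", server: "transport close" }
    : { client: "transport close", server: "ping timeout" };
}

// Assumed scenario: 4000 ms allowance, last packet seen at t = 256 ms.
const watchdog = new HeartbeatWatchdog(4000, 0);
watchdog.beat(256);
console.log(watchdog.isExpired(3256)); // false: a 3 s gap is within the allowance
console.log(watchdog.isExpired(5256)); // true: a 5 s stall exceeds it
console.log(closeReasons("server"));
```

The point of the sketch is that round-trip latency alone (256ms) never trips a 4 second allowance. Only a stall lasting longer than the allowance does, which matches a link that freezes for several seconds rather than one that is merely slow.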

jimaek commented 2 years ago

Small summary:

MartinKolarik commented 1 year ago

Not sure how relevant this still is.

jimaek commented 1 year ago

It is, very much so. We just need to set up proper logging first to see all the problems.

alexey-yarmosh commented 1 year ago

So I've managed to reproduce the issue using the https://github.com/tylertreat/comcast tool. Ping timeouts for my local probe occur consistently on the GPRS network profile and sometimes on the EDGE profile (https://github.com/tylertreat/comcast#network-condition-profiles).

I've tried switching the socket.io transport from 'websocket' to 'polling', as well as different combinations of the two, but nothing changed.

One of the faulty probes is located on a VPS that we own. It disconnects ~10 times an hour and reconnects within a few seconds. The VPS seems slow and unresponsive over ssh. Network speed tests show a max latency of 1000ms, and throughput seems to drop regularly for a few seconds. All of that points to network problems on that server that we are not able to deal with.

So I believe we should accept that the total number of effective probes will vary all the time. We are adding monitoring to see how many disconnects happen in a given time interval. On top of that, we should add a mechanism to explicitly "ignore" faulty probes. It can be a manual list, a circuit breaker, or something else. Here is the issue for that: https://github.com/jsdelivr/globalping/issues/52
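
To make the monitoring and "ignore" idea a bit more concrete, here is one possible shape for it in TypeScript: a sliding-window disconnect counter with a simple threshold. This is only a sketch under assumptions. `DisconnectTracker`, the probe id, the one-hour window, and the threshold of 10 (borrowed from the "~10 times an hour" figure above) are all hypothetical, not part of the Globalping codebase.

```typescript
// Hypothetical sketch: count disconnects per probe inside a sliding window
// and flag probes that reach a threshold, so they can be skipped.
class DisconnectTracker {
  private readonly events = new Map<string, number[]>();
  private readonly windowMs: number;
  private readonly maxDisconnects: number;

  constructor(windowMs: number, maxDisconnects: number) {
    this.windowMs = windowMs;
    this.maxDisconnects = maxDisconnects;
  }

  // Record one disconnect and drop timestamps that fell out of the window.
  recordDisconnect(probeId: string, nowMs: number): void {
    const recent = this.recentEvents(probeId, nowMs);
    recent.push(nowMs);
    this.events.set(probeId, recent);
  }

  // Number of disconnects seen during the last windowMs milliseconds.
  countInWindow(probeId: string, nowMs: number): number {
    return this.recentEvents(probeId, nowMs).length;
  }

  // A probe is ignored while its recent disconnect count is at the threshold.
  isIgnored(probeId: string, nowMs: number): boolean {
    return this.countInWindow(probeId, nowMs) >= this.maxDisconnects;
  }

  private recentEvents(probeId: string, nowMs: number): number[] {
    const all = this.events.get(probeId) ?? [];
    return all.filter((t) => nowMs - t < this.windowMs);
  }
}

// Assumed values: one-hour window, ignore at 10 disconnects.
const MINUTE_MS = 60_000;
const tracker = new DisconnectTracker(60 * MINUTE_MS, 10);
for (let i = 0; i < 10; i++) {
  tracker.recordDisconnect("probe-vps", i * 6 * MINUTE_MS); // every 6 minutes
}
console.log(tracker.isIgnored("probe-vps", 55 * MINUTE_MS)); // true
console.log(tracker.isIgnored("probe-vps", 120 * MINUTE_MS)); // false
```

Because old timestamps age out of the window, a probe that stabilizes becomes eligible again on its own, which is roughly the behavior a circuit breaker would give without needing a manually maintained list.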