egrigore opened 4 years ago
This looks like your coordinator is not able to ping your exporters. Please check on the coordinator side whether the pings are sent out to the exporters.
A possible reason would be that your coordinator is older than your exporters.
The coordinator was updated at the same time as the exporters. I am trying to isolate the coordinator log entries for that one exporter, but I can't, since the rest of the exporters also log there.
It seems the coordinator responds - there is no failure or timeout.
We will try to switch back to the previous version.
Note that this won't resolve your problem long-term. Please try to extract a usable coordinator and exporter log or, if prohibited by corporate policies, try to build a minimal reproduction case.
We have switched only the exporters back to the previous version (the coordinator still runs the latest), and they have been stable for at least 3 hours.
I will try to create a minimal reproduction case. Could it be a configuration that is not compatible with the latest changes?
The exporter configuration did not change between older versions. What changed is that the crossbar configuration file location was moved, see c8756f2. For production deployments you should copy the configuration file and adjust it to your environment. To me this sounds more like a firewalling issue, but YMMV. Note that for the exporter ping changes, crossbar and autobahn need to be updated to 19.11.0.
Hi @Emantor, sorry for the late update; I haven't had time to debug this further, and we are still on the backup branch for the exporters. Thank you for your suggestions. Our versions are:
Crossbar.io       : 19.11.1
txaio             : 20.4.1
Autobahn          : 19.11.1
UTF8 Validator    : autobahn
XOR Masker        : autobahn
JSON Codec        : stdlib
MsgPack Codec     : msgpack-0.6.2
CBOR Codec        : cbor-1.0.0
UBJSON Codec      : ubjson-0.15.0
FlatBuffers       : flatbuffers-?.?.?
Twisted           : 20.3.0-EPollReactor
LMDB              : 0.98/lmdb-0.9.22
Python            : 3.6.8/CPython
Frozen executable : no
Operating system  : Linux-4.15.0-72-generic-x86_64-with-Ubuntu-18.04-bionic
Host machine      : x86_64
Anyway, I will try to find some time to recreate the issue in a smaller environment, and maybe engage IT to check whether any traffic is being blocked between exporter and coordinator.
What we have seen a lot is that the exporter/coordinator VM is I/O-blocked for some reason, which leads to the coordinator not being able to respond in time. In our case we investigated this using ioping and logged the output to a file. This showed that under certain circumstances we had I/O ping times exceeding half a minute. Unfortunately, the WebSocket connections are quite sensitive to this kind of latency.
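If ioping is not available, a rough equivalent can be scripted by timing a write plus fsync, which is the kind of request ioping issues by default. This is a minimal sketch, not what we used; the 4 KiB request size and the function name are assumptions for illustration:

```python
import os
import tempfile
import time

def disk_write_latency(path=None, count=5):
    """Measure write+fsync latency in milliseconds, similar in spirit
    to ioping's default write test. Returns one value per request.

    path: directory to test (defaults to the system temp directory).
    """
    latencies = []
    # Create a scratch file in the directory under test
    fd, tmp = tempfile.mkstemp(dir=path)
    try:
        for _ in range(count):
            start = time.monotonic()
            os.write(fd, b"x" * 4096)  # 4 KiB request, ioping's default size
            os.fsync(fd)               # force the write down to the device
            latencies.append((time.monotonic() - start) * 1000.0)
    finally:
        os.close(fd)
        os.unlink(tmp)
    return latencies
```

Appending these values to a timestamped log file over a day or two would show whether multi-second I/O stalls line up with the exporter disconnects.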
Hi,
We rebased to the latest code in January and hit the same problem, but we changed "age > 300" to "age > 400" in labgrid/remote/exporter.py, and since then the exporters have been stable in this respect (some of them still fail rarely with "missed checkpoint, exiting (last was 400.20326938666403 seconds ago)"). Can this affect anything else, or is it OK to keep it like this?
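The pattern being tuned here can be sketched roughly as follows. This is a simplified illustration of a checkpoint watchdog, not labgrid's actual exporter code; the class name and the check interval are invented:

```python
import asyncio
import time

CHECKPOINT_AGE_LIMIT = 400.0  # seconds; the value raised from 300 above

class CheckpointWatchdog:
    """Exit when the coordinator has not acknowledged a checkpoint
    for longer than the configured limit (hypothetical sketch)."""

    def __init__(self, limit=CHECKPOINT_AGE_LIMIT):
        self.limit = limit
        self.last = time.monotonic()

    def checkpoint(self):
        # Called whenever the coordinator responds successfully
        self.last = time.monotonic()

    def age(self):
        return time.monotonic() - self.last

    async def watch(self, interval=1.0):
        # Periodically compare the checkpoint age against the limit
        while True:
            await asyncio.sleep(interval)
            if self.age() > self.limit:
                raise SystemExit(
                    f"missed checkpoint, exiting "
                    f"(last was {self.age()} seconds ago)")
```

Raising the limit only widens the tolerance for slow coordinator responses; it does not address whatever is delaying them, which is why the question about side effects is a fair one.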
Did this happen only on single exporters at a time or multiple at once?
Could you check that your network connection between exporter and coordinator is stable? (This was the cause in our lab for some cases where this triggered).
After I increased the timeout to 400 it happened very rarely (5-6 times per exporter in 6 weeks, not all at once, and half of those did coincide with IT network downtime). When it was set to 300, it happened every 10 minutes for each exporter, regardless of their geographic location.
Now on some of them we see only the message below (without the missed-checkpoint part):
Feb 15 18:02:27 exporter service_start_exporter.sh[25258]: 2021-02-15T18:02:26 dropping connection to peer tcp4:<coordinator-IP>:20408 with abort=True: WebSocket ping timeout (peer did not respond with pong in time)
Feb 15 18:02:27 exporter service_start_exporter.sh[25258]: connection lost
Feb 15 18:02:27 exporter service_start_exporter.sh[25258]: 2021-02-15T18:02:26 session closed with reason wamp.close.transport_lost [WAMP transport was lost without closing the session 6852899289096070 before]
Feb 15 18:02:27 exporter systemd[1]: exporter.service: Main process exited, code=exited, status=100/n/a
Feb 15 18:02:27 exporter systemd[1]: exporter.service: Failed with result 'exit-code'.
For this exporter I have enabled the --debug option of labgrid-exporter to see whether we can learn more when the above happens. And for all the exporters, we now measure the coordinator's ping response time before the exporter starts, to monitor whether there are changes in that as well.
If you have other suggestions on how to monitor the connection between coordinator and exporter, let me know.
Is this still happening on current master?
Hi, I haven't seen this error since we upgraded to the latest labgrid version. We kept the age at 400 in exporter.py (maybe this is a value that could be made configurable?). I think we can close this ticket for now.
Hi,
We rebased to the labgrid version from 31 March: b0cc54dc2e2c9cf94c9c98e22f00a150ce475597. All of our exporters are set up as services; since the rebase they have been restarting every 10 minutes with reason "missed checkpoint, exiting". There are many changes in exporter.py made in January. Is there any configuration we have to make?
Below is the log of an exporter from when it starts to when it stops: