scottyeager opened 1 month ago
Here's another case that's slightly different but, I think, still fits the pattern. Node id is 6774. This happens during a farmerbot wake-up.
Uptime is logged when the node boots up:
[+] powerd: 2024-05-16T08:33:54Z info node uptime hash hash=0x6d19f9aa06baa2b57e22ba35c6f071b5a8ce2e4e5b8ee0a9173e44d9b3db335a
We'd expect an uptime report about 40 minutes later, but there's no tfchain connection attempt corresponding to the expected timestamp:
[+] noded: 2024-05-16T09:11:06Z info got power target change event node=6774
[+] redis: 3066:M 16 May 2024 09:11:01.196 * Background saving terminated with success
[+] redis: 9854:C 16 May 2024 09:11:01.098 * Fork CoW for RDB: current 0 MB, peak 0 MB, average 0 MB
[+] redis: 9854:C 16 May 2024 09:11:01.098 * DB saved on disk
[+] redis: 3066:M 16 May 2024 09:11:01.096 * Background saving started by pid 9854
[+] redis: 3066:M 16 May 2024 09:11:01.095 * 100 changes in 300 seconds. Saving...
[+] noded: 2024-05-16T09:11:00Z info got power target change event node=6593
[+] noded: 2024-05-16T09:10:54Z info got power target change event node=6898
[+] noded: 2024-05-16T09:10:48Z info got power target change event node=6799
[+] noded: 2024-05-16T09:10:42Z info got power target change event node=6752
[-] powerd: 2024/05/16 09:10:36 Connecting to wss://03.tfchain.grid.tf/...
[+] noded: 2024-05-16T09:10:36Z info got power target change event node=6773
[+] powerd: 2024-05-16T09:10:30Z info powering on node node=6899 reason="target is up"
[+] powerd: 2024-05-16T09:10:30Z info received an event to power up target=6899
[-] powerd: 2024/05/16 09:10:30 Connecting to wss://04.tfchain.grid.tf/...
[+] noded: 2024-05-16T09:10:30Z info got power target change event node=6899
[+] powerd: 2024-05-16T09:09:22Z info powering on node node=6472 reason="target is up"
[+] powerd: 2024-05-16T09:09:22Z info received an event to power up target=6472
[-] powerd: 2024/05/16 09:09:22 Connecting to wss://02.tfchain.grid.tf/...
[+] powerd: 2024-05-16T09:09:16Z info powering on node node=2991 reason="target is up"
[+] powerd: 2024-05-16T09:09:16Z info received an event to power up target=2991
[-] powerd: 2024/05/16 09:09:16 Connecting to wss://tfchain.grid.tf/...
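To pin down when the missing report was due, the expected timestamp can be computed from the boot-time report above. This is just arithmetic on the log timestamps; the 40-minute interval is the reporting cadence described in this issue, not something read out of the zos source.

```python
from datetime import datetime, timedelta

# Boot-time uptime report timestamp from the powerd log above (UTC).
boot_report = datetime.fromisoformat("2024-05-16T08:33:54")

# Assumed reporting cadence: the ~40 minute interval mentioned above.
REPORT_INTERVAL = timedelta(minutes=40)

expected = boot_report + REPORT_INTERVAL
print(expected.isoformat())  # 2024-05-16T09:13:54
```

That puts the expected report at 09:13:54 UTC, shortly after the newest entry in the excerpt (09:11:06), which is why the absence of any matching connection attempt stands out.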
When the node does finally send an uptime report before shutting down, it cycles through the tfchain endpoints without errors except on the usual suspect, 03 (behavior I also hadn't seen before):
[-] powerd: 2024/05/16 09:40:48 Connecting to wss://04.tfchain.grid.tf/...
[-] powerd: 2024/05/16 09:40:42 Connecting to wss://02.tfchain.grid.tf/...
[+] powerd: 2024-05-16T09:40:42Z info shutting down node because of chain
[+] powerd: 2024-05-16T09:40:37Z info setting node power state state=false
[-] powerd: 2024/05/16 09:40:37 Connecting to wss://tfchain.grid.tf/...
[+] powerd: 2024-05-16T09:40:37Z error failed to connect to endpoint, retrying error="error connecting to substrate at 'wss://03.tfchain.grid.tf/': context deadline exceeded"
[-] powerd: 2024/05/16 09:40:27 Connecting to wss://03.tfchain.grid.tf/...
[-] powerd: 2024/05/16 09:40:26 Connecting to wss://04.tfchain.grid.tf/...
[-] powerd: 2024/05/16 09:40:26 Connecting to wss://02.tfchain.grid.tf/...
[-] powerd: 2024/05/16 09:40:25 Connecting to wss://tfchain.grid.tf/...
[+] powerd: 2024-05-16T09:40:25Z error failed to connect to endpoint, retrying error="error connecting to substrate at 'wss://03.tfchain.grid.tf/': context deadline exceeded"
[+] powerd: 2024-05-16T09:40:18Z info node uptime hash hash=0x9c6b0e306a8cb57c368e28f77820b1a30823d2f22a3fcd77aae42a4115a0a731
[-] powerd: 2024/05/16 09:40:15 Connecting to wss://03.tfchain.grid.tf/...
[-] powerd: 2024/05/16 09:40:15 Connecting to wss://04.tfchain.grid.tf/...
[-] powerd: 2024/05/16 09:40:14 Connecting to wss://02.tfchain.grid.tf/...
[-] powerd: 2024/05/16 09:40:14 Connecting to wss://tfchain.grid.tf/...
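The failover behavior in this log (every endpoint connects cleanly except 03, which hits its context deadline) can be illustrated with a small probe loop. This is a sketch of the pattern only, not the actual powerd/substrate client code; the endpoint list is taken from the log above, and `fake_connect` is a hypothetical stand-in for the real websocket dial.

```python
# Endpoints cycled through in the powerd log above.
ENDPOINTS = [
    "wss://tfchain.grid.tf/",
    "wss://02.tfchain.grid.tf/",
    "wss://03.tfchain.grid.tf/",
    "wss://04.tfchain.grid.tf/",
]

def probe_endpoints(connect):
    """Try each endpoint in turn and record whether it was reachable.

    `connect` is a stand-in for the real dial; it should raise
    TimeoutError on a deadline, mirroring "context deadline exceeded".
    """
    status = {}
    for url in ENDPOINTS:
        try:
            connect(url)
            status[url] = "ok"
        except TimeoutError as exc:
            status[url] = f"failed: {exc}"
    return status

# Simulate what the log shows: only 03 times out.
def fake_connect(url):
    if url.startswith("wss://03."):
        raise TimeoutError("context deadline exceeded")

print(probe_endpoints(fake_connect))
```

With that simulation, only `wss://03.tfchain.grid.tf/` is reported as failed, matching the pattern of errors in the excerpt.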
We've had a couple of reports from different farmers that their nodes are showing brief periods of downtime without a clear cause.
Example 1
Here's one example (times are CEST in this case):
And here is a selection of relevant log entries for node 5158:
Example 2
Here's another example (EDT this time):
Logs:
I've included the logs from provisiond here to show that the node was successfully communicating with tfchain at nearly the same moment it tried and failed to submit its uptime report.
Example 3
Here's one more example (PDT here):
This seems to be widespread. The last example is happening to my own node as I write this. I haven't noticed any network outages on my uplink, nor has anything changed with this node.