scottyeager opened 1 month ago
Here's another case that's slightly different but, I think, still fits the pattern. Node id is 6774. This happens during a farmerbot wake-up.
Uptime is logged when the node boots up:
[+] powerd: 2024-05-16T08:33:54Z info node uptime hash hash=0x6d19f9aa06baa2b57e22ba35c6f071b5a8ce2e4e5b8ee0a9173e44d9b3db335a
We'd expect an uptime report about 40 minutes later, but there's no tfchain connection attempt corresponding to the expected timestamp:
[+] noded: 2024-05-16T09:11:06Z info got power target change event node=6774
[+] redis: 3066:M 16 May 2024 09:11:01.196 * Background saving terminated with success
[+] redis: 9854:C 16 May 2024 09:11:01.098 * Fork CoW for RDB: current 0 MB, peak 0 MB, average 0 MB
[+] redis: 9854:C 16 May 2024 09:11:01.098 * DB saved on disk
[+] redis: 3066:M 16 May 2024 09:11:01.096 * Background saving started by pid 9854
[+] redis: 3066:M 16 May 2024 09:11:01.095 * 100 changes in 300 seconds. Saving...
[+] noded: 2024-05-16T09:11:00Z info got power target change event node=6593
[+] noded: 2024-05-16T09:10:54Z info got power target change event node=6898
[+] noded: 2024-05-16T09:10:48Z info got power target change event node=6799
[+] noded: 2024-05-16T09:10:42Z info got power target change event node=6752
[-] powerd: 2024/05/16 09:10:36 Connecting to wss://03.tfchain.grid.tf/...
[+] noded: 2024-05-16T09:10:36Z info got power target change event node=6773
[+] powerd: 2024-05-16T09:10:30Z info powering on node node=6899 reason="target is up"
[+] powerd: 2024-05-16T09:10:30Z info received an event to power up target=6899
[-] powerd: 2024/05/16 09:10:30 Connecting to wss://04.tfchain.grid.tf/...
[+] noded: 2024-05-16T09:10:30Z info got power target change event node=6899
[+] powerd: 2024-05-16T09:09:22Z info powering on node node=6472 reason="target is up"
[+] powerd: 2024-05-16T09:09:22Z info received an event to power up target=6472
[-] powerd: 2024/05/16 09:09:22 Connecting to wss://02.tfchain.grid.tf/...
[+] powerd: 2024-05-16T09:09:16Z info powering on node node=2991 reason="target is up"
[+] powerd: 2024-05-16T09:09:16Z info received an event to power up target=2991
[-] powerd: 2024/05/16 09:09:16 Connecting to wss://tfchain.grid.tf/...
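To pin down when the missing report was due, the expected timestamp can be computed from the boot-time report above. This is just arithmetic on the log timestamps; the 40-minute interval is the reporting cadence described in this issue, not something read out of the zos source.

```python
from datetime import datetime, timedelta

# Boot-time uptime report timestamp from the powerd log above (UTC).
boot_report = datetime.fromisoformat("2024-05-16T08:33:54")

# Assumed reporting cadence: the ~40 minute interval mentioned above.
REPORT_INTERVAL = timedelta(minutes=40)

expected = boot_report + REPORT_INTERVAL
print(expected.isoformat())  # 2024-05-16T09:13:54
```

That puts the expected report at 09:13:54 UTC, shortly after the newest entry in the excerpt (09:11:06), which is why the absence of any matching connection attempt stands out.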
When the node does finally send an uptime report before shutting down, it cycles through the tfchain endpoints without errors except on the usual suspect, 03 (behavior I also hadn't seen before):
[-] powerd: 2024/05/16 09:40:48 Connecting to wss://04.tfchain.grid.tf/...
[-] powerd: 2024/05/16 09:40:42 Connecting to wss://02.tfchain.grid.tf/...
[+] powerd: 2024-05-16T09:40:42Z info shutting down node because of chain
[+] powerd: 2024-05-16T09:40:37Z info setting node power state state=false
[-] powerd: 2024/05/16 09:40:37 Connecting to wss://tfchain.grid.tf/...
[+] powerd: 2024-05-16T09:40:37Z error failed to connect to endpoint, retrying error="error connecting to substrate at 'wss://03.tfchain.grid.tf/': context deadline exceeded"
[-] powerd: 2024/05/16 09:40:27 Connecting to wss://03.tfchain.grid.tf/...
[-] powerd: 2024/05/16 09:40:26 Connecting to wss://04.tfchain.grid.tf/...
[-] powerd: 2024/05/16 09:40:26 Connecting to wss://02.tfchain.grid.tf/...
[-] powerd: 2024/05/16 09:40:25 Connecting to wss://tfchain.grid.tf/...
[+] powerd: 2024-05-16T09:40:25Z error failed to connect to endpoint, retrying error="error connecting to substrate at 'wss://03.tfchain.grid.tf/': context deadline exceeded"
[+] powerd: 2024-05-16T09:40:18Z info node uptime hash hash=0x9c6b0e306a8cb57c368e28f77820b1a30823d2f22a3fcd77aae42a4115a0a731
[-] powerd: 2024/05/16 09:40:15 Connecting to wss://03.tfchain.grid.tf/...
[-] powerd: 2024/05/16 09:40:15 Connecting to wss://04.tfchain.grid.tf/...
[-] powerd: 2024/05/16 09:40:14 Connecting to wss://02.tfchain.grid.tf/...
[-] powerd: 2024/05/16 09:40:14 Connecting to wss://tfchain.grid.tf/...
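The failover behavior in this log (every endpoint connects cleanly except 03, which hits its context deadline) can be illustrated with a small probe loop. This is a sketch of the pattern only, not the actual powerd/substrate client code; the endpoint list is taken from the log above, and `fake_connect` is a hypothetical stand-in for the real websocket dial.

```python
# Endpoints cycled through in the powerd log above.
ENDPOINTS = [
    "wss://tfchain.grid.tf/",
    "wss://02.tfchain.grid.tf/",
    "wss://03.tfchain.grid.tf/",
    "wss://04.tfchain.grid.tf/",
]

def probe_endpoints(connect):
    """Try each endpoint in turn and record whether it was reachable.

    `connect` is a stand-in for the real dial; it should raise
    TimeoutError on a deadline, mirroring "context deadline exceeded".
    """
    status = {}
    for url in ENDPOINTS:
        try:
            connect(url)
            status[url] = "ok"
        except TimeoutError as exc:
            status[url] = f"failed: {exc}"
    return status

# Simulate what the log shows: only 03 times out.
def fake_connect(url):
    if url.startswith("wss://03."):
        raise TimeoutError("context deadline exceeded")

print(probe_endpoints(fake_connect))
```

With that simulation, only `wss://03.tfchain.grid.tf/` is reported as failed, matching the pattern of errors in the excerpt.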
We've had a couple of reports from different farmers that their nodes are showing brief periods of downtime without a clear cause.
Example 1
Here's one example (times are CEST in this case):
And here is a selection of relevant log entries for node 5158:
Example 2
Here's another example (EDT this time):
Logs:
I've included the logs from provisiond here to show that the node was successfully communicating with tfchain at nearly the same moment it tried and failed to submit its uptime report.
Example 3
Here's one more example (PDT here):
This seems to be widespread. The last example is happening to my own node as I write this. I haven't noticed any network outages on my uplink, nor has anything changed with this node.