threefoldtech / zos

Autonomous operating system
https://threefold.io/host/
Apache License 2.0

Uptime reporting sometimes fails with TF Chain error and is not retried #2268

Open · scottyeager opened this issue 5 months ago

scottyeager commented 5 months ago

I have noticed that some uptime report extrinsics submitted by nodes fail with an error generated by the TF Chain runtime, and Zos does not seem to detect or retry them.

Here's an example.

Node 3679 attempted to report uptime at block 11421695.

[screenshot: the uptime report extrinsic at block 11421695]

Checking the events and extrinsics view, we can ascertain that this call was rejected with an InvalidTimestampHint error. The timestamp hint matches the timestamp from the log file, but this block was not generated until about a minute later. I guess this transaction was sitting in the mempool too long.
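For illustration, here is a minimal sketch of the kind of check that would produce this error. This is not the actual tfchain runtime code (that lives in the Rust pallet); the function name and the one minute window are assumptions chosen to match the behaviour observed above: a hint that was valid when the extrinsic was signed becomes invalid once the including block is produced a minute later.

```go
package main

import (
	"fmt"
	"time"
)

// hintWindow is an assumed acceptance window around the block timestamp;
// the real value is defined by the tfchain runtime.
const hintWindow = 60 * time.Second

// validTimestampHint is a hypothetical stand-in for the runtime check:
// the hint must be close enough to the timestamp of the block that
// finally includes the extrinsic.
func validTimestampHint(hint, blockTime time.Time) bool {
	diff := blockTime.Sub(hint)
	if diff < 0 {
		diff = -diff
	}
	return diff <= hintWindow
}

func main() {
	hint := time.Now()
	// The extrinsic sits in the mempool, so the including block's
	// timestamp drifts past the window and the call is rejected.
	blockTime := hint.Add(70 * time.Second)
	fmt.Println(validTimestampHint(hint, blockTime)) // false
}
```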

From the perspective of Zos logs, this looks like any other successful uptime report:

[screenshot: Zos log of the uptime report]

We can see from there that the node continued on its usual cycle and did another uptime report about 40 minutes later that went through fine.

While this is a rare issue with low impact (potentially 40 minutes of lost uptime), the node appears to be working properly yet gets penalized due to details of how tfchain operates.

muhamadazmy commented 4 months ago

That's a good catch. I can't even think how you were able to spot this one.

I looked into it, and at first glance there is no clear explanation of how this happened, but all the inputs are correct.

Also, some notes about the implementation:

scottyeager commented 4 months ago

I can't even think how you were able to spot this one.

I'll spare the details, but let's just say it was a productive coincidence 🙂

  • We only treat the InBlock state of a transaction as success (which comes before IsFinalized). This raises the question of whether transactions in the mempool are reported as InBlock (I honestly think not, and Lee also thinks this is not true).

So I found some reference material about these states here. The relevant excerpt:

InBlock does not mean the block is finalized, but rather applies to the transaction state, where Finalized means that the transaction cannot be forked off the chain.

To me that's not 100% clear by itself, but here's another clue.

The retracted state means that the transaction was included into a block X. This block X was seen as best block of the chain. At some later point there was another block Y imported as best block and this block Y is a fork of block X. So, the transaction pool sends the retracted event for all the transactions included in block X as they are not part of the best chain anymore.

I think this could explain what we're seeing. The transaction was included in a block, but that block ended up discarded when the chain reorganized to favor a fork without that block. This could help explain the one minute delay in the transaction getting finalized (the part of the network with the actual best chain didn't know about this transaction yet).
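If that is what happened, the status subscription should actually surface it. A minimal sketch, assuming the centrifuge go-substrate-rpc-client/v4 types that Zos builds on: a watcher that stops as soon as it sees InBlock never observes a later Retracted (or Dropped/Invalid/Usurped) update, while one that keeps reading until a terminal state would notice that the report needs to be resubmitted. The classification below is illustrative, not the current Zos code.

```go
package uptime

import "github.com/centrifuge/go-substrate-rpc-client/v4/types"

type outcome int

const (
	pending  outcome = iota // transient state, keep watching
	success                 // safe to treat the report as delivered
	resubmit                // the extrinsic fell out of the chain
)

// classify maps a status update from the extrinsic subscription to what a
// robust watcher could do with it. Returning success on InBlock (as described
// above) means a later Retracted update is never seen.
func classify(st types.ExtrinsicStatus) outcome {
	switch {
	case st.IsFinalized:
		return success
	case st.IsRetracted, st.IsUsurped, st.IsDropped, st.IsInvalid, st.IsFinalityTimeout:
		return resubmit
	default: // Ready, Broadcast, InBlock, ...
		return pending
	}
}
```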

muhamadazmy commented 4 months ago

Yeah, I understand what you're trying to say. And if that is the case, I don't believe there is anything we can do except hope that the next uptime report is included. Note that we don't lose any uptime unless the node dies before it sends the next report, because once the next uptime report is sent it covers the full period the node was up, so minting will know that the node was actually up during this time even if one uptime report is missing.

scottyeager commented 4 months ago

Note that we don't lose any uptime unless the node dies before it sends the next report, because once the next uptime report is sent it covers the full period the node was up, so minting will know that the node was actually up during this time even if one uptime report is missing.

This actually isn't true anymore. In more recent versions of minting, uptime is only credited if nodes report regularly every 40 minutes (with a one minute grace period).

That is a small amount of uptime, but for gold certified nodes, whose uptime requirement only allows for ~90 minutes of downtime per month, it is actually fairly substantial.

Unless I am missing something, it seems the simple fix is to wait for IsFinalized and retry if it's not achieved within the expected finality time. It's not easy to find a figure quoted for this, but one source indicates that the standard GRANDPA config should take 18-24 seconds to finalize a set of blocks.
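Something like the following could implement that. It is a minimal sketch only, assuming the go-substrate-rpc-client/v4 API that Zos builds on; the 60 second budget, the function names, and the retry count are illustrative, and a real implementation would need to re-sign the extrinsic with a fresh nonce and timestamp hint before resubmitting.

```go
package uptime

import (
	"errors"
	"time"

	gsrpc "github.com/centrifuge/go-substrate-rpc-client/v4"
	"github.com/centrifuge/go-substrate-rpc-client/v4/rpc/author"
	"github.com/centrifuge/go-substrate-rpc-client/v4/types"
)

// finalityTimeout is an assumed budget; GRANDPA should finalize well within it.
const finalityTimeout = 60 * time.Second

// submitFinalized submits the extrinsic and only reports success once it is
// part of the finalized chain, retrying a few times otherwise.
func submitFinalized(api *gsrpc.SubstrateAPI, ext types.Extrinsic, attempts int) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		sub, err := api.RPC.Author.SubmitAndWatchExtrinsic(ext)
		if err != nil {
			lastErr = err
			continue
		}
		err = waitFinalized(sub)
		sub.Unsubscribe()
		if err == nil {
			return nil
		}
		lastErr = err
	}
	return lastErr
}

// waitFinalized drains the status channel until the extrinsic is finalized,
// falls out of the chain, or the finality budget runs out.
func waitFinalized(sub *author.ExtrinsicStatusSubscription) error {
	deadline := time.After(finalityTimeout)
	for {
		select {
		case st := <-sub.Chan():
			switch {
			case st.IsFinalized:
				return nil
			case st.IsRetracted, st.IsUsurped, st.IsDropped, st.IsInvalid, st.IsFinalityTimeout:
				return errors.New("extrinsic did not make it into the finalized chain")
			}
			// Ready, Broadcast, InBlock: keep waiting.
		case err := <-sub.Err():
			return err
		case <-deadline:
			return errors.New("timed out waiting for finalization")
		}
	}
}
```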

scottyeager commented 4 months ago

It seems the situation is actually worse. The problem is that farmerbot-managed nodes must produce a specific sequence of events in order to avoid penalties at minting time. A single missed uptime event can lead to a "node didn't boot within 30 minutes" violation, even if the node successfully changes its power state to Up.

Here's a new example. It's a bit different from before, because in this case I wasn't able to find an extrinsic failed event (within a day's worth of blocks after the logged attempt). Rather, I think the tfchain node that accepted the extrinsic was simply faulty.

Node 4276 on mainnet attempts to submit an uptime report:

[+] powerd: 2024-05-03T15:28:19Z info node uptime hash hash=0x8b0d754ef270649ba4c5a83e71846659e510aa5111878ef640c71cbeb55750d6
...
[-] powerd: 2024/05/03 15:28:18 Connecting to wss://03.tfchain.grid.tf/...

A quick check in GraphQL shows no corresponding events:

[screenshot: GraphQL query result with no corresponding uptime event]

The next uptime event we find for this node is at timestamp 1714752054. Strangely enough, the hash log line for this successful uptime report seems to be missing:

[screenshot: GraphQL uptime events around timestamp 1714752054]

The last thing I noticed here is that the node appears to have submitted the failed uptime report to the tfchain node at 03.tfchain.grid.tf, and I know that node has some issue, because other node logs contain many reports of failed attempts to connect to it. So I suspect that this tfchain node accepted the extrinsic, placed it into an unfinalized block, and then never reconciled with the winning chain.

scottyeager commented 4 months ago

Checking another example, I am beginning to doubt that it was the tfchain node at fault as I wrote above. In this instance there is again a hash logged by Zos, but no resulting uptime event can be found on tfchain, and also no extrinsic failed event.

Node 6587 on mainnet:

[+] powerd: 2024-05-22T08:11:25Z info node uptime hash hash=0x5654ad02f3267b65588de0cbbb45fcfe4ced30626ba28a95f2590c3cfbb4ea75
...
[-] powerd: 2024/05/22 08:11:24 Connecting to wss://04.tfchain.grid.tf/...

Again, no corresponding event in GraphQL (this time framed by the two events that are detected on either side):

[screenshot: GraphQL events on either side of the missing uptime report]

And my scan of the chain also returns neither a successful uptime event related to this extrinsic nor a failed extrinsic event. Maybe there's something to be corrected on the tfchain side, but my overall impression is still that Zos is not robust enough about ensuring that uptime reports actually get included in the finalized chain.
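For reference, this is roughly what such a scan looks like, again assuming the go-substrate-rpc-client/v4 chain RPCs. The hashing step (blake2b-256 over the SCALE-encoded extrinsic) is the usual Substrate convention, so treat this as an illustrative sketch rather than anything authoritative.

```go
package uptime

import (
	"bytes"

	gsrpc "github.com/centrifuge/go-substrate-rpc-client/v4"
	"github.com/centrifuge/go-substrate-rpc-client/v4/types"
	"github.com/centrifuge/go-substrate-rpc-client/v4/types/codec"
	"golang.org/x/crypto/blake2b"
)

// findExtrinsic walks the given block range and reports the first block whose
// body contains an extrinsic hashing to target (the hash Zos logged).
func findExtrinsic(api *gsrpc.SubstrateAPI, from, to uint64, target types.Hash) (uint64, bool, error) {
	for n := from; n <= to; n++ {
		blockHash, err := api.RPC.Chain.GetBlockHash(n)
		if err != nil {
			return 0, false, err
		}
		block, err := api.RPC.Chain.GetBlock(blockHash)
		if err != nil {
			return 0, false, err
		}
		for _, ext := range block.Block.Extrinsics {
			enc, err := codec.Encode(ext)
			if err != nil {
				return 0, false, err
			}
			sum := blake2b.Sum256(enc)
			if bytes.Equal(sum[:], target[:]) {
				return n, true, nil
			}
		}
	}
	return 0, false, nil
}
```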

LeeSmet commented 1 week ago

Checking another example, I am beginning to doubt that it was the tfchain node at fault as I wrote above. In this instance there is again a hash logged by Zos, but no resulting uptime event can be found on tfchain, and also no extrinsic failed event.

Node 6587 on mainnet:

[+] powerd: 2024-05-22T08:11:25Z info node uptime hash hash=0x5654ad02f3267b65588de0cbbb45fcfe4ced30626ba28a95f2590c3cfbb4ea75
...
[-] powerd: 2024/05/22 08:11:24 Connecting to wss://04.tfchain.grid.tf/...

Again, no corresponding event in GraphQL (this time framed by the two events that are detected on either side):

[screenshot: GraphQL events on either side of the missing uptime report]

And my scan of the chain also returns neither a successful uptime event related to this extrinsic nor a failed extrinsic event. Maybe there's something to be corrected on the tfchain side, but my overall impression is still that Zos is not robust enough about ensuring that uptime reports actually get included in the finalized chain.

For this one, it seems that the hash logged is actually the hash of a block, not of an extrinsic. And while looking through the block, there is an UpdateNode extrinsic included which was submitted by the node. This suggests some kind of race in Zos itself, where the success of one extrinsic submission propagates to other pending extrinsics.
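One cheap mitigation, independent of the race itself, would be to make the log line unambiguous by printing both identifiers. A hedged sketch, assuming the go-substrate-rpc-client/v4 types (the function and log wording are made up for this example): the hash reported by the status subscription (AsInBlock/AsFinalized) is a block hash, while the extrinsic hash has to be computed locally from the SCALE-encoded extrinsic.

```go
package uptime

import (
	"log"

	"github.com/centrifuge/go-substrate-rpc-client/v4/types"
	"github.com/centrifuge/go-substrate-rpc-client/v4/types/codec"
	"golang.org/x/crypto/blake2b"
)

// extrinsicHash computes blake2b-256 over the SCALE-encoded extrinsic, i.e.
// the identifier block explorers show, as opposed to the hash of the block
// the extrinsic lands in.
func extrinsicHash(ext types.Extrinsic) (types.Hash, error) {
	enc, err := codec.Encode(ext)
	if err != nil {
		return types.Hash{}, err
	}
	sum := blake2b.Sum256(enc)
	return types.Hash(sum), nil
}

// logFinalized logs both hashes so that a log line can never confuse the two.
func logFinalized(ext types.Extrinsic, st types.ExtrinsicStatus) {
	if !st.IsFinalized {
		return
	}
	extHash, err := extrinsicHash(ext)
	if err != nil {
		log.Printf("failed to hash extrinsic: %v", err)
		return
	}
	// st.AsFinalized is a *block* hash, not the extrinsic hash.
	log.Printf("uptime extrinsic 0x%x finalized in block 0x%x", extHash[:], st.AsFinalized[:])
}
```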

LeeSmet commented 1 week ago

A quick check shows exactly the same result for node 4276 from the previous comment: the hash points to a block, and while there is no uptime event, there is an UpdateNode event.

scottyeager commented 1 week ago

Thanks for the additional analysis on this @LeeSmet.

I wonder if there was something to this after all:

  • What remains is whether the submit-and-subscribe call of the substrate client itself (the official client) is wrong and has actually processed the success event of another extrinsic.
    • We can make sure the client is updated to the latest version.

There is an updated version of the upstream client available. We're using v4.0.12 and the latest is v4.2.1.

Of course it's also possible that we're using the library in an unsafe way, and ruling that out would be worthwhile.