threefoldtech / zos

Autonomous operating system
https://threefold.io/host/
Apache License 2.0

Nodes with tfchain error don't update #2393

Open scottyeager opened 1 month ago

scottyeager commented 1 month ago

I noticed that I can't reach some mainnet nodes over RMB.

RMBError: 104 invalid envelope signature: sr25519 signature verification failed

Here's an example from the dashboard, when attempting to deploy a VM on node 1479:

[Screenshot: dashboard error when deploying a VM on node 1479]

Same result using the RMB proxy:

[Screenshot: the same error via the RMB proxy]

Here's a non-exhaustive list of affected node IDs on mainnet:

1087
1226
1479
1640
1723
1926
1966
2158
2723
4349

rawdaGastan commented 3 weeks ago

Are you sure those nodes are updated? Can you please check their versions if possible?

scottyeager commented 3 weeks ago

I have reviewed the logs for all nodes in my list above. It seems they all have some issue that's preventing them from updating.

What's common in the logs of all nodes is this line:

[+] identityd: error failed to get flist info error="failed to get flist (tf-zos/zos:production-3:latest.flist) info: 404 Not Found"

Most of the nodes also report a read-only cache error and a resulting boltdb failure. For example:

[+] provisiond: fatal exiting error="error running integrity checks: unlinkat /var/cache/modules/provisiond/metrics-diff.bolt: read-only file system"

Nodes with this error: 1087, 1226, 1479, 1640, 1723, 2158, 2723, 4349

A couple of nodes don't have the read-only cache error but instead have an error regarding tfchain, like this:

[+] noded:  error failed to decode events from tfchain error="unable to find field Balances_Locked for event #62 with EventID [20 17]"

Nodes with this error: 1926, 1966
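
For anyone who wants to sort more nodes into these groups, here is a rough Python sketch that tags log lines by the three error messages quoted above. The label names and the `classify_line` / `classify_log` helpers are my own invention for illustration, not part of zos, and the `[+] <daemon>: <message>` prefix is assumed from the lines in this comment only.

```python
import re

# Substrings copied from the log lines quoted in this issue.
# The label names on the left are hypothetical, chosen for this sketch.
SIGNATURES = {
    "flist_not_found": "failed to get flist info",
    "read_only_cache": "read-only file system",
    "tfchain_decode": "failed to decode events from tfchain",
}

# Assumed line shape: "[+] <daemon>: <message>" (based only on the samples above).
LINE_RE = re.compile(r"^\[\+\]\s+(?P<daemon>\w+):\s+(?P<message>.*)$")


def classify_line(line):
    """Return (daemon, label) for a known error line, or None otherwise."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    for label, needle in SIGNATURES.items():
        if needle in match["message"]:
            return match["daemon"], label
    return None


def classify_log(text):
    """Return the set of error labels found anywhere in a node's log text."""
    labels = set()
    for line in text.splitlines():
        result = classify_line(line)
        if result is not None:
            labels.add(result[1])
    return labels
```

Running `classify_log` over each node's log gives a set of labels per node, so nodes that have `tfchain_decode` but not `read_only_cache` (like 1926 and 1966 here) are easy to pick out.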

Checking now, I see that there's a fix for nodes with a read-only cache not getting the latest version.

But what about those last two nodes? They are not reporting a read-only cache, yet they show similar behavior in not picking up the latest version.