near / nearcore

Reference client for NEAR Protocol
https://near.org
GNU General Public License v3.0
2.34k stars 632 forks source link

Misconfigured validator `account_id`s do not output any warnings to the operator #8771

Open nagisa opened 1 year ago

nagisa commented 1 year ago

When setting up a mainnet fork, I accidentally put a wrong account_id of the validator into validator_key.json. While the periodic stats messages did indicate that this node is not operating as a validator, there was no other indication, error or warning explaining what might have gone wrong.

We do have a message like this on startup:

2023-03-21T17:39:38.134516Z  INFO client: Starting validator node: {validator's account_id}

neard also clearly knows that it is being a validator or not (as evidenced by it being able to put the Validator | snippet in the periodic target: stats event). We should be able to figure out that something has gone wrong in this case and output at least a warning message (ideally explaining why.)

marcelo-gonzalez commented 1 year ago

As I see it, the problem is that it's hard for neard to know that there actually is a problem, because what if I actually meant to use that account ID? If the network was initialized with 2 validators, node0 and node1, and I start a node with a validator key that shows "account_id": "nodee0", it's clear that I just made a mistake. But how can neard generate a warning saying something like WARN: you made a typo in your validator key file, when I might actually have meant to use that account? Maybe what we could do is amend the Starting validator node message to be more like:

2023-03-21T17:39:38.134516Z  INFO client: Starting validator node: {validator's account_id}. This account is not a current block or chunk producer.
nagisa commented 1 year ago

In my case the account didn’t exist at all anywhere in the network, and I’d imagine there are plenty of other failure modes – correct account_id but an invalid public key for example; or public/private key pair being invalid. My intuition is that cases like these should be straightforward to check even against just local rocksdb copy (it stores the list of validators, doesn't it?). In this case false positives are probably fine as well – the only outcome is a log message inviting the operator to double check.

That said all this might be overcomplicating the solution. We have Validator | in the stats message. Would adding NOT Validator | if the node has validator_key.json set up but isn’t validating be sufficient? Possible reasons for why the message shows up like that could be documented separately at that point, we could write a troubleshooting guide at that point or whatever.