threefoldtech / tfgrid-sdk-go

Apache License 2.0

🐞 [Bug]: faulty zos node can break farmerbot #1100

Open coesensbert opened 3 weeks ago

coesensbert commented 3 weeks ago

What happened?

We saw the same behavior on both testnet and mainnet. If a node is in trouble for whatever reason (disk issues, CPU/RAM issues, simply turned off by the owner, ..) it can break the farmerbot: the bot returns an error and starts over again, for example:

Error: failed to add node with id 7 with error: failed to get node 7 statistics from rmb with error: stderr: ERROR: can't perform the search: Input/output error

Error: failed to add node with id 10 with error: failed to get node 10 statistics from rmb with error: message signature verification failed: could not verify signature

Error: failed to add node with id 2995 with error: failed to get node 2995 statistics from rmb with error: context deadline exceeded

All three of these examples put the farmerbot in a restart loop.

https://github.com/threefoldtech/tf_operations/issues/2609 https://github.com/threefoldtech/tf_operations/issues/2610

which network/s did you face the problem on?

Test, Main

Twin ID/s

604

Version

testnet: 0.15.10 & mainnet: 0.14.13

Node ID/s

No response

Farm ID/s

1

Contract ID/s

No response

Relevant log output

4:16PM INF Welcome to farmerbot (v0.15.10), Farmerbot is starting up...
4:16PM DBG connecting url=wss://tfchain.test.grid.tf/ws
2024/06/20 16:16:24 Connecting to wss://tfchain.test.grid.tf/ws...
4:16PM DBG connecting url=wss://tfchain.test.grid.tf:443
2024/06/20 16:16:24 Connecting to wss://tfchain.test.grid.tf:443...
4:16PM INF starting peer session=farmerbot-rpc-1 twin=2
4:16PM DBG connecting url=wss://tfchain.test.grid.tf/ws
2024/06/20 16:16:24 Connecting to wss://tfchain.test.grid.tf/ws...
4:16PM DBG connecting url=wss://relay.test.grid.tf
4:16PM DBG Add node nodeID=10
4:16PM DBG Add node nodeID=20
4:16PM DBG Add node nodeID=189
4:16PM WRN Updating power, Power target is waking up nodeID=189
4:16PM WRN Node state is off, will skip rmb calls nodeID=189
4:16PM DBG Add node nodeID=16
4:16PM DBG Add node nodeID=11
4:16PM DBG Add node nodeID=196
4:16PM WRN Updating power, Power target is off nodeID=196
4:16PM WRN Node state is off, will skip rmb calls nodeID=196
4:16PM DBG Add node nodeID=14
4:16PM DBG Add node nodeID=19
4:16PM DBG Add node nodeID=191
4:16PM DBG Add node nodeID=15
4:16PM DBG Add node nodeID=199
4:16PM DBG Add node nodeID=2
4:16PM DBG Add node nodeID=74
4:16PM WRN Updating power, Power target is off nodeID=74
4:16PM WRN Node state is off, will skip rmb calls nodeID=74
4:16PM DBG Add node nodeID=13
4:16PM DBG Add node nodeID=188
4:16PM WRN Updating power, Power target is off nodeID=188
4:16PM WRN Node state is off, will skip rmb calls nodeID=188
4:16PM DBG Add node nodeID=194
4:16PM DBG Add node nodeID=5
4:16PM DBG Add node nodeID=18
4:16PM DBG Add node nodeID=7
Error: failed to add node with id 7 with error: failed to get node 7 statistics from rmb with error: stderr: ERROR: can't perform the search: Input/output error
: exit status 1
Usage:
  farmerbot run [flags]

Flags:
  -c, --config string             enter your config file that includes your farm, node and power configs. Allowed format is yml/yaml
      --continue-power-on-error   when set, the farmerbot will run even if there was an error powering on some of the nodes
  -h, --help                      help for run

Global Flags:
  -d, --debug             by setting this flag the farmerbot will print debug logs too
  -e, --env string        enter your env file that includes your NETWORK and MNEMONIC_OR_SEED
  -k, --key-type string   key type for mnemonic (default "sr25519")
  -m, --mnemonic string   the mnemonic of the account of the farmer
  -n, --network string    the grid network to use, available networks: dev, qa, test, and main (default "main")
  -s, --seed string       the hex seed of the account of the farmer

4:16PM FTL error="failed to add node with id 7 with error: failed to get node 7 statistics from rmb with error: stderr: ERROR: can't perform the search: Input/output error\n: exit status 1"
rawdaGastan commented 3 weeks ago

With the current behavior, a faulty node can't be added to the farmerbot; you should exclude it.
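Excluding a node is done in the farmerbot's yml config file (the file passed with `-c/--config`, per the CLI help above). A minimal sketch of what that might look like — the `farm_id`, `included_nodes`, and `excluded_nodes` keys are assumed from the farmerbot docs, and the node IDs are taken from this report for illustration:

```yaml
# farmerbot config sketch (key names assumed, not verified against this version)
farm_id: 1
included_nodes:
  - 2
  - 10
  - 20
excluded_nodes:
  - 7   # the faulty node from the error above: skip it entirely
```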

coesensbert commented 3 weeks ago

Of course, that seems obvious. But a good node can become a faulty one while the farmerbot is running, without the farmer having any idea it is happening — which is what happened here: a good node's SSD began to fail, and the farmerbot crashed. The point is: when one node goes bad, should the farmerbot stop working for all the other good nodes?

rawdaGastan commented 3 weeks ago

No, it won't crash. What I see here is that the farmerbot crashed while it was adding the farm nodes. If a node becomes faulty while the farmerbot is running, it will be ignored.

coesensbert commented 3 weeks ago

https://github.com/threefoldtech/tf_operations/issues/2609#issue-2364805326

So these logs are normal? And the farmerbot is still operational, with this error and without displaying the CLI dashboard?

rawdaGastan commented 3 weeks ago

These logs show the farmerbot trying to run again every time it fails. The farmerbot doesn't restart itself; I think it is the way you are running it.

coesensbert commented 3 weeks ago

It was running fine, then a node developed disk issues, then it started doing this. So it's not the user trying to run the farmerbot with a faulty node; the node became faulty and the farmerbot started doing this. If it's by design, fine. But no nodes were shut down or booted after the farmerbot started doing this (while it had already been running for weeks).

rawdaGastan commented 3 weeks ago

You should exclude the node. What I see is that the farmerbot is trying to start, but it is stuck. Also, you say the farmerbot was running, but the logs show it starting; I can't see why it stopped.

coesensbert commented 3 weeks ago

I have done that, and it's working again. Before I removed the faulty node, no nodes were shut down or booted. Is this by design?

Yes, the farmerbot had been running for weeks. I noticed no nodes were being booted anymore, so I went to check the farmerbot logs. In the logs you can see it gets stuck, and therefore starts again. So it's stopping and starting in a loop, without booting or shutting down nodes, because a node that used to be good went bad due to SSD issues.

scottyeager commented 10 hours ago

I agree that we need more robust handling in this case. It doesn't really matter why the bot restarted: it could be a reboot of the machine hosting it, a crash of the bot for some other reason, getting OOM killed, or whatever.

We've seen other cases where this sequence of events happened and the farmer didn't realize it until their entire farm failed to wake up (and thus they had already lost reward money). It does seem like there could be a trend of the bot restarting coincidentally with errors that prevent it from starting up, but that's just a hunch.

Unless there's a compelling reason not to start the bot when errors like this occur, I think starting anyway would be sensible default behavior. Otherwise, it could be added as a new flag or rolled into the existing continue-on-error flag for errors occurring during operation.
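The startup change being discussed can be sketched in Go. This is not the farmerbot's actual code — `addNode` and `addFarmNodes` are hypothetical stand-ins — but it shows the skip-and-collect pattern: log per-node errors and continue, failing startup only if no node could be added at all.

```go
package main

import (
	"errors"
	"fmt"
)

// addNode stands in for fetching node statistics over RMB during startup;
// a faulty node (bad disk, unreachable, etc.) returns an error.
// Names here are illustrative, not the real farmerbot API.
func addNode(id uint32, faulty map[uint32]bool) error {
	if faulty[id] {
		return fmt.Errorf("failed to get node %d statistics from rmb", id)
	}
	return nil
}

// addFarmNodes adds every node it can, collecting per-node errors
// instead of aborting the whole startup on the first failure.
func addFarmNodes(ids []uint32, faulty map[uint32]bool) (added []uint32, errs []error) {
	for _, id := range ids {
		if err := addNode(id, faulty); err != nil {
			errs = append(errs, fmt.Errorf("skipping node %d: %w", id, err))
			continue
		}
		added = append(added, id)
	}
	return added, errs
}

func main() {
	ids := []uint32{2, 7, 10, 2995}          // node IDs from the report
	faulty := map[uint32]bool{7: true}       // node 7 has the failing SSD
	added, errs := addFarmNodes(ids, faulty)

	fmt.Println("added:", added)
	for _, e := range errs {
		fmt.Println("warn:", e) // faulty node is logged, not fatal
	}
	// Startup only fails if no node could be added at all.
	if len(added) == 0 {
		panic(errors.New("no nodes could be added"))
	}
}
```

With this behavior, the restart loop in the logs above would instead produce a single warning for node 7 and the bot would keep managing the remaining nodes.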