coesensbert opened 3 weeks ago
With the current behavior, a faulty node can't be added to the farmerbot; you should exclude it.
Of course, that seems obvious. But a good node can become a faulty one while running, without the farmer having any idea it's happening. Like what was happening here: good node, then the SSD began to fail, then the farmerbot crashed. The point is: when a node goes bad, should the farmerbot stop working for all the other good, working nodes?
No, it won't crash. What I see here is that the farmerbot crashed while it was adding the farm nodes. If a node becomes faulty while the farmerbot is running, it will be ignored.
https://github.com/threefoldtech/tf_operations/issues/2609#issue-2364805326
So these logs are normal? And the fb is still operational, with this error and without displaying the CLI dashboard?
These logs show that the farmerbot is trying to run again every time it fails. The farmerbot doesn't do that by itself; I think it is the way you are running it.
It was running fine, then a node got disk issues, then it started doing this. So it's not the user here trying to run the fb with a faulty node; the node became faulty and the fb started doing this. If it's by design, fine. But no nodes were shut down or booted after the fb started doing this (while it had already been running for weeks).
You should exclude the node. What I see is that the farmerbot is trying to start but is stuck. Also, you are saying that the farmerbot was running, but the logs show it starting. I can't see why it stopped.
I have done that, and it's working again. Before I did, no nodes were shut down or booted until I removed the faulty node. Is this by design?
Yes, the fb was running for weeks. I noticed no nodes were booted anymore, so I went to check the farmerbot logs. In the logs you can see it is stuck, and therefore it is starting again. So it's stopping and starting in a loop, without booting or shutting down nodes, because a node that was good in the past became bad due to SSD issues.
I agree that we need more robust handling in this case. It doesn't really matter why the bot restarted: it could be a reboot of the machine that's hosting it, a crash of the bot for some other reason, getting OOM killed, or whatever.
We've seen other cases where this sequence of events happened and the farmer didn't realize it until their entire farm failed to wake up (and thus they had already lost reward money). It does seem like there could be a trend of the bot restarting coincidentally with errors that prevent it from starting up, but that's just a hunch.
Unless there's a compelling reason not to start the bot when errors like this occur, I think starting anyway and skipping the faulty node would be a sensible default behavior. Otherwise it could be added as a new flag or rolled into the existing continue-on-error flag for errors occurring during operation. A rough sketch of what that could look like is below.
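As a rough sketch only (the names here, like `addFarmNodes` and `continueOnError`, are hypothetical and not taken from the actual farmerbot codebase), the startup loop could log and skip a faulty node instead of aborting:

```go
// Hypothetical sketch: addNode, addFarmNodes, and continueOnError are
// illustrative names, not the real farmerbot API.
package main

import "log"

// addNode stands in for the real per-node setup step that fetches
// node statistics over RMB and can fail when the node is faulty.
func addNode(nodeID uint32) error {
	// ... query node statistics, return an error on I/O or RMB failures
	return nil
}

// addFarmNodes adds all nodes of the farm. With continueOnError set,
// a faulty node is logged and skipped so the bot still starts and
// keeps managing the healthy nodes.
func addFarmNodes(nodeIDs []uint32, continueOnError bool) error {
	for _, id := range nodeIDs {
		if err := addNode(id); err != nil {
			if continueOnError {
				log.Printf("skipping node %d: %v", id, err)
				continue
			}
			// current behavior: one bad node aborts startup entirely
			return err
		}
	}
	return nil
}

func main() {
	if err := addFarmNodes([]uint32{2, 7, 10}, true); err != nil {
		log.Fatal(err)
	}
}
```

The design question is just where the boundary sits: fail fast on configuration mistakes the farmer should fix, but tolerate per-node runtime failures so one bad disk doesn't stop power management for the whole farm.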
What happened?
We saw the same on testnet & mainnet. If a node is in trouble for whatever reason (disk issues, CPU/RAM issues, simply turned off by the owner, ...), it can break the farmerbot. It returns an error and starts over again, like:
Error: failed to add node with id 7 with error: failed to get node 7 statistics from rmb with error: stderr: ERROR: can't perform the search: Input/output error
or
Error: failed to add node with id 10 with error: failed to get node 10 statistics from rmb with error: message signature verification failed: could not verify signature
or
Error: failed to add node with id 2995 with error: failed to get node 2995 statistics from rmb with error: context deadline exceeded
All three of these examples got the fb into a loop.
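For illustration, here is a minimal, hypothetical reconstruction (not the actual farmerbot source) of how errors with exactly this nested shape come out of standard Go error wrapping, using the node 7 case above:

```go
// Illustrative only: reproduces the shape of the quoted errors with
// standard Go error wrapping; not the actual farmerbot implementation.
package main

import (
	"errors"
	"fmt"
)

// errRMB mimics the underlying RMB failure seen for node 7 above.
var errRMB = errors.New("stderr: ERROR: can't perform the search: Input/output error")

func getNodeStatistics(id uint32) error {
	// each layer adds context with %w so callers can unwrap it
	return fmt.Errorf("failed to get node %d statistics from rmb with error: %w", id, errRMB)
}

func addNode(id uint32) error {
	if err := getNodeStatistics(id); err != nil {
		return fmt.Errorf("failed to add node with id %d with error: %w", id, err)
	}
	return nil
}

func main() {
	// prints the same message as the first quoted error above
	fmt.Println(addNode(7))
}
```

Each layer adds its own context with %w, which is why the messages read as a chain of "failed to ... with error: ..." fragments ending in the root cause.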
https://github.com/threefoldtech/tf_operations/issues/2609 https://github.com/threefoldtech/tf_operations/issues/2610
Which network/s did you face the problem on?
Test, Main
Twin ID/s
604
Version
testnet: 0.15.10 & mainnet: 0.14.13
Node ID/s
No response
Farm ID/s
1
Contract ID/s
No response
Relevant log output