threefoldtecharchive / farmerbot

ability to manage a farm
Apache License 2.0

Nodes not powering off - farm 2209 #35

Closed scottyeager closed 1 year ago

scottyeager commented 1 year ago

A farmer recently set up the farmerbot and has some nodes that should be eligible to shut down. The nodes aren't shutting down as expected.

Here's the log file: https://gist.github.com/scottyeager/3720726963a78870a4101ff2b447137c

Notable excerpt from the bottom of the file:

2023-05-08 18:37:17 [INFO ] [POWERMANAGER] Executing job: POWERON 4000
2023-05-08 18:37:31 [DEBUG] Received result for job with guid cbd29e28-e378-46c4-af51-f17e9c557549
2023-05-08 18:37:31 [DEBUG] Returned job cbd29e28-e378-46c4-af51-f17e9c557549
2023-05-08 18:37:31 [INFO ] Elapsed time for update: 0.6467387333333333
2023-05-08 18:41:57 [INFO ] [DATAMANAGER] Node 3616 is waking up.
2023-05-08 18:42:02 [INFO ] [DATAMANAGER] Node 3690 is waking up.
2023-05-08 18:42:07 [INFO ] [DATAMANAGER] Node 4000 is waking up.
2023-05-08 18:42:12 [ERROR] [DATAMANAGER] Node 4381 wakeup was unsuccessful. Putting its state back to off.
2023-05-08 18:42:17 [ERROR] [DATAMANAGER] Node 4382 wakeup was unsuccessful. Putting its state back to off.

Similar messages are repeated through the logs, indicating unsuccessful wakeups for different nodes. According to GraphQL, they are Up/Up:

[image: GraphQL result showing the nodes as Up/Up]

I tried pinging 4381 and 4382 over RMB and they responded immediately.

brandonpille commented 1 year ago

It looks like the farmerbot is not receiving an answer (in time) from the nodes. The farmerbot sends RMB messages to check the status of each node. If a node doesn't respond, the farmerbot assumes it is off. If the status the farmerbot expects differs from what it observes, it logs an error, for example when it has set the target to powering on and the node is still not reachable after 30 minutes. What happened here is that it assumed the node to be ON but didn't get an answer from it, so it returned an error. Actually, it received no answer from any of the nodes, so I think something is wrong with the rmb-peer here.
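The status-check behaviour described in this comment can be sketched roughly as follows. This is an illustrative Python sketch, not the farmerbot's actual code: the names `State`, `Node`, `reconcile`, and `WAKEUP_TIMEOUT_S` are hypothetical, the error texts are made up for illustration, and only the 30-minute limit is taken from the comment.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

# Hypothetical constant; the 30-minute figure comes from the comment above.
WAKEUP_TIMEOUT_S = 30 * 60


class State(Enum):
    ON = "on"
    WAKING_UP = "waking_up"
    OFF = "off"


@dataclass
class Node:
    node_id: int
    state: State
    # Seconds since the wakeup was requested; only meaningful while waking up.
    waking_for_s: float = 0.0


def reconcile(node: Node, responded: bool) -> Optional[str]:
    """Update node.state after one status check; return an error text or None."""
    if responded:
        # Any answer over RMB means the node is reachable, so treat it as on.
        node.state = State.ON
        return None
    if node.state is State.ON:
        # Expected on, but no answer: assume it is off and report the mismatch.
        node.state = State.OFF
        return f"node {node.node_id}: expected ON but got no answer"
    if node.state is State.WAKING_UP and node.waking_for_s >= WAKEUP_TIMEOUT_S:
        # Still unreachable after the wakeup window: give up and mark it off.
        node.state = State.OFF
        return f"node {node.node_id}: still unreachable after wakeup window"
    # Off and silent, or still within the wakeup window: nothing to report.
    return None
```

Under this sketch, a node that is actually up but whose replies never reach the bot would be marked off and reported as an error, which matches the symptom in this issue.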

Can I get the logs from all containers? Run: docker compose logs > farmerbot.log

scottyeager commented 1 year ago

I've requested these logs and will report back when the farmer provides them.

brandonpille commented 1 year ago

Hi @scottyeager. Did you get the logs?

scottyeager commented 1 year ago

Hi @brandonpille, here are the logs from all containers: https://gist.github.com/scottyeager/357f87e534cf054a676f417e2481cdd6

I noticed that rmbpeer reported a reset connection a couple of times, but after that it looks normal. I asked the farmer to restart the farmerbot and provide another log file.

scottyeager commented 1 year ago

And here's the new log after restarting the bot: https://gist.github.com/scottyeager/0eacd9419ae78fe000af73bafaaabbf2

scottyeager commented 1 year ago

Hi @brandonpille, any update on this?

brandonpille commented 1 year ago

It looks like rmbpeer is not working properly here. Can I see the output of docker container ls --all?

brandonpille commented 1 year ago

Can he try restarting the rmb-peer? First find the container name of the peer in the output of docker container ls. It should end with "-rmbpeer-1". Then run docker container restart followed by that container name.

scottyeager commented 1 year ago

Hi @brandonpille,

The farmer updated to the latest version of the containers and thus restarted the bot completely in the process. Now he reports that one node (4382) is going to sleep as expected, while the others are not. Here is an updated set of logs from all the containers:

https://gist.github.com/scottyeager/4d8d55a4af84b09593d146005e7f6764

brandonpille commented 1 year ago

I see a lot of "received reply of an expired message" messages, which means the answers from RMB that the farmerbot requires are not arriving in time (within 5 seconds). So that tells me that the issue https://github.com/threefoldtech/farmerbot/issues/30 is getting more and more urgent.
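To illustrate what an "expired" reply means here, the following is a minimal Python sketch of a request tracker with a reply deadline. It is an assumption about how such a check could work, not the actual RMB or farmerbot implementation: `PendingRequests`, `sent`, `on_reply`, and `REPLY_TIMEOUT_S` are hypothetical names, and only the 5-second limit comes from the comment.

```python
from typing import Dict

# Hypothetical constant; the 5-second figure comes from the comment above.
REPLY_TIMEOUT_S = 5.0


class PendingRequests:
    """Tracks the reply deadline of each outstanding request by message id."""

    def __init__(self) -> None:
        self._deadlines: Dict[str, float] = {}

    def sent(self, msg_id: str, now: float) -> None:
        # Record when a reply to this message stops being acceptable.
        self._deadlines[msg_id] = now + REPLY_TIMEOUT_S

    def on_reply(self, msg_id: str, now: float) -> bool:
        # True only if the request is known and the reply beat its deadline;
        # a late reply is treated as a reply to an expired message.
        deadline = self._deadlines.pop(msg_id, None)
        return deadline is not None and now <= deadline
```

With this model, a node that answers after 6 seconds is indistinguishable from one that never answered, so a slow relay path alone could make healthy nodes look unreachable.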

brandonpille commented 1 year ago

This should be fixed with the new release