threefoldtech / tfgrid-sdk-go

Apache License 2.0

🐞 [Bug]: Farmerbot fails to start fully due to an RMB communication error. #1191

Open mahendravarmayadala93 opened 2 months ago

mahendravarmayadala93 commented 2 months ago

What happened?

Farm ID : 195

The client reported that his Nodes managed by Farmerbot did not shut down.

Upon reviewing the log file, we found that the farmerbot was not starting up due to the following error:

8:20AM DBG failed to read message error="websocket: close 1006 (abnormal closure): unexpected EOF"
8:20AM DBG connecting url=wss://relay.grid.tf

Some additional notes by @scottyeager:

I checked the log file. The core thing here is that the bot never fully starts up. That is indeed due to the failure of RMB communication associated with this error.

We see on each attempt of the bot to start that it adds one node successfully using the same RMB relay before failing repeatedly on the second node. So there is some successful RMB communication happening.

I also checked on the rate limiting implementation for RMB. It looks like it only drops messages with an error. It isn't supposed to drop connections entirely if the user tries to send too many messages.

Log File :

farmerbot_16enuun.log

Which network/s did you face the problem on?

Main

Twin ID/s

No response

Version

No response

Node ID/s

626, 548, 547 (Offline currently) - 3038 (Online)

Farm ID/s

195

Contract ID/s

No response

Relevant log output

Config File

farm_id: 195
never_shutdown_nodes:
  - 626
power:
  periodic_wake_up_start: 09:00AM
  periodic_wake_up_limit: 3

rawdaGastan commented 1 month ago

TullysInc commented 6 days ago

@rawdaGastan: There is a more recent report from a second farmer (farm ID 250), whose logs show the same error lines.

farmer@bot:~/farmerbot$ tail -n 50 farmerbot.log
2024/11/18 14:08:47 Connecting to wss://tfchain.grid.tf:443...
2:08PM INF starting peer session=farmerbot-rpc-250 twin=826
2:08PM DBG connecting url=wss://tfchain.grid.tf/ws
2024/11/18 14:08:49 Connecting to wss://tfchain.grid.tf/ws...
2:08PM DBG connecting url=wss://relay.grid.tf
2:08PM DBG Add node nodeID=3736
2:08PM DBG failed to read message error="websocket: close 1006 (abnormal closure): unexpected EOF"
2:08PM DBG connecting url=wss://relay.grid.tf
2:08PM DBG Add node nodeID=4746
2:08PM DBG failed to read message error="websocket: close 1006 (abnormal closure): unexpected EOF"

All nodes included in this config are currently up in the dashboard, so we can probably rule out the nodes being unhealthy before they were added to the farmerbot. Also, the --continue-power-on-error flag is already included in the script that was used for setup.

farm_id: 250
included_nodes:

scottyeager commented 4 days ago
  • You can try to use --continue-power-on-error flag

We are advising all farmers to use this flag, but it doesn't seem to help in every case. Aside from the EOF error above, regular timeouts while trying to reach powered off nodes also seem to block the bot from starting, for example:

error:
9:13PM FTL error="failed to add node with id 2950 with error: failed to get node 2950 statistics from rmb with error: context deadline exceeded"

@rawdaGastan, can you clarify the expected behavior with --continue-power-on-error?
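
For reference, the "context deadline exceeded" text in the fatal line above is the message of Go's standard context.DeadlineExceeded error, which a caller can still detect with errors.Is after it has been wrapped. A minimal sketch, assuming nothing about the farmerbot's actual code: fetchStats and addNode are hypothetical stand-ins that only mimic the wording of the log line, and the delays are arbitrary.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// fetchStats is a hypothetical stand-in for an RMB call to a node.
// It succeeds after delay, or fails early if the context expires first.
func fetchStats(ctx context.Context, nodeID uint32, delay time.Duration) error {
	select {
	case <-time.After(delay):
		return nil
	case <-ctx.Done():
		return fmt.Errorf("failed to get node %d statistics from rmb with error: %w", nodeID, ctx.Err())
	}
}

// addNode (hypothetical) gives the call a deadline and wraps any failure.
func addNode(nodeID uint32, timeout, delay time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	if err := fetchStats(ctx, nodeID, delay); err != nil {
		return fmt.Errorf("failed to add node with id %d with error: %w", nodeID, err)
	}
	return nil
}

// isTimeout reports whether err, or anything it wraps, is a deadline error.
func isTimeout(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	// The simulated node answers after 1s, but the caller only waits 50ms.
	err := addNode(2950, 50*time.Millisecond, time.Second)
	fmt.Println(err)
	fmt.Println("timeout:", isTimeout(err))
}
```

Being able to classify the failure this way is what would let a caller decide to skip a slow or powered-off node rather than treat the timeout as fatal.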

rawdaGastan commented 1 day ago

The --continue-power-on-error flag allows the farmerbot to continue updating and managing nodes even if some nodes have RMB connection errors. If the flag is not set and some nodes have RMB issues, the farmerbot won't be able to start.

It is expected that nodes cannot communicate through RMB when they are offline.

scottyeager commented 3 hours ago

The --continue-power-on-error flag allows the farmerbot to continue updating and managing nodes even if some nodes have RMB connection errors. If the flag is not set and some nodes have RMB issues, the farmerbot won't be able to start.

This matches what we expected. The problem is that we are seeing various cases where the bot does not start due to RMB errors, despite the --continue-power-on-error flag being passed. That's why I was trying to clarify whether there is still some case that should cause the bot to refuse to start due to RMB failures with the flag present.

Assuming no such case exists, our issue is that the bot is still refusing to start with --continue-power-on-error.
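
To make the expectation concrete, here is a minimal sketch of the behavior we understand the flag to have. It is not the farmerbot's actual implementation: addNodes, addNodeFunc, and failFor are hypothetical names, and the simulated failure just reuses the timeout text from the log above.

```go
package main

import (
	"errors"
	"fmt"
)

// addNodeFunc is a hypothetical stand-in for the per-node RMB setup call.
type addNodeFunc func(nodeID uint32) error

// failFor returns an addNodeFunc that fails only for the listed node IDs.
func failFor(bad ...uint32) addNodeFunc {
	return func(nodeID uint32) error {
		for _, id := range bad {
			if id == nodeID {
				return errors.New("context deadline exceeded")
			}
		}
		return nil
	}
}

// addNodes tries to add every node in order. With continueOnError set,
// failing nodes are recorded and skipped; otherwise the first failure
// aborts startup and is returned as an error.
func addNodes(nodeIDs []uint32, add addNodeFunc, continueOnError bool) (added, skipped []uint32, err error) {
	for _, id := range nodeIDs {
		if addErr := add(id); addErr != nil {
			if !continueOnError {
				return added, skipped, fmt.Errorf("failed to add node with id %d with error: %w", id, addErr)
			}
			skipped = append(skipped, id)
			continue
		}
		added = append(added, id)
	}
	return added, skipped, nil
}

func main() {
	nodes := []uint32{3736, 2950, 4746}
	for _, flag := range []bool{false, true} {
		added, skipped, err := addNodes(nodes, failFor(2950), flag)
		fmt.Printf("continueOnError=%v added=%v skipped=%v err=%v\n", flag, added, skipped, err)
	}
}
```

Under this reading, a single unreachable node should never produce a fatal startup error once the flag is set, which is not what the logs in this issue show.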

scottyeager commented 2 hours ago

It is expected that nodes cannot communicate through RMB when they are offline.

These errors are coming from online nodes. I'm also not sure what the severity of these errors is. I did some searching regarding the EOF error for websockets and found this:

The error indicates that the peer closed the connection without sending a close message. The RFC calls this "abnormal closure", but the error is normal to receive.

But I also found some different suggestions about adjusting timeouts over reverse proxies, etc. So I guess it would also be good to clarify if this EOF error is something that we should be concerned with addressing.
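
If the EOF turns out to be benign, the usual handling is to treat it as a transient disconnect and reconnect with a capped backoff. The logs above suggest the client already reconnects after each failure. A generic sketch of that pattern, with hypothetical names (runWithReconnect, isTransient, failNTimes) and a simulated session instead of a real websocket connection:

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"time"
)

// backoff returns the wait before retry number attempt (0-based),
// doubling from base and never exceeding limit.
func backoff(attempt int, base, limit time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt && d < limit; i++ {
		d *= 2
	}
	if d > limit {
		d = limit
	}
	return d
}

// isTransient reports whether err looks like a dropped connection.
func isTransient(err error) bool {
	return errors.Is(err, io.EOF) || errors.Is(err, io.ErrUnexpectedEOF)
}

// runWithReconnect runs session until it succeeds, fails with a
// non-transient error, or maxAttempts is reached. It returns the number
// of attempts made and the last error.
func runWithReconnect(session func() error, maxAttempts int, base, limit time.Duration) (int, error) {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		err = session()
		if err == nil || !isTransient(err) {
			return attempt + 1, err
		}
		if attempt < maxAttempts-1 {
			time.Sleep(backoff(attempt, base, limit))
		}
	}
	return maxAttempts, err
}

// failNTimes simulates a session that drops with an unexpected EOF
// n times before succeeding.
func failNTimes(n int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= n {
			return fmt.Errorf("failed to read message: %w", io.ErrUnexpectedEOF)
		}
		return nil
	}
}

func main() {
	attempts, err := runWithReconnect(failNTimes(2), 5, 10*time.Millisecond, time.Second)
	fmt.Println("attempts:", attempts, "err:", err)
}
```

Whether reconnecting is enough here depends on why the relay or an intermediate proxy closes the connection, which is the open question in this thread.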