threefoldtech / zos

Autonomous operating system
https://threefold.io/host/
Apache License 2.0

Node depletes its TFT on InterfaceIpTooShort error #2419

Open scottyeager opened 1 month ago

scottyeager commented 1 month ago

A farmer reported that their node 2358 on mainnet no longer appeared online in the dashboard. I pulled the node logs and saw repeated errors showing that the node failed to register, with the message "extrinsic temporarily banned". Since that's caused by insufficient wallet balance for the node, I checked and saw that the node had only 0.0018 TFT and was therefore likely unable to complete transactions.

Then I funded the node wallet with 0.1 TFT and asked the farmer to boot the node up again. This time I was able to observe what had caused the node to drain its wallet:

[+] noded: 2024-09-12T19:18:54Z warn registration failed error="failed to register node: failed to update node data with id: 2358: failed to update node: InterfaceIpTooShort" sleep=342.453987ms

Just scanning the node logs, I don't see any obvious cause for this error. Maybe there's some clue in here:

[+] networkd/test: ## Status for  network
[-] dhcp-npub4: npub4: adding default route via 192.168.68.1
[-] dhcp-npub4: npub4: adding route to 192.168.68.0/24
[-] dhcp-npub4: npub4: leased 192.168.68.111 for 7200 seconds
[+] networkd/test: 2024-09-12T19:18:25Z fatal exiting error="context deadline exceeded"
[+] networkd/test: 2024-09-12T19:18:25Z error failed to get status for module error="context deadline exceeded" module=network
[+] networkd/test: ## Status for  network
[-] dhcp-npub4: npub4: probing address 192.168.68.111/24
[-] dhcp-npub4: npub4: soliciting an IPv6 router
[-] dhcp-npub4: npub4: offered 192.168.68.111 from 192.168.68.1
[-] dhcp-npub4: npub4: soliciting a DHCP lease
[+] networkd/test: 2024-09-12T19:18:19Z fatal exiting error="context deadline exceeded"
[+] networkd/test: 2024-09-12T19:18:19Z error failed to get status for module error="context deadline exceeded" module=network
[+] api-gateway: 2024-09-12T19:18:19Z info starting api-gateway module broker=unix:///var/run/redis.sock worker nr=1
[+] api-gateway: 2024-09-12T19:18:19Z info starting peer session= twin=3998
[-] dhcp-npub4: ipv6_addaddr1: Permission denied
[-] dhcp-npub4: npub4: adding address fe80::20bf:7ae4:8b36:75e2
[-] dhcp-npub4: npub4: IAID be:41:2e:c7
[-] dhcp-npub4: DUID 00:04:99:59:a1:a8:a1:d6:00:00:00:00:00:00:00:00:00:00

We have two problems here:

  1. The node is apparently submitting an invalid IP address to TF Chain (the error is triggered by a simple check that the address is longer than seven characters)
  2. The node continues forever in a way that causes its wallet to get drained (I thought fees for node transactions got refunded? Maybe this is just solved by preventing Zos from trying to do this in the first place)

iwanbk commented 1 month ago

The node is apparently submitting an invalid IP address to TF Chain (the error is triggered by a simple check that the address is longer than seven characters)

@scottyeager is it possible for you to enable debug logging? If so, we could get more info by looking at this log message:

"node data have changing, issuing an update node:"

https://github.com/threefoldtech/zos/blob/v3.11.3/pkg/registrar/register.go#L210

Probably the IP address itself is empty.

Now I'm looking into what could cause this empty IP address.

iwanbk commented 1 month ago

Probably the IP address itself is empty.

If possible, please share the output of this command:

ip addr

scottyeager commented 1 month ago

Hi @iwanbk, this is a node running on mainnet, so there's no chance of debug logging or running any commands via SSH. We might be able to ask the farmer to try booting the node to devnet so some dev can get SSH access. That is assuming the node displays the same behavior.

scottyeager commented 6 days ago

This appears to have resolved itself for the case at hand, but I still think it's worth looking into how we can prevent it from happening again.

iwanbk commented 6 days ago

but I still think it's worth looking into how we can prevent it from happening again.

It would be tricky to fix/prevent something when we don't know the root cause yet. But let me check again.

iwanbk commented 6 days ago

From what I can see, the error could be caused by an empty or invalid IP address on the bridge.

Some improvements we could probably make are in this part: https://github.com/threefoldtech/zos/blob/9998be1c6c66c387c106bdee7232cfe7f768ccf4/pkg/registrar/register.go#L113-L116

  1. Change the hardcoded "zos" to a constant. Besides preventing typos, a well-named constant makes the code more meaningful.

  2. Check the returned zosIps and make sure that the length of the string representation is >= 7. This duplicates the check in both the client (Zos) and the server (TF Chain), but I think that is OK because:

    • An IPv4 address having length >= 7 is a general property, not specific to TF Chain
    • It is also common and good practice for a client to validate its input data before sending a request to the server, so that neither side wastes resources

Checking the IP will not solve the underlying issue, but at least we would have better visibility and avoid wasting TFT.
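
The two suggestions above could look roughly like the sketch below. It is an illustration only, not the code in register.go: the names `zosBridge`, `minInterfaceIPLen` and `validateInterfaceIPs` are hypothetical, and it assumes the addresses are available as strings, either plain IPs or in CIDR form.

```go
package main

import (
	"fmt"
	"net"
)

// zosBridge replaces the hardcoded "zos" string with a named constant.
// The constant name is hypothetical.
const zosBridge = "zos"

// minInterfaceIPLen mirrors the length check discussed in this thread.
// 7 is the length of the shortest dotted IPv4 address, such as "1.1.1.1".
const minInterfaceIPLen = 7

// validateInterfaceIPs rejects empty, too-short, or unparsable addresses
// before they are submitted, so the failure is visible locally.
func validateInterfaceIPs(iface string, ips []string) error {
	if len(ips) == 0 {
		return fmt.Errorf("interface %q: no IP addresses", iface)
	}
	for _, s := range ips {
		if len(s) < minInterfaceIPLen {
			return fmt.Errorf("interface %q: IP %q is shorter than %d characters",
				iface, s, minInterfaceIPLen)
		}
		// Accept either a plain IP or an address in CIDR notation.
		if net.ParseIP(s) == nil {
			if _, _, err := net.ParseCIDR(s); err != nil {
				return fmt.Errorf("interface %q: %q is not a valid IP or CIDR", iface, s)
			}
		}
	}
	return nil
}
```

Calling `validateInterfaceIPs(zosBridge, ips)` before issuing the update would let the node log a clear local error and skip the transaction when the list is empty or holds an invalid entry.
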