threefoldtech / zos

Autonomous operating system
https://threefold.io/host/
Apache License 2.0

VM network not okay after node reboot #2101

Open scottyeager opened 1 year ago

scottyeager commented 1 year ago

I've had a VM running on node 426, and noticed today that it wasn't functioning as expected. When I pull up its details in the Playground, I see a couple of networking-related errors:

When I try to ssh to the yggdrasil IP, it hangs.

Checking the node's logs, I do see that it was recently rebooted. The timing seems coincident with the failure.

Here is the full JSON output from the Playground regarding the contract:

{
  "version": 0,
  "twin_id": 18,
  "contract_id": 38884,
  "metadata": "{\"type\":\"vm\",\"name\":\"VM4fd17dde\",\"projectName\":\"\"}",
  "description": "",
  "expiration": 0,
  "signature_requirement": {
    "requests": [
      {
        "twin_id": 18,
        "required": false,
        "weight": 1
      }
    ],
    "weight_required": 1,
    "signatures": [
      {
        "twin_id": 18,
        "signature": "5a525ff5c77cdf23ba445bda59419879e2623c57ab1def36b86e48b4b0cf796b41c5be6684095e9efd6b46e006ea64d8c76c24ceaf1fbbd384bea29df6b7068d",
        "signature_type": "sr25519"
      }
    ],
    "signature_style": ""
  },
  "workloads": [
    {
      "version": 0,
      "name": "DISK94f35bc4",
      "type": "zmount",
      "data": {
        "size": 10737418240
      },
      "metadata": "{\"type\":\"vm\",\"name\":\"VM4fd17dde\",\"projectName\":\"\"}",
      "description": "",
      "result": {
        "created": 1698771470,
        "state": "ok",
        "message": "",
        "data": {
          "volume_id": "18-38884-DISK94f35bc4"
        }
      }
    },
    {
      "version": 0,
      "name": "VM4fd17dde",
      "type": "zmachine",
      "data": {
        "flist": "https://hub.grid.tf/tf-official-apps/threefoldtech-ubuntu-22.04.flist",
        "network": {
          "planetary": true,
          "interfaces": [
            {
              "network": "NW11c3c73c",
              "ip": "10.20.2.2"
            }
          ],
          "public_ip": "VM4fd17dde_pubip"
        },
        "size": 2147483648,
        "mounts": [
          {
            "name": "DISK94f35bc4",
            "mountpoint": "/mnt/"
          }
        ],
        "entrypoint": "/sbin/zinit init",
        "compute_capacity": {
          "cpu": 1,
          "memory": 1073741824
        },
        "env": {
          "SSH_KEY": "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIIWrf27DfpoeyVrktHN+fIQ5WJOTxa/dhpvh+6xwtsT9 tskey-auth-k4Qo6y3CNTRL-BSfQA25YKsJP3dZ8hA9UtJwqwKc7Ecm8M"
        },
        "corex": false
      },
      "metadata": "{\"type\":\"vm\",\"name\":\"VM4fd17dde\",\"projectName\":\"\"}",
      "description": "",
      "result": {
        "created": 1698771470,
        "state": "error",
        "message": "could not get public ip config: public ip workload is not okay",
        "data": {
          "id": "18-38884-VM4fd17dde",
          "ip": "10.20.2.2",
          "ygg_ip": "",
          "console_url": ""
        }
      }
    },
    {
      "version": 0,
      "name": "VM4fd17dde_pubip",
      "type": "ip",
      "data": {
        "v4": false,
        "v6": true
      },
      "metadata": "{\"type\":\"vm\",\"name\":\"VM4fd17dde\",\"projectName\":\"\"}",
      "description": "",
      "result": {
        "created": 1698771470,
        "state": "error",
        "message": "could not look up ipv6 prefix: no public ipv6 found",
        "data": {
          "ip": "",
          "ip6": "",
          "gateway": ""
        }
      }
    },
    {
      "version": 0,
      "name": "NW11c3c73c",
      "type": "network",
      "data": {
        "subnet": "10.20.2.0/24",
        "ip_range": "10.20.0.0/16",
        "wireguard_private_key": "pEsI6G2HYpM6UXZ8Jjh7u/8PBJ4EnITuDxS/z3+VlgE=",
        "wireguard_listen_port": 3470,
        "node_id": 426,
        "peers": []
      },
      "metadata": "{\"type\":\"vm\",\"name\":\"VM4fd17dde\",\"projectName\":\"\"}",
      "description": "",
      "result": {
        "created": 1698771470,
        "state": "ok",
        "message": "",
        "data": null
      }
    }
  ]
}

muhamadazmy commented 1 year ago

Yeah, this probably means the node did not get an IPv6 address after the reboot. It could be an issue with the router again. When that happened, the workloads failed to reconfigure.
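
For anyone triaging a similar report, the failure chain can be read straight from the contract JSON: the `ip` workload fails first ("no public ipv6 found"), and the `zmachine` that references it then fails with "public ip workload is not okay". Below is a minimal Python sketch, not part of zos or the Playground, that lists the workloads whose `result.state` is not `ok`. The helper name `failed_workloads` and the trimmed `sample` dict are illustrative assumptions; the workload names, types, and messages are copied from the JSON above.

```python
from typing import Any


def failed_workloads(deployment: dict[str, Any]) -> list[tuple[str, str, str]]:
    """Return (name, type, message) for each workload whose result state is not "ok"."""
    failures = []
    for workload in deployment.get("workloads", []):
        # A workload without a result yet is treated as not okay.
        result = workload.get("result") or {}
        if result.get("state") != "ok":
            failures.append(
                (workload["name"], workload["type"], result.get("message", ""))
            )
    return failures


# Trimmed copy of the contract JSON from this issue (hypothetical variable name;
# only the fields the helper reads are kept).
sample = {
    "workloads": [
        {"name": "DISK94f35bc4", "type": "zmount",
         "result": {"state": "ok", "message": ""}},
        {"name": "VM4fd17dde", "type": "zmachine",
         "result": {"state": "error",
                    "message": "could not get public ip config: public ip workload is not okay"}},
        {"name": "VM4fd17dde_pubip", "type": "ip",
         "result": {"state": "error",
                    "message": "could not look up ipv6 prefix: no public ipv6 found"}},
        {"name": "NW11c3c73c", "type": "network",
         "result": {"state": "ok", "message": ""}},
    ]
}

if __name__ == "__main__":
    for name, kind, message in failed_workloads(sample):
        print(f"{name} ({kind}): {message}")
```

With the full Playground output saved to a file, the same helper could be fed the result of `json.load` instead of the trimmed dict.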

scottyeager commented 1 year ago

I see. It would be nice if the Wireguard and Yggdrasil interfaces could survive in this case.