threefoldtech / 0-db

Fast write ahead persistent redis protocol key-value store
Apache License 2.0

ZDB status not stable with Yggdrasil #167

Closed A-Harby closed 2 months ago

A-Harby commented 3 months ago

I have a few questions. First, why would a ZDB go down? What could be the reason? [image]

{
  "version": 0,
  "twin_id": 162,
  "contract_id": 132880,
  "metadata": "{\"version\":3,\"type\":\"vm\",\"name\":\"node159meta\",\"projectName\":\"node159meta\"}",
  "description": "",
  "expiration": 0,
  "signature_requirement": {
    "requests": [
      {
        "twin_id": 162,
        "required": false,
        "weight": 1
      }
    ],
    "weight_required": 1,
    "signatures": [
      {
        "twin_id": 162,
        "signature": "6ed02e414562292e50b34f1ae325bd4597bb364bb0d4b85d47a8ef9a82bbb557d21d42ee9cd9b959c381ead802728b36659e4490eb17612d9e3e453c0cfe3b83",
        "signature_type": "sr25519"
      }
    ],
    "signature_style": ""
  },
  "workloads": [
    {
      "version": 0,
      "name": "node159meta0",
      "type": "zdb",
      "data": {
        "size": 1073741824,
        "mode": "user",
        "password": "password",
        "public": false
      },
      "metadata": "",
      "description": "",
      "result": {
        "created": 1721637882,
        "state": "ok",
        "message": "",
        "data": {
          "Namespace": "162-132880-node159meta0",
          "IPs": [
            "2a02:1802:5e:14:90cf:5dff:fe6f:3e58",
            "300:cb94:a268:f50:b062:7056:cca5:43e5",
            "4a9:556a:be87:fac0:98c4:a2d9:c7c3:6973"
          ],
          "Port": 9900
        }
      }
    }
  ]
}

Second, why would a ZDB's status be up, then down, then up again? [image]

A-Harby commented 3 months ago

And it happened again to another ZDB. [image]

maxux commented 3 months ago

Can you check the zdb logs? There can be a lot of reasons :p If you hit a connection refused, zdb is not listening or has crashed, and we need the logs to know why. It could be a lack of disk space, for example. zdb should not crash, but some edge cases may not be handled.
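To tell a "connection refused" apart from a plain network timeout, a small TCP probe can help. The following is only a minimal sketch: `probe_tcp` is a hypothetical helper name, not part of zdb or Zos, and it only checks whether something accepts a connection on the port, not whether zdb itself is healthy.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Try a TCP connection and report how it ended.

    Hypothetical helper for illustration only. Returns one of:
    "open"     - something accepted the connection
    "refused"  - the host answered, but nothing listens on the port
    "timeout"  - no answer in time (often a routing/network problem)
    "error: …" - any other socket-level failure
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"
```

For the deployment above, this could be called as `probe_tcp("2a02:1802:5e:14:90cf:5dff:fe6f:3e58", 9900)`. A "refused" result points at the zdb process not listening, while a "timeout" is more likely a problem on the path to the node.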

A-Harby commented 3 months ago

> Can you check the zdb logs? There can be a lot of reasons :p If you hit a connection refused, zdb is not listening or has crashed, and we need the logs to know why. It could be a lack of disk space, for example. zdb should not crash, but some edge cases may not be handled.

Can I know how to check ZDB logs? It would be great to know all the ways to get any other logs as well.

scottyeager commented 3 months ago

These zdbs are running inside of Zos. I don't know whether those logs get shipped to Loki with the rest of the node logs. At least, I couldn't find any in a few searches just now, and I don't recall ever seeing them.

In this case, though, I think we should look at network connectivity as the first and most likely failure point. The IP addresses starting with 300 and 301 are Yggdrasil IPs. Taking Yggdrasil out of the picture would be a good first step, since we know its performance and availability aren't consistent.

The rest are connecting over Mycelium. I checked the logs from a couple of the nodes causing the dropouts (devnet). On node 159 I found a lot of errors about connecting to Mycelium peers, and the node was also failing the network health checks in general. So I wonder whether the node just doesn't have a healthy network overall.
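To see which network each zdb address belongs to, the addresses can be matched against known prefixes. This is a rough sketch under assumptions: the ranges used here (`200::/7` for Yggdrasil, `400::/7` for Mycelium, `2000::/3` for public IPv6) are what those projects document as far as I know, and `classify` is a made-up helper name.

```python
import ipaddress

# Assumed prefixes; verify against the Yggdrasil and Mycelium docs.
KNOWN_RANGES = [
    ("yggdrasil", ipaddress.ip_network("200::/7")),
    ("mycelium", ipaddress.ip_network("400::/7")),
    ("public-ipv6", ipaddress.ip_network("2000::/3")),
]


def classify(address):
    """Return the name of the first known range containing the address."""
    ip = ipaddress.ip_address(address)
    for name, network in KNOWN_RANGES:
        # Compare versions first so IPv4 input simply falls through.
        if ip.version == network.version and ip in network:
            return name
    return "other"


# The three addresses from the workload result in this issue.
for addr in (
    "2a02:1802:5e:14:90cf:5dff:fe6f:3e58",
    "300:cb94:a268:f50:b062:7056:cca5:43e5",
    "4a9:556a:be87:fac0:98c4:a2d9:c7c3:6973",
):
    print(addr, "->", classify(addr))
```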

My suggestions to help narrow this down would be:

  1. Also test using the IPv6 addresses (those starting with 2...) to eliminate the possibility of Mycelium-related issues
  2. Try some nodes on testnet or mainnet, since we know that devnet nodes are not necessarily the fittest hardware and are sometimes heavily loaded

If network connectivity can reasonably be ruled out as the root cause, that's when I'd go looking for potential issues in zdb itself.
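One way to apply this reasoning to monitoring data is to compare the paths at each check. If the zdb answers over one address while another fails at the same moment, the process is up and the failure sits on the network side. The sketch below assumes a hypothetical sample format of `(timestamp, path, ok)` tuples, and `attribute_failures` is an illustrative name, not an existing tool.

```python
from collections import defaultdict


def attribute_failures(samples):
    """Group probe results by timestamp and label each moment.

    samples: iterable of (timestamp, path, ok) tuples (assumed format).
    Labels:
      "up"               - every path answered
      "network:<paths>"  - some path answered, so zdb was up and the
                           listed paths failed on the network side
      "unknown"          - no path answered; could be zdb, the node,
                           or its whole network
    """
    by_time = defaultdict(dict)
    for timestamp, path, ok in samples:
        by_time[timestamp][path] = ok

    verdicts = {}
    for timestamp, results in sorted(by_time.items()):
        if all(results.values()):
            verdicts[timestamp] = "up"
        elif any(results.values()):
            failed = sorted(p for p, ok in results.items() if not ok)
            verdicts[timestamp] = "network:" + ",".join(failed)
        else:
            verdicts[timestamp] = "unknown"
    return verdicts
```

Only the moments labeled "unknown" would then justify digging into zdb itself, and even those still need the zdb or node logs to confirm.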

A-Harby commented 3 months ago

I saw different results after connecting over IPv6 instead of Yggdrasil or Mycelium.

[image]

So maybe we should connect over IPv6 until Yggdrasil and Mycelium are stable for ZDB.

scottyeager commented 2 months ago

I think this is explained by https://github.com/threefoldtech/zos/issues/2403

So not really a Zdb issue. I think we can close this one, if that's okay with you @A-Harby.

A-Harby commented 2 months ago

I agree it can be closed, as long as it is tracked in the other issue: https://github.com/threefoldtech/zos/issues/2403.