Closed A-Harby closed 2 months ago
And it happened again to another ZDB.
Can you check the zdb logs? There can be a lot of reasons :p If you hit a connection refused, zdb is not listening or has crashed, and we need the logs to know why. It could be that there is no more disk space, for example. zdb should not crash, but some edge cases may not be supported.
Can I know how to check ZDB logs? It would be great to know all the ways to get any other logs as well.
These zdbs are running inside of Zos. I don't know if those logs get shipped over Loki with the rest of the node logs. At least, I'm not able to find any in some searches now and I don't recall ever seeing them.
In this case though I think we should be looking at the network connectivity as the first and most likely failure point. The IP addresses starting with 300 and 301 are Yggdrasil IPs. Taking Yggdrasil out of the picture would be a good first step since we know the performance and availability aren't consistent.
The rest are connecting over Mycelium. I checked the logs from a couple of the nodes causing the dropouts (devnet). On node 159 I found a lot of errors about connecting to Mycelium peers, and the node was also failing the network health checks in general. So I wonder if the node just doesn't have a healthy network.
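To make it easier to tell which overlay a given zdb address belongs to, here is a minimal Python sketch. The prefixes are assumptions that are not stated in this thread: `200::/7` for Yggdrasil (which covers the 300 and 301 addresses mentioned above) and `400::/7` for Mycelium. The function name `classify` is hypothetical.

```python
import ipaddress

# Assumed overlay prefixes (not confirmed in this thread):
# Yggdrasil allocates from 200::/7, Mycelium from 400::/7.
YGGDRASIL = ipaddress.ip_network("200::/7")
MYCELIUM = ipaddress.ip_network("400::/7")


def classify(addr: str) -> str:
    """Return which kind of network an IP address string belongs to."""
    ip = ipaddress.ip_address(addr)
    if ip.version == 4:
        return "ipv4"
    if ip in YGGDRASIL:
        return "yggdrasil"
    if ip in MYCELIUM:
        return "mycelium"
    # Anything else is treated as a regular (non-overlay) IPv6 address.
    return "ipv6"
```

Grouping the failing endpoints by the result of `classify` should show quickly whether the dropouts are concentrated on one overlay.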
My suggestions to help narrow this down would be:
2...
) to eliminate the possibility of Mycelium-related issues
If network connectivity can reasonably be ruled out as the root cause, that's when I'd go looking for potential issues in zdb itself.
I have seen different results after using IPv6 to connect instead of ygg or Mycelium.
So maybe we should connect over IPv6 until ygg and Mycelium are stable enough for the ZDB.
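For comparing the same zdb over its IPv6, ygg and Mycelium addresses, a rough Python sketch of a TCP probe could look like this. It separates "connection refused" (the host is reachable but nothing is listening, so zdb itself is the suspect) from a timeout or unreachable error (which points at the network path). The function name `probe` is hypothetical, and the default port 9900 is an assumption about how the zdb is exposed, so adjust it to your deployment.

```python
import socket


def probe(host: str, port: int = 9900, timeout: float = 3.0) -> str:
    """Try a plain TCP connect and report how it ended.

    The port default (9900) is an assumption, not taken from this thread.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Host answered, but no process is listening on the port.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout: likely a network problem.
        return "timeout"
    except OSError as exc:
        # No route, unreachable network, name resolution failure, etc.
        return f"unreachable: {exc}"
```

Running `probe` against each address of the same zdb at regular intervals would show whether the status changes follow one network path or affect all of them at once.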
I think this is explained by https://github.com/threefoldtech/zos/issues/2403
So not really a Zdb issue. I think we can close this one, if that's okay with you @A-Harby.
I agree it can be closed as long as it is tracked in another issue https://github.com/threefoldtech/zos/issues/2403.
I have a few questions. First, why would a ZDB go down? What could be the reason?
Second, why would a zdb's status be up, then down, then up again?