pravega / zookeeper-operator

Kubernetes Operator for Zookeeper

zookeeper ruok command check occasionally fails #475

Closed: alfred-landrum closed this issue 2 years ago

alfred-landrum commented 2 years ago

Description

The zookeeper liveness and readiness checks use the zookeeper ruok command like so:

OK=$(echo ruok | nc 127.0.0.1 $CLIENT_PORT)

# Check to see if zookeeper service answers
if [[ "$OK" == "imok" ]]; then
...
fi

However, if I sit in a shell and run that check manually, I very occasionally get a blank response instead of the expected imok, even though the service seems fine and running the command again immediately returns the expected string.

I believe the root cause is that we occasionally see a TCP reset from the zookeeper command socket:

(screenshot: packet capture showing the TCP reset from the ZooKeeper client port)

From looking at the sequence numbers, I think the server receives an ACK from the client side after it has already started shutting down its socket state. Even though the imok response is in flight, or already sitting in the TCP receive buffers, nc doesn't read the socket data and write it to its standard output.

I replaced nc with socat and ran a test on the zookeeper instance. I temporarily disabled the liveness and readiness checks and, with a tcpdump running, ran a script that repeated the check until it failed. Plain nc would fail within a few seconds; nc -q 0 (close the write side as soon as possible) would fail within tens of seconds. Plain socat (echo ruok | socat stdio tcp4:127.0.0.1:2181) never failed, even though the tcpdump showed it had also seen TCP reset packets.
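For reference, a minimal sketch of that kind of repro setup; the port, interface, and loop structure here are assumptions for illustration, not the exact script used in the test:

# Capture TCP resets on the loopback interface while the loop runs (assumed client port 2181).
tcpdump -i lo -nn 'tcp port 2181 and tcp[tcpflags] & tcp-rst != 0' &

# Repeat the ruok check until it returns something other than imok.
while true; do
  OK=$(echo ruok | nc 127.0.0.1 2181)
  if [[ "$OK" != "imok" ]]; then
    echo "check failed, got: '$OK'"
    break
  fi
done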

We're updating our local zookeeper image to install and use socat instead of nc.

Importance

Under a loaded system, this has sometimes caused the liveness probe to fail frequently enough for Kubernetes to kill the pod, so I think it's important that this be resolved.

Suggestions for an improvement

Update the zookeeper probes to install and use socat instead of nc.
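For illustration, a minimal sketch of what the probe check could look like with nc swapped for socat; the $CLIENT_PORT variable and the surrounding structure follow the existing probe snippet above, and the trailing ... stands in for the rest of that script:

# socat returned imok reliably in the test above, even when TCP resets were observed.
OK=$(echo ruok | socat stdio tcp4:127.0.0.1:$CLIENT_PORT)

# Check to see if zookeeper service answers
if [[ "$OK" == "imok" ]]; then
...
fi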

Slach commented 2 years ago

@alfred-landrum thanks a lot for your report. I'm struggling with the same issue, and I hope your PR will be merged and included in the next release.

@nishant-yt could you look at this PR?

karlwowza commented 2 years ago

We have the same problem in our K8s cluster. It would be great to get this PR reviewed and merged. @alfred-landrum thanks for your effort!

dariothornhill commented 2 years ago

Is there any update on when this will be merged?

mailsonsantana commented 2 years ago

Same problem here. Is there any update on when this will be merged?