guimou / nifi-on-openshift

Apache NiFi deployment on OpenShift
MIT License

Zookeeper: Liveness and Readiness Probes are Failing #6

Open WalterMoar opened 2 years ago

WalterMoar commented 2 years ago

When deployed in OpenShift 4.9 we're getting recurring probe failures for all Zookeeper pods:

[image: recurring probe failure events for the Zookeeper pods]

In ss-nifi-zookeeper.yaml the probes are defined as echo "ruok" | timeout 2 nc -w 2 localhost 2181 | grep imok. By repeatedly running echo ruok | nc localhost 2181 inside a pod I can reproduce the problem, and it is not due to timeouts: the ruok command returns a blank response, and does so in less than a second. Since the failureThreshold for the probes is 6, the pods are never restarted. It's good that there are no restarts, but bad that the probes are failing so frequently.
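To run the exact probe command once by hand, something like the following works from outside the pod (the namespace and pod name here are placeholders for whatever your deployment uses):

# Run the probe command inside a Zookeeper pod. "nifi" and "nifi-zookeeper-0"
# are placeholder names, not necessarily what your deployment uses.
oc exec -n nifi nifi-zookeeper-0 -- bash -c 'echo "ruok" | timeout 2 nc -w 2 localhost 2181 | grep imok'

A healthy run prints imok; a failing run prints nothing, which is what the probes are seeing.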

Script to run in the pod to look for empty responses:

#!/bin/bash

# Repeatedly send the "ruok" four-letter-word command to Zookeeper and log any
# run where the response is missing or is not "imok", along with nc's exit
# code and timestamps from before and after the call.
while true; do
    BEFORE=$(date)
    RESPONSE=$(echo ruok | nc localhost 2181)
    RETVAL=$?
    AFTER=$(date)

    if [ -z "$RESPONSE" ]; then
        echo "$RETVAL $BEFORE $AFTER empty!"
    elif [ "$RESPONSE" != imok ]; then
        echo "$RETVAL $BEFORE $AFTER $RESPONSE"
    fi

    # sleep 0.25
done

The output of the script looks like:

0 Tue Jul 5 15:35:01 UTC 2022 Tue Jul 5 15:35:01 UTC 2022 empty!
0 Tue Jul 5 15:38:02 UTC 2022 Tue Jul 5 15:38:02 UTC 2022 empty!
0 Tue Jul 5 15:45:08 UTC 2022 Tue Jul 5 15:45:09 UTC 2022 empty!

I also tried running nc with verbose output: it reports Connection to localhost (::1) 2181 port [tcp/*] succeeded! even when there is no response from Zookeeper, so the TCP connection itself is being established. I also tried 127.0.0.1 instead of localhost, but the same problem happens.
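For reference, the verbose check is just the same pipeline with nc's -v flag, presumably along these lines:

# Verbose mode: -v makes nc report whether the TCP connection was established,
# independently of whether Zookeeper sends anything back.
echo ruok | nc -v localhost 2181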

WalterMoar commented 2 years ago

In the nifi-zookeeper statefulset I set the _ZOO_LOGLEVEL environment variable from ERROR to INFO. This gives logging like:

2022-07-19 19:17:20,427 [myid:1] - INFO  [NIOWorkerThread-1:NIOServerCnxn@514] - Processing ruok command from /0:0:0:0:0:0:0:1:42920
2022-07-19 19:17:20,429 [myid:1] - INFO  [NIOWorkerThread-2:NIOServerCnxn@514] - Processing ruok command from /0:0:0:0:0:0:0:1:42922
2022-07-19 19:17:30,429 [myid:1] - INFO  [NIOWorkerThread-1:NIOServerCnxn@514] - Processing ruok command from /0:0:0:0:0:0:0:1:43622
2022-07-19 19:17:30,432 [myid:1] - INFO  [NIOWorkerThread-2:NIOServerCnxn@514] - Processing ruok command from /0:0:0:0:0:0:0:1:43626
2022-07-19 19:17:40,329 [myid:1] - INFO  [NIOWorkerThread-1:NIOServerCnxn@514] - Processing ruok command from /0:0:0:0:0:0:0:1:44294
2022-07-19 19:17:40,331 [myid:1] - INFO  [NIOWorkerThread-2:NIOServerCnxn@514] - Processing ruok command from /0:0:0:0:0:0:0:1:44298

Every 10s both the liveness and readiness probes run. I waited until one of the probes failed, but the logs stay the same: the server receives the ruok command every time. I tried setting the logging to DEBUG and then TRACE, but no additional logging appears.

The server is always getting the ruok command and is not producing an error in the logs. Why is the response not being returned?
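For completeness, the log-level change described at the top of this comment can be applied with something like this (assuming the statefulset name and variable name match what this repo uses):

# Bump the Zookeeper log level on the statefulset. Assumes the statefulset is
# named nifi-zookeeper and the image honours the _ZOO_LOGLEVEL variable.
oc set env statefulset/nifi-zookeeper _ZOO_LOGLEVEL=INFO

This triggers a rolling restart of the Zookeeper pods so they pick up the new setting, after which the ruok processing lines above start appearing in the logs.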

WalterMoar commented 2 years ago

It seems that echoing data into nc means its stdin hits EOF almost immediately, which causes nc to send the TCP FIN flag, and the connection is sometimes dropped before a response is received. I see a Kafka PR where the -q 1 flag is used, which makes nc wait one second after EOF on stdin before quitting, so the server has time to respond before the FIN goes out. That change has reduced probe failures from about 200 per day to around 40, but 40 is still too high. I'm continuing to look at the problem, as the final solution may make the -q 1 flag unnecessary.
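To make that concrete, the change on the probe command itself would look something like the lines below. Whether nc accepts -q depends on which netcat variant the image ships, so treat this as a sketch rather than the final fix:

# Current probe command: nc can quit as soon as stdin is exhausted, so the FIN
# may go out before Zookeeper has written its reply.
echo "ruok" | timeout 2 nc -w 2 localhost 2181 | grep imok

# With -q 1 (assuming the image's nc supports it), nc waits one second after
# EOF on stdin before quitting, giving the reply time to arrive.
echo "ruok" | timeout 2 nc -q 1 -w 2 localhost 2181 | grep imok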

guimou commented 2 years ago

Hey! Sorry, I have not had much time to put into this yet. But last week I set it up in my dev environment and let it run for a few days, and I did not see a single error. Do you have heavy load on the node, the network, or anything else that could explain this?

WalterMoar commented 2 years ago

No worries, and thank you for your response!

We're running in a cluster with 40+ application nodes and many, many namespaces and pods. There is a maximum load allowed per node, so we shouldn't be CPU bound or anything like that. The probes go against localhost, so I don't think the network should affect them, but as a tenant in the cluster with limited permissions it's hard for us to know what the network load is.

WalterMoar commented 2 years ago

I appear to have found a stable configuration by using the -q 1 flag plus excessively high CPU and memory limits. So a fix is possible, but I need to figure out how low I can make the limits. In a week or so I should have a reasonable set of limits sorted out.
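In case it helps anyone following along, the resource side of that can be applied with something like the following; the request and limit values are placeholders, not the numbers being tested here:

# Raise the Zookeeper CPU/memory requests and limits on the statefulset.
# The values below are illustrative placeholders only.
oc set resources statefulset/nifi-zookeeper \
  --requests=cpu=500m,memory=1Gi \
  --limits=cpu=2,memory=2Gi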