rook / rook

Storage Orchestration for Kubernetes
https://rook.io
Apache License 2.0

Add curl timeout to the rgw-probe.sh healthcheck script #14143

Open kayrus opened 2 weeks ago

kayrus commented 2 weeks ago

Is this a bug report or feature request?

Deviation from expected behavior:

When there is a network hiccup, the bash scripts with a curl probe start to accumulate. By default, curl has a very high connect timeout, e.g.:

$ time curl 8.8.8.8
curl: (28) Failed to connect to 8.8.8.8 port 80 after 134639 ms: Connection timed out

real    2m14,677s
user    0m0,007s
sys     0m0,032s
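For contrast, either flag caps that wait explicitly (a minimal sketch, with 8.8.8.8 again standing in for an unreachable endpoint): --connect-timeout limits only the TCP connect phase, while --max-time limits the whole request.

$ curl --connect-timeout 3 8.8.8.8   # give up if the TCP connect takes longer than 3 s
$ curl --max-time 3 8.8.8.8          # give up if the whole request takes longer than 3 s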

And the probes are:

        failureThreshold: 3
        initialDelaySeconds: 10
        periodSeconds: 10
        successThreshold: 3
        timeoutSeconds: 5
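The rough arithmetic behind the accumulation (assuming the runtime does not kill timed-out probe processes):

# curl default connect wait:  ~134 s (observed above)
# probe periodSeconds:          10 s (a new probe command is launched every 10 s)
# => up to ~13 bash/curl pairs could be alive at once while the endpoint hangs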

Expected behavior:

The curl probe should use a connect timeout of about 3 seconds, e.g. --connect-timeout 3 or --max-time 3.

How to reproduce it (minimal and precise):

n/a

File(s) to submit:

diff --git a/pkg/operator/ceph/object/rgw-probe.sh b/pkg/operator/ceph/object/rgw-probe.sh
index 853b430bc..20297f572 100644
--- a/pkg/operator/ceph/object/rgw-probe.sh
+++ b/pkg/operator/ceph/object/rgw-probe.sh
@@ -17,11 +17,12 @@ RGW_URL="$PROBE_PROTOCOL://0.0.0.0:$PROBE_PORT"

 function check() {
   local URL="$1"
+  # --max-time 3 - override the default 2 minutes wait time down to 3 seconds
   # --insecure - don't validate ssl if using secure port only
   # --silent - don't output progress info
   # --output /dev/stderr - output HTML header to stdout (good for debugging)
   # --write-out '%{response_code}' - print the HTTP response code to stdout
-  curl --insecure --silent --output /dev/stderr --write-out '%{response_code}' "$URL"
+  curl --max-time 3 --insecure --silent --output /dev/stderr --write-out '%{response_code}' "$URL"
 }

 http_response="$(check "$RGW_URL")"
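The patched check() can also be exercised standalone (a hypothetical run; 203.0.113.1 and port 8080 are placeholders for an unreachable endpoint, whereas the real script targets $PROBE_PROTOCOL://0.0.0.0:$PROBE_PORT):

$ time curl --max-time 3 --insecure --silent --output /dev/stderr --write-out '%{response_code}' http://203.0.113.1:8080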

Logs to submit:

Cluster Status to submit:

Environment:

BlaineEXE commented 2 weeks ago

I'll take a look at this in the coming days. I want to be sure that the best solution is indeed the curl --max-time flag. When I was implementing the RGW probe, I remember looking into that flag and rejecting it, but I forget my rationale at present.

BlaineEXE commented 2 weeks ago

@kayrus what do you mean by "accumulate" here?

When there is a network hiccup, the bash scripts with a curl probe start to accumulate

Can you provide logs that show what you mean?

kayrus commented 2 weeks ago

Can you provide logs that show what you mean?

There are no logs; there are just dozens of bash and curl processes trying to access the endpoint.

BlaineEXE commented 2 weeks ago

How are you determining that to be true? I would like to see some kind of output as proof that this is happening. My understanding of the kubernetes probes is that when they time out, kubernetes kills the process (bash), which should also kill the curl process in the script.

BlaineEXE commented 2 weeks ago

The dilemma I face is thus:

Because of the risk of introducing errors by making the proposed change, I want to make sure we are certain that there is an issue present. Rook has gotten the pod probes wrong in the past, and it has taken us a lot of effort to develop probes that are useful and bug-free, and to make sure that we are not implementing probes or probe features that risk causing cascading system failures.

We have also had Rook users in the past submit issues based on theory rather than on observed behavior. Since you are not providing evidence via observed behavior, I am still concerned that the reasoning behind the creation of this issue could be theoretical and not evidence-based. When users have reported theoretical issues in the past, it has led to us introducing unnecessary complexity into Rook, and it has sometimes caused bugs.

I did some in-depth investigation today, and I have findings to share.

Findings

Research

This is the information I can find on the topic of probe "accumulation" (from June 2022): https://github.com/kubernetes/kubernetes/issues/110070#issuecomment-1150217612

In summary, cri-o and containerd runtimes will terminate probe processes if they pass the timeout threshold, preventing process "accumulation". The dockershim runtime does not have this ability, but since dockershim is not a supported K8s configuration, k8s will not fix the behavior. Because dockershim was already an unsupported runtime 2 years ago, I don't see any compelling reason that we should consider this a Rook bug in 2024.

Testing

To confirm that there isn't a bash issue present, I looked into how the script behaves in isolation when it receives a SIGTERM signal. When the SIGTERM signal is received while http_response="$(check "$RGW_URL")" is running, I see that the curl process stops and does not remain running.
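For reference, that test can be reproduced along these lines (a sketch of the approach, not the exact commands I ran; it assumes the script's target has been pointed at a non-routable address so curl blocks in its connect phase):

$ bash ./rgw-probe.sh &        # check() is running, curl blocked on connect
$ sleep 2; kill -TERM %1       # send SIGTERM to the bash process
$ sleep 1; pgrep -af curl      # verify whether a curl child is still running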

Conclusion

Because this involves liveness probes, I am taking this issue seriously. That also means that I am doing my own due diligence to be certain that Rook is doing the best thing for all its users.

And because my research and testing didn't find anything wrong, it means that I must put the burden of proof on you to show that the issue is real. For us to move forward with the bug fix, please show us via some sort of console output that the process accumulation is definitely happening in your environment and that the container runtime is something other than dockershim.
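In case it helps, both pieces of evidence could be captured along these lines (a hedged sketch; the namespace and RGW deployment name are placeholders, and it assumes ps is available in the container image):

$ kubectl get nodes -o wide    # the CONTAINER-RUNTIME column shows e.g. containerd://1.7.x
$ kubectl -n rook-ceph exec deploy/rook-ceph-rgw-my-store-a -- ps -ef | grep -E 'rgw-probe|[c]url'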

kayrus commented 2 weeks ago

@BlaineEXE thanks a lot for your extra research. I'll try to provide more evidence next week.

BlaineEXE commented 4 days ago

Hi @kayrus I wanted to check in to see how things are progressing on your side.

I'm also happy to try to see if I can repro, if you can help me understand what you mean by "network hiccup." I'm not sure what would cause networking between a probe and pod to stall, but I might be able to simulate it if I understood more about the situation you're describing.
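One way a connect-phase stall is sometimes simulated (a sketch, not something from this thread; port 8080 is a placeholder for the RGW port) is to silently drop packets on the RGW host so connections hang instead of being refused:

$ iptables -A INPUT -p tcp --dport 8080 -j DROP   # connections to the RGW port now hang
$ # ...observe probe and process behavior...
$ iptables -D INPUT -p tcp --dport 8080 -j DROP   # remove the rule when done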

kayrus commented 2 days ago

@BlaineEXE apologies, I had other more pressing issues to fix, like keystone auth.

Regarding the hung curl processes: in our case we're using containerd, and according to the ticket you referenced, this problem shouldn't occur with containerd. I need some more time to reproduce the issue on our side.