operator-framework / operator-lifecycle-manager

A management framework for extending Kubernetes with Operators
https://olm.operatorframework.io
Apache License 2.0

Startup Probe kills "/bin/opm serve" process and prevents operatorhubio pod from starting #3269

Open fjammes opened 5 months ago

fjammes commented 5 months ago

Type of question

General context and help around the operator-sdk

Question

What did you do?

Install operator-sdk v0.28.0

What did you expect to see?

Operator startup

What did you see instead? Under which circumstances?

The operatorhubio pod does not start:

runner@arc-runners-x2src-runner-mxhq2:~$ kubectl get pods -A | grep operatorhubio
olm                  operatorhubio-catalog-gqxnw                  0/1     CrashLoopBackOff   15 (4m6s ago)   55m
runner@arc-runners-x2src-runner-mxhq2:~$ kubectl describe pods -n olm operatorhubio-catalog-gqxnw | tail -n 5
  Normal   Pulled     52m                    kubelet            Successfully pulled image "quay.io/operatorhubio/catalog:latest" in 16.469534578s
  Normal   Created    52m (x2 over 54m)      kubelet            Created container registry-server
  Normal   Started    52m (x2 over 54m)      kubelet            Started container registry-server
  Warning  Unhealthy  5m47s (x150 over 54m)  kubelet            Startup probe failed: timeout: failed to connect service ":50051" within 1s
  Warning  BackOff    42s (x132 over 42m)    kubelet            Back-off restarting failed container
runner@arc-runners-x2src-runner-mxhq2:~$ kubectl logs  -n olm operatorhubio-catalog-gqxnw
time="2024-05-21T10:21:02Z" level=info msg="starting pprof endpoint" address="localhost:6060"
time="2024-05-21T10:21:02Z" level=info msg="found existing cache contents" backend=pogreb.v1 cache=/tmp/cache configs=/configs

The process seems to hang for 2-3 minutes at the step logged above.

Environment

kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3", GitCommit:"25b4e43193bcda6c7328a6d147b1fb73a33f1598", GitTreeState:"clean", BuildDate:"2023-06-14T09:53:42Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.0", GitCommit:"a866cbe2e5bbaa01cfd5e969aa3e033f3282a8a2", GitTreeState:"clean", BuildDate:"2022-09-01T23:30:43Z", GoVersion:"go1.19", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.27) and server (1.25) exceeds the supported minor version skew of +/-1

The cluster is ARC and kind based:

kind version
kind v0.15.0 go1.19 linux/amd64

Additional context

The command /bin/opm serve /configs --cache-dir=/tmp/cache takes ~2-3 minutes to start in this container, and this trips the startup probe. This occurs on only one of our infrastructures. Is there a way to increase the probe duration, or to debug what is happening in the opm process?
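For reference, the probe settings that keep killing the container can be read off the pod spec; a minimal sketch, assuming the container is named registry-server as shown in the events above:

kubectl get pod -n olm operatorhubio-catalog-gqxnw \
  -o jsonpath='{.spec.containers[?(@.name=="registry-server")].startupProbe}{"\n"}'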

jkranner commented 4 months ago

I am also facing this issue. Pod operatorhubio-catalog-ql6bs reports: Startup probe failed: timeout: failed to connect service ":50051" within 1s, and then keeps crash-looping.

fjammes commented 3 months ago

Still blocked on this issue when OLM runs on servers with slow disks. Is there a way to configure the startupProbe through the OLM install procedure?

mateuszkca commented 3 months ago

I have this problem too. I found that it only affects k8s running on workers with CentOS 9/Rocky 9, regardless of Docker version. On those nodes the catalog pod takes about 75 seconds to start. On CentOS 8/Rocky 8 there is no problem and the catalog pod starts in about 6 seconds. Any progress in resolving this problem?
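To compare affected and unaffected nodes, a minimal sketch that dumps each node's OS image, kernel, and container runtime (standard nodeInfo fields; nothing specific to this issue):

kubectl get nodes -o custom-columns=NAME:.metadata.name,OS:.status.nodeInfo.osImage,KERNEL:.status.nodeInfo.kernelVersion,RUNTIME:.status.nodeInfo.containerRuntimeVersion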

grokspawn commented 2 months ago

OLMv0 does not currently support a configurable startupProbe (ref). If it's not too late in the day for me to do the math, that is 100 seconds allowed for startup. Without a better understanding of what's going on here, I'd be reluctant to advocate for an arbitrary duration bump, because that might just push the issue off to another day.
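For context, that probe is shaped roughly like the following; this is a sketch reconstructed from the failure message ("failed to connect service ":50051" within 1s") and the 100-second arithmetic (periodSeconds x failureThreshold), not a verbatim copy of the OLM sources:

startupProbe:
  exec:
    # command assumed from the format of the probe failure message above
    command: ["grpc_health_probe", "-addr=:50051"]
  periodSeconds: 10
  failureThreshold: 10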

There are a couple of things that you could do to try to get a better understanding of why your catalog pods are taking so long to be ready:

  1. you can disable cache validation in running instances of your catalog. Right now it looks like you have a pre-generated catalog cache, and opm will validate that cache when opm serve starts. If you disable this and get a better experience, then we'd have a better focus on what needs to be improved (see the first sketch after this list); and/or
  2. opm serve exposes a pprof endpoint for creation-to-readiness CPU profiling, discrete from the optional default pprof endpoint. You would need to port-forward (or exec into the pod) to reach localhost:6060 (see the second sketch after this list).
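Two sketches for the suggestions above. For (1), cache validation is controlled by an opm serve flag; --cache-enforce-integrity=false is an assumption about the relevant flag name, so confirm it against opm serve --help for the opm version baked into the catalog image. One way to try it outside the cluster and watch how long startup takes:

docker run --rm --entrypoint /bin/opm quay.io/operatorhubio/catalog:latest \
  serve /configs --cache-dir=/tmp/cache --cache-enforce-integrity=false

For (2), the endpoint from the startup log (localhost:6060) is reachable through a port-forward; the /debug/pprof/profile path assumes opm wires up the standard Go net/http/pprof handlers, which may vary by opm version:

kubectl port-forward -n olm operatorhubio-catalog-gqxnw 6060:6060
# in a second terminal, capture a 30-second CPU profile
go tool pprof 'http://localhost:6060/debug/pprof/profile?seconds=30'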