Open fjammes opened 5 months ago
I am also facing this issue.
Pod: operatorhubio-catalog-ql6bs
Startup probe failed: timeout: failed to connect service ":50051" within 1s
Then keeps crash-looping.
Still blocked on this issue when OLM runs on servers with slow disks. Is there a way to configure the startupProbe through the OLM install procedure?
I have this problem too. I found that it only affects k8s running on workers with CentOS 9/Rocky 9, regardless of Docker version. On such a node the catalog pod takes about 75 seconds to start. On CentOS 8/Rocky 8 there is no problem and the catalog pod starts in 6 seconds. Any progress in resolving this problem?
OLMv0 does not currently support a configurable startupProbe. If it's not too late in the day for me to math, that is 100 seconds of startup delay. Without a better understanding of what's going on here, I'd be reluctant to advocate for any arbitrary duration bump, because it might just be pushing off the issue to another day.
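As a rough illustration of where a figure like 100 seconds can come from: Kubernetes gives a container roughly initialDelaySeconds + failureThreshold × periodSeconds to pass its startup probe before restarting it. The sketch below uses a period of 10 s and a threshold of 10 failures purely as assumed values that reproduce the 100-second figure; they are not taken from this thread and should be checked against the actual catalog pod spec.

```python
def startup_budget_seconds(period_seconds, failure_threshold, initial_delay_seconds=0):
    """Approximate time a container has to pass its startup probe."""
    return initial_delay_seconds + period_seconds * failure_threshold


def exceeds_budget(observed_startup_seconds, budget_seconds):
    """True when the observed startup time is longer than the probe allows."""
    return observed_startup_seconds > budget_seconds


# Assumed probe settings (hypothetical, chosen to match the 100 s figure).
budget = startup_budget_seconds(period_seconds=10, failure_threshold=10)
print(f"startup budget: {budget}s")

# 150 s is an illustrative startup time in the 2 to 3 minute range.
print("150s startup exceeds budget:", exceeds_budget(150, budget))
```

Reading the real values with kubectl get pod <catalog-pod> -o yaml and plugging them in shows how far a slow node is from the limit.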
There are a couple of things that you could do to try to get a better understanding of why your catalog pods are taking so long to be ready:
- opm will validate that cache when opm serve starts. If you disable this and get a better experience then we'd have a better focus on what needs to be improved; and/or
- opm serve exposes a pprof endpoint for creation...readiness CPU profiling, discrete from an optional default pprof endpoint. You would need to port-forward (or ssh to the pod and access) localhost:6060.
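Once localhost:6060 is reachable (for example through a port-forward), a profile could be downloaded with a small script like the one below. This is only a sketch: the path /debug/pprof/profile and the seconds query parameter are the conventions of Go's standard net/http/pprof handler, and it is an assumption that the opm startup profile is served the same way. The helper names profile_url and save_profile are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def profile_url(host="localhost", port=6060, path="/debug/pprof/profile", seconds=30):
    """Build the URL of a pprof CPU profile endpoint.

    The default path follows Go's net/http/pprof; the endpoint that opm
    uses for startup profiling may differ (assumption).
    """
    return f"http://{host}:{port}{path}?{urlencode({'seconds': seconds})}"


def save_profile(url, dest, timeout_seconds=60):
    """Download the profile at `url` and write it to the file `dest`."""
    with urlopen(url, timeout=timeout_seconds) as resp, open(dest, "wb") as out:
        out.write(resp.read())
    return dest


# Example (requires the endpoint to be reachable):
#   save_profile(profile_url(seconds=30), "opm-cpu.pprof")
print(profile_url())
```

The saved file can then be inspected with go tool pprof opm-cpu.pprof.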
Type of question
General context and help around the operator-sdk
Question
What did you do?
Install operator-sdk v0.28.0
What did you expect to see?
Operator startup
What did you see instead? Under which circumstances?
Operatorhubio pod does not start:
The process seems to freeze for 2 to 3 minutes at the step logged above.
Environment
operator-lifecycle-manager version: v0.28.0
Kubernetes version information:
ARC and kind based:
Additional context
The command
/bin/opm serve /configs --cache-dir=/tmp/cache
takes roughly 2 to 3 minutes to start in this container, and this triggers the startupProbe. This occurs on only one of our infrastructures. Is there a way to increase the probe duration or to debug what's happening in the opm process?
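One low-tech way to compare infrastructures is to measure how long the gRPC port takes to start accepting connections. The sketch below polls a TCP port (50051 in the probe message above) with a 1-second connect timeout, similar in spirit to the probe's "within 1s" check. The function name seconds_until_listening is hypothetical, and a successful TCP connect is only an approximation of the real health probe.

```python
import socket
import time


def seconds_until_listening(host, port, deadline_s, interval_s=0.5):
    """Poll host:port until a TCP connect succeeds.

    Returns the elapsed seconds, or None if deadline_s passes first.
    """
    start = time.monotonic()
    while time.monotonic() - start < deadline_s:
        try:
            # 1 s connect timeout, like the probe's connection check.
            with socket.create_connection((host, port), timeout=1.0):
                return time.monotonic() - start
        except OSError:
            time.sleep(interval_s)
    return None


# Example (start it right after the catalog container starts, with
# port 50051 forwarded to localhost):
#   elapsed = seconds_until_listening("localhost", 50051, deadline_s=300)
#   print("not ready in time" if elapsed is None else f"ready after {elapsed:.1f}s")
```

Running it on a fast and a slow node gives comparable numbers to report alongside the node's OS and disk type.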