oracle / weblogic-kubernetes-operator

WebLogic Kubernetes Operator
https://oracle.github.io/weblogic-kubernetes-operator/
Universal Permissive License v1.0

Feature Request: startup probes #3423

Closed by belfo 9 months ago

belfo commented 2 years ago

It would be nice to have startup probes, alongside readinessProbe and livenessProbe. https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-startup-probes

rjeberhard commented 2 years ago

Hi @belfo,

Thanks for reaching out. We had considered adding a startup probe but hadn't thought that it was necessary because our current liveness probe is designed to succeed while a WebLogic Server instance is starting.

While the readiness probe is an HTTP probe that attempts to connect to the standard WebLogic "ReadyApp" endpoint, the liveness probe is instead a small script that is executed in the container.

This script validates that the node manager process is running and that the node manager is reporting a WebLogic Server state other than FAILED_NOT_RESTARTABLE. This means that the liveness probe will pass while the state is still STARTING.

Or, do you have a different intention such as wanting the liveness probe to fail if the server instance isn't in the Ready condition after some timeout?

belfo commented 2 years ago

In fact, I had to set a very large initialDelaySeconds on the liveness probe because it was killing the container before WebLogic had started (it was still in the operator part of the startup), so the container was restarting continuously. This was clearly caused by another issue in our cluster, which is too slow. A startup probe would at least avoid this, since the liveness probe would not kill the container while it is still starting.

My solution works (for me) but then it could take more time to be ready once the slowness is fixed.
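
For reference, a minimal sketch of how a startup probe is declared in a plain Kubernetes Pod spec. This is not an existing Domain resource field; the pod name, the image placeholder, the script path `/weblogic-operator/scripts/livenessProbe.sh`, and all timing values are assumptions chosen for illustration. The startup probe reuses the liveness check, and Kubernetes holds off the liveness probe until the startup probe has succeeded once.

```yaml
# Sketch only: a generic Kubernetes Pod, not something the operator generates.
apiVersion: v1
kind: Pod
metadata:
  name: startup-probe-sketch          # hypothetical name
spec:
  containers:
    - name: weblogic-server
      image: example.invalid/weblogic:placeholder   # placeholder image
      startupProbe:
        exec:
          command:
            - /weblogic-operator/scripts/livenessProbe.sh   # assumed script path
        periodSeconds: 30
        failureThreshold: 30          # allows up to 30 * 30s = 900s for startup
      livenessProbe:
        exec:
          command:
            - /weblogic-operator/scripts/livenessProbe.sh   # assumed script path
        periodSeconds: 45             # illustrative value
        failureThreshold: 1           # only takes effect after startup succeeds
```

With this shape, a slow start is tolerated for up to failureThreshold * periodSeconds, while a fast start is not penalized: the liveness probe takes over as soon as the startup probe first passes, with no fixed initial delay.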

rjeberhard commented 2 years ago

Do you happen to have any logs from the servers that were killed prior to your setting the large initialDelaySeconds?

belfo commented 2 years ago

I haven't kept the logs, but what I do have is the exit code:

Containers:
  weblogic-server:
    Container ID:  containerd://f2baeb6839bb88e66448d84dcc7baddf862801caa665f598542573ad61538d6a
    Image:         nexus.priv:9003/middleware/weblogic/wls:12.2.1.4.0.bcprov.vap.patch2
    Image ID:      nexus.itsmtaxud.priv:9003/middleware/weblogic/wls@sha256:f2af756c4df2358f27cb8f51c8469d83bea4594a6a5ec95be830d5d837f086ed
    Port:          7000/TCP
    Host Port:     0/TCP
    Command:
      /weblogic-operator/scripts/startServer.sh
    State:          Running
      Started:      Fri, 16 Sep 2022 10:21:06 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Fri, 16 Sep 2022 10:19:16 +0200
      Finished:     Fri, 16 Sep 2022 10:21:04 +0200
    Ready:          False
    Restart Count:  5

And the events:

17h  Normal   Started            pod/host-admin-server  Started container weblogic-server
84m  Warning  Unhealthy          pod/host-admin-server  Readiness probe failed: Get "http://192.168.2.32:7000/weblogic/ready": dial tcp 192.168.2.32:7000: connect: connection refused
17h  Warning  Unhealthy          pod/host-admin-server  Liveness probe failed: @[2022-09-15T16:21:18.419112699Z][livenessProbe.sh:77][SEVERE] WebLogic NodeManager process not found.
10h  Normal   Killing            pod/host-admin-server  Container weblogic-server failed liveness probe, will be restarted
17h  Warning  FailedPreStopHook  pod/host-admin-server  Exec lifecycle hook ([/weblogic-operator/scripts/stopServer.sh]) for Container "weblogic-server" in Pod "host-admin-server_dev(040c8163-7417-4bc4-ae2e-283d470e2707)" failed - error: command '/weblogic-operator/scripts/stopServer.sh' exited with 1: /weblogic-operator/scripts/stopServer.sh: line 22: /u01/domains/base_domain/servers/admin-server/logs/admin-server.stop.out: No such file or directory

And the pod log showed output up to [FINE] Exiting encrypt_decrypt_domain_secret

By setting the following high values, I was able to start the server. I haven't tested any fine-tuning; I just put in some big values to be sure, but it took around 5 minutes for WebLogic to start.

serverPod:
  livenessProbe:
    initialDelaySeconds: 900
    periodSeconds: 120
    timeoutSeconds: 60
    failureThreshold: 5
  readinessProbe:
    initialDelaySeconds: 300
    periodSeconds: 90
    timeoutSeconds: 60
    failureThreshold: 5

tbarnes-us commented 2 years ago

It is unusual to see '[livenessProbe.sh:77][SEVERE] WebLogic NodeManager process not found'. The NM is started very soon after the pod starts and before WebLogic Server itself is started, and we (or at least I) have yet to see an example of it crashing. It could be that the pod is somehow taking a very long time to get to the point where it starts an NM, or that the exit information ^^^ is misleadingly reflecting what's happening as the pod is forced to shut down due to the timeout (which I assume would in turn bring down the NM).

I agree with @rjeberhard that a pod log would be helpful here. I think even a log from a successful run would help, as that should help reveal the timings of the pod's startup activity.

belfo commented 2 years ago

Hello @tbarnes-us. Indeed it's unusual; it was related to resource availability on the underlying K8s nodes. Once the nodes got more resources (there was a logical limitation on the vCenter), all was good.

But the idea of having startup probes still makes sense from my point of view. Worst case they are useless; best case they can at least prevent the pod from restarting when it isn't needed.

tbarnes-us commented 2 years ago

A reproducer pod log would help evaluate the idea - e.g. whether the startup probe would help in the first place, and, if so, how the startup probe would need to be coded...

Overall, it is strongly recommended to tune the pod's CPU and memory requests so that no attempt is made to start a server without a sufficient amount of those two resources (see the FAQ). Do you know what the 'logical limitation on the vcenter' was?
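
As a rough illustration of that recommendation, here is a minimal sketch of CPU and memory settings on a Domain resource. It assumes that `serverPod.resources` accepts the standard Kubernetes `requests` and `limits` structure, and the values are placeholders rather than sizing advice.

```yaml
# Sketch only: illustrative values, to be sized from your own measurements.
serverPod:
  resources:
    requests:
      cpu: "500m"       # placeholder: minimum CPU the scheduler must find on a node
      memory: "1Gi"     # placeholder: minimum memory the scheduler must find on a node
    limits:
      memory: "2Gi"     # placeholder: upper bound on the container's memory
```

With requests set, the scheduler places the pod only on a node that has at least that much unreserved CPU and memory, so a server is not started on a node that is already oversubscribed.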

belfo commented 2 years ago

I no longer have the issue, so it's hard to reproduce. But WebLogic had not yet started (the same check as the liveness probe would probably be enough?), and the advantage is that the probe would not kill the pod until it's marked as started. The limitation was around 20GB ram & GHz for all the nodes... we have 8 nodes. So clearly not enough.

tbarnes-us commented 2 years ago

@rjeberhard @belfo

I recommend closing this Issue because the root problem (lack of allocated resources) has been fixed, and, in my opinion, a wide variety of failures and retries are expected when too few resources are allocated. It can be revisited if/when the problem is reproduced with sufficient data for a full diagnosis (pod logs, etc).

Thoughts?