Closed: bric3 closed this issue 3 years ago
We discussed this today and have a few ideas that we'd like to explore. In the meantime, I don't think you need to worry about the warning in the documentation. If Istio is monitoring the success rates for requests hitting the service, it is mitigating the risk that the warning describes.
Hi, thank you for taking this into consideration.
If Istio is monitoring the success rates for requests hitting the service, it is mitigating the risk that the warning describes.
Indeed, the monitoring by Istio certainly mitigates this warning, yet the lurking issue is that Istio allows real requests while the application's main web infrastructure is not ready or live. A retry policy could help, but that is probably not what we want for non-GET/HEAD requests. Imagine a configuration change that leads to more requests waiting on IO; this may lead to (Tomcat) connector saturation issues.
Irrespective of the management port that's being used, the readiness probe won't report that the application is ready to handle traffic until it really is ready. As long as there's no fundamental problem with your application endpoints, Istio's monitoring and the liveness and readiness probes should give you everything that you need.
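For reference, a minimal sketch of how that typically looks on the Kubernetes side, assuming the default Spring Boot probe paths on the main server port and an application listening on 8080 (container name, port, and timings are placeholders):

    # fragment of a Deployment's pod template (hypothetical values)
    containers:
      - name: my-app
        ports:
          - containerPort: 8080
        livenessProbe:
          httpGet:
            path: /actuator/health/liveness
            port: 8080
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /actuator/health/readiness
            port: 8080
          periodSeconds: 10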
Just to weigh in, the following statement may not hold up well when the process allows deploying often, like multiple times a day.
As long as there's no fundamental problem with your application endpoints, Istio's monitoring and the liveness and readiness probes should give you everything that you need.
Typically, we got caught out by an incorrect configuration change that was deployed too early (before the correct Docker image was deployed), resulting in many unsatisfied requests waiting on the misconfigured third-party dependency, which led to saturation of the main application's connector. The problem would have been detected earlier if the liveness probe had failed at that time.
EDIT: Another pod, another story. This time, for some unknown reason, a single pod went wrong. The New Relic agent failed to instrument our JAX-RS/Jersey servlet properly, so the servlet couldn't handle traffic, yet the health probe was reporting a 200 OK status because it was exposed on a different servlet.
Reopening to remind us to update the release notes.
Hi
as the document says:
For this reason, it is a good idea to make the liveness and readiness health groups available on the main server port. This can be done by setting the following property:
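(The property the quoted passage refers to is presumably the one mentioned later in this thread:)

    management.endpoint.health.probes.add-additional-paths=true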
Does the document have some typos, and would it be better to have an independent port?
@wilkinsona @mbhave
@meiyese I'm not sure I understand what you're asking, but if it's a general question then posting the question on stackoverflow.com would be better.
@meiyese I'm not sure I understand what you're asking, but if it's a general question then posting the question on stackoverflow.com would be better.
Hi @philwebb, I meant to ask whether it should say:
"For this reason, it is a good idea to make the liveness and readiness health groups available on a separate server port."
@meiyese The current documentation is correct. It's saying that even if you use a different management port, it's best to set management.endpoint.health.probes.add-additional-paths=true so that they are also available on the main application port.
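A minimal sketch of that combination, assuming Spring Boot 2.6 or later and a management port of 8081 (the port number is only an example); if I read the property correctly, the probes then also appear at /livez and /readyz on the main application port:

    # application.properties
    # actuator endpoints move to a separate management port
    management.server.port=8081
    # but the liveness/readiness groups stay reachable on the main
    # server port, which uses the same web infrastructure as real traffic
    management.endpoint.health.probes.add-additional-paths=true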
Motivation
In a Kubernetes production environment with Istio and Prometheus metrics, we expose the metrics at /actuator/prometheus.
We noticed that the ServiceMonitor starts trying to fetch the Prometheus metrics very early, before the application is ready; this results in noticeable 503s during the rollout.
Since scraping the metrics is a separate concern from serving application traffic, we wanted to expose them on a different port in order to exclude that port from Istio.
This is possible, as documented here: https://docs.spring.io/spring-boot/docs/current/reference/html/production-ready-features.html#production-ready-customizing-management-server-port
There's always the possibility of changing the management (actuator) port, but the documentation actually warns against trusting the health endpoints with this setup: https://docs.spring.io/spring-boot/docs/current/reference/html/production-ready-features.html#production-ready-kubernetes-probes
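For completeness, a hedged sketch of the setup we had in mind, assuming the Istio sidecar annotation for excluding inbound ports is the right mechanism for our mesh (the port number is only an example):

    # application.properties: move all actuator endpoints, including
    # /actuator/prometheus, off the main server port
    management.server.port=8081

    # pod template annotation: let the ServiceMonitor scrape the
    # management port without going through the Istio sidecar
    traffic.sidecar.istio.io/excludeInboundPorts: "8081"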
Suggestion
Could it be worth distinguishing two classes of actuators?