spring-projects / spring-boot

Spring Boot helps you to create Spring-powered, production-grade applications and services with absolute minimum fuss.
https://spring.io/projects/spring-boot
Apache License 2.0

Allow health groups to be configured at an additional path #25471

Closed bric3 closed 3 years ago

bric3 commented 3 years ago

Motivation

We run a Kubernetes production environment with Istio and Prometheus metrics.

We noticed that the ServiceMonitor starts trying to scrape Prometheus metrics very early, before the application is ready, which results in noticeable 503s during the rollout.

Since scraping metrics is a separate concern from serving application traffic, we wanted to expose them on a different port so that this port can be excluded from Istio.

    annotations:
      # Stop Istio from listening on management.server.port 8081
      traffic.sidecar.istio.io/excludeInboundPorts: "8081"

This is possible as documented here: https://docs.spring.io/spring-boot/docs/current/reference/html/production-ready-features.html#production-ready-customizing-management-server-port
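
For reference, a minimal sketch of the properties involved, assuming the management port 8081 from the Istio annotation above and the Micrometer Prometheus registry on the classpath:

    # application.properties (sketch)
    # Move the actuator endpoints to a dedicated management port (matches the excluded Istio port)
    management.server.port=8081
    # Expose the health and Prometheus scrape endpoints over the management port
    management.endpoints.web.exposure.include=health,prometheus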

It is always possible to change the management (actuator) port, but the documentation actually warns against trusting the health endpoints with this setup: https://docs.spring.io/spring-boot/docs/current/reference/html/production-ready-features.html#production-ready-kubernetes-probes

If your Actuator endpoints are deployed on a separate management context, be aware that endpoints are then not using the same web infrastructure (port, connection pools, framework components) as the main application. In this case, a probe check could be successful even if the main application does not work properly (for example, it cannot accept new connections).

Suggestion

Could it be worth distinguishing two classes of actuator endpoints?

  1. The ones that represent the health of the service, exposed on the same port, connection pool, and framework components as the service itself.
  2. The others that provide additional features, such as metrics, and can be declared to use a different web infrastructure without the health endpoints (see the sketch after this list).
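
To make the split concrete, a hedged sketch in Spring Boot properties form, using the per-group additional-path support that was eventually added for this purpose (the property names come from the final feature and were not available at the time of this request):

    # application.properties (sketch)
    # Class 2: metrics and other actuator endpoints on a separate management port
    management.server.port=8081
    # Class 1: liveness/readiness health groups also served by the main application port
    management.endpoint.health.group.liveness.additional-path=server:/livez
    management.endpoint.health.group.readiness.additional-path=server:/readyz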
wilkinsona commented 3 years ago

We discussed this today and have a few ideas that we'd like to explore. In the meantime, I don't think you need to worry about the warning in the documentation. If Istio is monitoring the success rates for requests hitting the service, it is mitigating the risk that the warning describes.

bric3 commented 3 years ago

Hi, thank you for taking this into consideration.

If Istio is monitoring the success rates for requests hitting the service, it is mitigating the risk that the warning describes.

Indeed, Istio's monitoring certainly mitigates this warning, yet the lurking issue is that Istio may allow real requests while the application's main web infrastructure is not ready or live. A retry policy could help, but that is probably not what we want for non-GET/HEAD requests. Imagine a configuration change that leads to more requests waiting on I/O; this may cause (Tomcat) connector saturation issues.

wilkinsona commented 3 years ago

Irrespective of the management port that's being used, the readiness probe won't report that the application is ready to handle traffic until it really is ready. As long as there's no fundamental problem with your application endpoints, Istio's monitoring and the liveness and readiness probes should give you everything that you need.
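
For context, a minimal sketch of how the Kubernetes probes are typically pointed at the Spring Boot probe endpoints; the port and timings below are assumptions, and the paths assume the health endpoint is served on the main server port:

    # deployment.yaml (sketch)
    livenessProbe:
      httpGet:
        path: /actuator/health/liveness
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 5
    readinessProbe:
      httpGet:
        path: /actuator/health/readiness
        port: 8080
      periodSeconds: 5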

bric3 commented 3 years ago

Just to weigh in, the following statement may not hold well when the process allows frequent deployments, such as multiple times a day.

As long as there's no fundamental problem with your application endpoints, Istio's monitoring and the liveness and readiness probes should give you everything that you need.

Typically, we got caught by an incorrect configuration change that was deployed too early (before the correct Docker image was deployed), resulting in many unsatisfied requests waiting on the misconfigured third-party dependency; this led to saturation on the main application's connector. The problem would have been detected earlier if the liveness probe had failed at that point.

EDIT: Another pod, another story. This time, for some unknown reason, a single pod went wrong: the New Relic agent failed to properly instrument our JAX-RS/Jersey servlet, so that servlet couldn't handle traffic, yet the health probe reported a 200 OK status because it was exposed on a different servlet.

wilkinsona commented 3 years ago

Reopening to remind us to update the release notes.

meiyese commented 11 months ago

Hi

As the documentation says:

For this reason, it is a good idea to make the liveness and readiness health groups available on the main server port. This can be done by setting the following property:

Does the documentation have a typo, and would it be better to use an independent port?

@wilkinsona @mbhave

philwebb commented 11 months ago

@meiyese I'm not sure I understand what you're asking, but if it's a general question then posting the question on stackoverflow.com would be better.

meiyese commented 11 months ago

@meiyese I'm not sure I understand what you're asking, but if it's a general question then posting the question on stackoverflow.com would be better.

Hi @philwebb, I was asking whether it should instead read:

"For this reason, it is a good idea to make the liveness and readiness health groups available on a separate server port."

philwebb commented 11 months ago

@meiyese The current documentation is correct. It's saying that even if you use a different management port it's best to set management.endpoint.health.probes.add-additional-paths=true so that they are also available on the main application port.
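
For readers landing here, a minimal sketch of the setup described above; the management port value is an assumption, and with additional paths enabled the probes become available at /livez and /readyz on the main server port:

    # application.properties (sketch)
    # Actuator endpoints (metrics, etc.) stay on a dedicated management port
    management.server.port=8081
    # Also expose the liveness and readiness health groups on the main application port
    management.endpoint.health.probes.add-additional-paths=true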