jaredpetersen commented 3 years ago

I'm using the Actuator project to display health statuses for downstream services. My application is utilizing an optional downstream service; the service is used all the time normally but we have fallbacks in place if it goes down for whatever reason.

Optional services are not supported at this time and I'd like to request that we change that. The problem is that the HealthIndicator for the optional service will show the downstream service as being down, that DOWN status will be aggregated to the overall service status, and the health check will fail because of a non-200 status code. This is not desirable. Optional downstream services should show up as "DOWN" but not have that status aggregated.

This enhancement has been requested in the past a couple of times over the years:

The general recommendation was to build a custom HealthAggregator. HealthAggregator has been deprecated so the replacement recommendation is a StatusAggregator + Health Groups. This does not work.

Here's a StatusAggregator that I built for a health group named "optional":

@Qualifier("optional")
@Component
public class OptionalStatusAggregator implements StatusAggregator {
    @Override
    public Status getAggregateStatus(Set<Status> statuses) {
        return Status.UP;
    }
}

And here's my actuator configuration:

management:
  endpoint:
    health:
      show-details: always
      group:
        optional:
          include:
            - someservice
  endpoints:
    web:
      base-path: /
      exposure:
        include:
          - info
          - health

The intent is to always show "UP" for the optional health group since none of these are critical for the overall system health. The components in the group will still have their individual statuses but they are to be ignored once aggregated.

If you hit the health group's health endpoint (localhost:8080/health/optional), the aggregation works as intended. However, my custom status aggregator is not used for the overall health endpoint (localhost:8080/health) so the "DOWN" status is propagated up to the service health.

The only solution I can see is to create a StatusAggregator that's used for everything. However, the StatusAggregator does not provide where the statuses came from, only the raw Status values of "UP", "DOWN", etc. in the form of a Set. There is no way to have the aggregator ignore the statuses of some downstream services because there's no context. It was proposed that this be fixed in https://github.com/spring-projects/spring-boot/issues/23759 but the team decided not to extend the API of the StatusAggregator.

The recommendation was to instead create a custom status called "PARTIAL_DOWN" and create a status aggregator to handle that. The problem with this recommendation is that the downstream service that is actually down will show up with the status of "PARTIAL_DOWN" which is not true and the overall status will be "PARTIAL_DOWN" (kind of true) or "UP" (true) depending on how the StatusAggregator is implemented.

I'd like you to reconsider adding the component context information into StatusAggregator as an enhancement. This information is needed in order to do any sort of meaningful status aggregation. In the current state, the existing SimpleStatusAggregator is effectively the only possible way to aggregate statuses unless you choose to hardcode the status like I did in my example (which is not useful for the overall health endpoint status aggregation). At this time, I will not be able to add my optional downstream services to the health endpoint and there will unfortunately not be any visibility into how they are doing from the application's perspective.

bclozel commented 3 years ago

Indeed, this has been asked a few times and so far the team pointed at a custom HealthAggregator or health groups as the preferred solutions for this.

I understand this is still a valid concern, so let's try to take a step back. The main problem here is that:

we want custom health indicators to give us information about the broader environment, such as remote services availability, rate-limiting for API tokens, or just the health of services that are not vital to the application (since we might have sensible fallbacks for them)
an indicator being down doesn't mean that the entire application should be considered as unavailable. Again, the application can have fallbacks for those.
there should be a difference between "the application has a problem" and "the application is really broken and needs to be restarted". If we don't make that difference, a 3rd party system having issues might take down all services depending on it.
there should also be a difference between "the application is not available" and "the application is not ready to serve traffic". When applications need to perform long-running tasks at startup before serving traffic, it can be challenging to configure the health checks to adapt to that while not configuring super long timeouts overall.
platform health checks are usually pointing at /actuator/health - so if something fails, the whole application is considered as failing and the platform takes action (usually restarting the app, raising notifications, etc)

Since #19593, Spring Boot ships support for liveness and readiness probes. I think they're solving all issues listed above. Once enabled, developers can point the health checks to /actuator/health/liveness and /actuator/health/readiness for optimal platform checks. We can still look at the global /actuator/health for more information.

22825 will make those health groups enabled by default and I think this approach makes more sense in the long run.

Would this approach solve your current use case? If not, could you explain what's missing?

jaredpetersen commented 3 years ago

@bclozel Thanks for your quick response!

Points 1 - 3 and 5 are a correct assessment of the problem. 4 is not really relevant because if it's an optional downstream service with appropriate fallbacks, the main service will always be ready for traffic.

I took a look at the liveness and readiness probes a bit more in detail and used this as my application config to get a grasp on it:

management:
  endpoint:
    health:
      show-details: always
      probes:
        enabled: true
      group:
        readiness:
          include:
            - db
            - redis
        liveness:
          include:
            - db
            - redis
  health:
    livenessstate:
      enabled: true
    readinessstate:
      enabled: true

It looks like you need the services you actually want to affect the probes explicitly defined in the readiness/liveness groups. So since we have a fallback for our optional service, we don't want that to be included in either group.

This looks like it solves the problem of automated health checks since the readiness/liveness probes only show the components that are needed to make the application run. If you want greater details of all of the individual components, humans can look at the regular health endpoint.

However, I think that this is a bit of a bandaid. The greater health endpoint shows the statuses of everything but lies to you (the human) by saying that the overall application is down. I'd say that the problem is only partially solved until the StatusAggregator interface is adjusted to provide the component name context.

bclozel commented 3 years ago

@jaredpetersen thanks for your reply!

Point 4) is still key - in your case, how is the platform responding to an /actuator/health endpoint being "DOWN" in general? Also, how does the platform decides that an application instance is up and can process traffic? What happens if an application is available but takes time to process startup tasks (like ApplicationRunner tasks)?

The issue here is that the "DOWN" state for an application is subjective. For sure, it's down if you can't even connect to it. But with fallbacks, optional services, or health indicators used to track the state of external systems - there are a lot of shades to consider.

We decided in the past to not extend this part because we felt like it would introduce more complexity in the design and not really solve the core problem: the health of an application is a matter of perspective. We thought that creating several specialized views with a simple status aggregation logic (health groups) would be easier to reason about.

Now that liveness and readiness are part of the picture, we need to decide whether health groups are the best way to go or if we should improve the contracts here.

I'm marking this issue for team attention and we'll discuss this. Thanks!

jaredpetersen commented 3 years ago

@bclozel

Point 4) is still key - in your case, how is the platform responding to an /actuator/health endpoint being "DOWN" in general? Also, how does the platform decides that an application instance is up and can process traffic? What happens if an application is available but takes time to process startup tasks (like ApplicationRunner tasks)?

Sure, this is relevant generally speaking. You want your application manager (Kubernetes, etc.) to know that your service is up and ready to receive traffic. But in the context of downstream dependencies with suitable fallbacks, these indicators are always going to show healthy so that your application isn't killed off. In that case, we just want some way for a human to see from the application's perspective that performance is degraded.

The issue here is that the "DOWN" state for an application is subjective. For sure, it's down if you can't even connect to it. But with fallbacks, optional services, or health indicators used to track the state of external systems - there are a lot of shades to consider.

100%. I view "DOWN" as the application is basically dead, please send help. And even with these liveness and readiness checks in place which can be set to ignore the health of these services with fallbacks, the overall health endpoint still shows that something is very broken.

We decided in the past to not extend this part because we felt like it would introduce more complexity in the design and not really solve the core problem: the health of an application is a matter of perspective. We thought that creating several specialized views with a simple status aggregation logic (health groups) would be easier to reason about.

I definitely understand where you're coming from here. But it's more confusing when an "optional" service makes the entire application health look like it's totally broken.

Ideally, there would be a simple config where you can pass which checks are optional and should not be aggregated. In that case, no one would need to override the StatusAggregator at all to remove these optional services from aggregation. This isn't a totally off-the-wall problem so having it built right into the configuration would be very useful. But I would settle for an enhanced contract so that I can do my thing without impacting others in a way they may find confusing.

I'm marking this issue for team attention and we'll discuss this. Thanks!

Thank you so much for your help on this and your careful, nuanced thoughts on the matter. Looking forward to hearing the results of your discussion 😄

jaredpetersen commented 3 years ago

I see that the label changed -- what was the result of the meeting?

wilkinsona commented 3 years ago

We need to do some design work to figure out what the next steps should be.

rellikt commented 3 years ago

Is there any progress? Is there a chance that this will be implemented in one of the upcoming releases?

wilkinsona commented 3 years ago

@rellikt We don't have any firm plans at the moment and have not yet had a chance to do the necessary design work.

FinalFrag commented 2 years ago

Are there any updates on this topic? I'm looking to set up exactly what is being discussed here.

Is there a workaround to get a Map of the name and the status in the StatusAggregator?

wilkinsona commented 2 years ago

I'm afraid not. Updates will be added to this issue when we have them.

wimdeblauwe commented 2 years ago

I just created https://stackoverflow.com/questions/72854464/exclude-mail-health-indicator-from-default-actuator-health-endpoint, but now I found this issue that seems to be related.

In my example, I don't want /actuator/health to return DOWN if the mail server is down, otherwise Azure will think my application is down and will try to restart it.

I guess for now the only workaround would be to create a new group (e.g. withoutmail), exclude mail and point Azure to that group (/actuator/health/withoutmail)?

FinalFrag commented 2 years ago

Mine was a very similar situation. In the end we decided to put all mandatory dependencies in the health check and move all optional dependencies to the info section by implementing an InfoContributor for each dependency.

imosapatryk commented 1 year ago

Any news about the feature being introduced? I was trying to aggregate the statuses for composite contributors with specific behaviour for specific composite. Passing the name of the composite to the status aggregator would be really enough, especially when the name is carried by the aggregate method.

wilkinsona commented 1 year ago

@imosapatryk please see https://github.com/spring-projects/spring-boot/issues/25151#issuecomment-1160635065.

imosapatryk commented 1 year ago

I understand but it's been almost a year and still nothing on the horizon. Would be really nice get to know if at least you are planning to introduce this change.

wilkinsona commented 1 year ago

If we had no plans to introduce the change, the issue would have been closed. The issue is in the general backlog milestone which is described as an "unordered set of issues that we want to solve in a future release". That's an accurate description of the state of this issue. It's an enhancement that we'd like to make but we have no specific plans for the timing of the change.

ferrarimartin commented 1 year ago

Hello, just to add my use case for this, I was hoping to use /actuator/health as both a readiness probe and a source for firing alerts. So, if redis is down, the service would still show as ready, but there could be alerts fired stating that somethins is wrong with redis, and maybe alerts about the service being degraded. That would be really natural to think about and implement, I believe. It can be done with health groups, but it involves more work declaring the groups, and the /actuator/health endpoint would still show a state that's not real (the whole service down if there's an optional component down). Thanks!!

voliveirajr commented 1 year ago

Checking if there is any expectation of this getting implemented soon and what would be the suggestion for a workaround till we get it. In my situation we have a service that supports multiple "flows", the health status of it is relevant for different clients, Client 1 uses flows A and B, Client 2 flows A and C. The main problem for us is even using the agregators to compose it any unhealthy flow will set the whole application down. Suggestions?

bclozel commented 1 year ago

@voliveirajr I think your flows use case points to health groups, which is already implemented in Spring Boot. You can point each client to a different health group. I don't see how this particular issue would solve your problem, as clients would have no way to select which "flow" they're getting when they're querying the root health endpoint.

bclozel commented 1 year ago

Since this issue was created, I think many platforms are now considering the liveness and readiness aspects of an application and use separate endpoints. The root endpoint and aggregated status doesn't reflect the "overall status of the application", but rather the aggregated status of all configured indicators.

I think we should now close this issue; the status aggregator route introduces complex logic where this preference should be configured in a declarative way by creating health groups.

jaredpetersen commented 1 year ago

The focus here is on services that have fallbacks and wanting a way to cleanly get statuses for these aggregated up properly. Services/features that have a fallback are called "optional" here; the application works, but maybe in a deteriorated state. You don't want it to fail the whole health check yet. If the fallback fails (everything in the group fails), then you want to fail the health check. So the specific request is for a more nuanced take on health checks and status aggregation.

I'll admit that this could have been more effectively distilled back in 2021, but you live and learn 🙂.

Readiness and liveness aren't a replacement here. That's something else entirely, and it already existed at the time. You don't want a service/feature with a functioning fallback to fail either probes as that will get your application killed.

Health groups were never a solution either, and also existed at the time. If anything in the health group fails, the whole group fails. There's no nuance, no support for optionality.

The solutions seems to be:

Native support for optionality
Custom status aggregation written by the developer, as referenced previously
Custom status check written by the developer that handles that fallback nuance internally -- has the downside of only being able to expose UP/DOWN externally

If the organization wants to reject it on the basis of introducing too much complexity or not having enough value, that's okay! There are other options here aside from native support. But this isn't something that is solved by the framework today without some extra work.

sheetalj2205 commented 8 months ago

Hey @jaredpetersen @bclozel is there any update on the aggregate status of the service? I also want to implement this, I don't want the aggregate status to be DOWN if any of the optional services is DOWN.

sheetalj2205 commented 7 months ago

Mine was a very similar situation. In the end we decided to put all mandatory dependencies in the health check and move all optional dependencies to the info section by implementing an InfoContributor for each dependency.

Hey, @FinalFrag How you're configuring this in yaml file so it will also get triggered after every 15 seconds like /health?

spring-projects / spring-boot

Actuator - support optional downstream services #25151

22825 will make those health groups enabled by default and I think this approach makes more sense in the long run.