spring-projects / spring-boot

Spring Boot
https://spring.io/projects/spring-boot
Apache License 2.0

Being able to scrape Prometheus metrics during graceful shutdown from management endpoints #41002

Open joshiste opened 4 months ago

joshiste commented 4 months ago

I'll try to describe our use case and the problem we have:

We're using Prometheus to scrape metrics. We set server.port=8080 and management.server.port=9090 (hence a second HTTP server is used). Stopping the application gracefully can take a while, since the app has long-running processes that we wait on. While waiting on these, we want the default server to be shutting down but the management server to stay up, so that we can still scrape the metrics. Currently, the management server is started after and stopped before the default server, which prevents this. The phases/order for the servers cannot be changed in any way.

I fully acknowledge that the current order exists so that the health endpoints are not served before the default server is up, and that, as discussed in #31714, the phases must be configured carefully and are easy to get wrong. Still, I'd love to have some way to change the order (e.g. by subclassing).

wilkinsona commented 4 months ago

This isn't really related to the lifecycle phases, as they're not involved in closing the management context. That is done by org.springframework.boot.actuate.autoconfigure.web.server.ChildManagementContextInitializer.CloseManagementContextListener in response to the parent context's ContextClosedEvent.
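To make the ordering concrete, here is a minimal plain-Java model of an event-driven close. It is not Spring code: `ParentContext`, `onContextClosed`, and `onLifecycleStop` are hypothetical names. It also assumes that the close event is delivered before lifecycle beans are stopped, which is consistent with the management server going down before the default server as reported above.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of an event-driven close; all names are hypothetical. */
class EventDrivenCloseSketch {

    /** Stand-in for the parent application context. */
    static class ParentContext {
        private final List<Runnable> closeListeners = new ArrayList<>();
        private final List<Runnable> lifecycleStops = new ArrayList<>();

        void onContextClosed(Runnable listener) {
            closeListeners.add(listener);
        }

        void onLifecycleStop(Runnable stopAction) {
            lifecycleStops.add(stopAction);
        }

        void close() {
            // Assumption: the close event is delivered to listeners first...
            closeListeners.forEach(Runnable::run);
            // ...and lifecycle beans (such as the main web server) stop afterwards.
            lifecycleStops.forEach(Runnable::run);
        }
    }

    /** Closes the model context and returns the order in which things shut down. */
    static List<String> closeAndRecord() {
        List<String> log = new ArrayList<>();
        ParentContext parent = new ParentContext();
        // Registration order does not matter; the event always runs first.
        parent.onLifecycleStop(() -> log.add("main web server: graceful shutdown"));
        parent.onContextClosed(() -> log.add("management context: closed"));
        parent.close();
        return log;
    }

    public static void main(String[] args) {
        closeAndRecord().forEach(System.out::println);
    }
}
```

In this model the management context is always closed first, no matter which phase the web server's lifecycle uses, because the listener never takes part in phase ordering.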

Unfortunately, I think it will be quite difficult to allow the ordering to be changed, as we'd have to move away from using the ContextClosedEvent to close the management context. A Lifecycle or SmartLifecycle would seem like an obvious choice, as the phase could then be configured, but the application context does not expose the state of its closed flag, so I don't think it would be possible for us to distinguish between a stop() call that should just stop() the management context and a stop() call that should close() it.

wilkinsona commented 3 months ago

I've opened https://github.com/spring-projects/spring-framework/issues/33058 to see if Framework could make the application context's close state accessible to us.

jonatan-ivanov commented 3 months ago

I think one alternative solution to this could be using Prometheus RSocket Proxy (but you need to deploy an extra component in your infrastructure).

In the use case above, if Prometheus does not scrape while the long-running process is running, if one or more of the scrapes fail, or if Prometheus does not scrape often enough, I think you can end up in a similar situation even if the management endpoint is still able to accept traffic.

With the Prometheus RSocket Proxy, the Proxy can scrape the app and the app can also send data to the Proxy (which is scraped by Prometheus later). So if the ordering is right, your app can send the latest data to the Proxy right before the process stops (after your long-running process has finished its job).

wilkinsona commented 3 months ago

Framework 6.2 now provides an isClosed() accessor backed by its closed flag. That means that we may be able to rework things here so that the separate management context is closed as part of a lifecycle implementation rather than in response to the ContextClosedEvent. We can investigate further once we've created the 3.3.x branch and main has upgraded to Framework 6.2.0-M5 or its snapshots.

wilkinsona commented 2 days ago

isClosed() seems to give us what we need: https://github.com/wilkinsona/spring-boot/tree/gh-41002. With these changes, the management context is stopped when the main app context is stopped and it's closed when the main app context is closed. The phase is such that the management web server doesn't start to shut down until the main app's web server has completed its (potentially graceful) shutdown.
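The behaviour described here can be sketched in plain Java. This is a simplified model under stated assumptions, not the code in the linked branch: `PhasedLifecycle`, `ParentContext`, and the phase values 0 and 1 are hypothetical. It assumes that lifecycle components are stopped from the highest phase to the lowest, and that the closed flag is already set by the time they are stopped.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of a phase-driven stop/close; all names are hypothetical. */
class ManagementLifecycleSketch {

    /** Minimal stand-in for a phased lifecycle component. */
    interface PhasedLifecycle {
        int phase();
        void stop();
    }

    /** Stand-in for the parent context, exposing its closed flag. */
    static class ParentContext {
        private final List<PhasedLifecycle> lifecycles = new ArrayList<>();
        private boolean closed;

        void register(PhasedLifecycle lifecycle) {
            lifecycles.add(lifecycle);
        }

        boolean isClosed() {
            return closed;
        }

        void stop() {
            stopLifecycles();
        }

        void close() {
            // Assumption: the flag is set before lifecycle components are stopped.
            closed = true;
            stopLifecycles();
        }

        private void stopLifecycles() {
            List<PhasedLifecycle> ordered = new ArrayList<>(lifecycles);
            // Assumption: the highest phase stops first, the lowest phase last.
            ordered.sort((a, b) -> Integer.compare(b.phase(), a.phase()));
            ordered.forEach(PhasedLifecycle::stop);
        }
    }

    /** Stops or closes the model context and returns the recorded order. */
    static List<String> run(boolean close) {
        List<String> log = new ArrayList<>();
        ParentContext parent = new ParentContext();
        // Management lifecycle: lower (hypothetical) phase, so it stops last.
        parent.register(new PhasedLifecycle() {
            public int phase() { return 0; }
            public void stop() {
                // The closed flag tells a plain stop apart from a close.
                log.add(parent.isClosed() ? "management context closed"
                                          : "management context stopped");
            }
        });
        // Main web server: higher (hypothetical) phase, so it stops first.
        parent.register(new PhasedLifecycle() {
            public int phase() { return 1; }
            public void stop() { log.add("main web server stopped"); }
        });
        if (close) {
            parent.close();
        } else {
            parent.stop();
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run(false));
        System.out.println(run(true));
    }
}
```

In both runs the main web server entry comes first, so in this model the management side would stay reachable for scraping until the main server has finished shutting down.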