micrometer-metrics / micrometer

An application observability facade for the most popular observability tools. Think SLF4J, but for observability.
https://micrometer.io
Apache License 2.0
4.48k stars 990 forks source link

Support for cancelling an Observation #4061

Open mhalbritter opened 1 year ago

mhalbritter commented 1 year ago

Hello,

The use case is this issue on the Boot tracker: https://github.com/spring-projects/spring-boot/issues/34801 - TLDR: We don't want metrics for the actuator endpoints.

The way the metrics for the actuator endpoint get created is: ServerObservationFilter -> DispatcherServlet -> MVC Handler mapping -> Spring Boot actuator infrastructure -> actuator method

We tried to solve this by using an ObservationPredicate and then extract the URL from the observation context, and then check if this is an actuator endpoint by using string matching on the path (see https://github.com/spring-projects/spring-boot/pull/36455).

This logic is brittle and we're not happy with that. We already have logic in place where we know exactly that an actuator endpoint has been hit: when handling it in the Spring Boot actuator infrastructure. What we like to have is a way to cancel an already started observation.

Ideally, there would be some way to signal that this observation is cancelled, so that it doesn't have any effects (don't create timers, cancel running ones, etc.).

With that in place, we could add a feature to the Spring Boot actuator infrastructure that would take the current observation and cancel it. We wouldn't have to fiddle around with context extraction, string matching and so forth.

We should also think about how that cancellation propagates to the parent observations.

What do you think?

jonatan-ivanov commented 1 year ago

Hi 😃

We don't want metrics for the actuator endpoints

I'm not sure it matters a lot but I think the issue on the Boot tracker is slightly different: it is not only about not having metrics for the actuator endpoints but not having Observations.

This logic is brittle and we're not happy with that.

FWIW, as the author of that PR, I'm not happy with that logic either. :D

We already have logic in place where we know exactly that an actuator endpoint has been hit: when handling it in the Spring Boot actuator infrastructure. What we like to have is a way to cancel an already started observation.

I guess this means that the logic you mentioned can only detect that an actuator endpoint was hit after the Observation is started (after the filter). I assume it is not possible to move that logic to the filter and skipping the creation of the Observation in the first place.

What we like to have is a way to cancel an already started observation.

I need to think this through but right now I think it means asking for a lot of trouble and making things very hard (impossible?) for handler implementations (potentially for users).

Right now if you start/stop an Observation, the corresponding handler methods (onStart/onStop) are immediately called. This means that every handler need to handle the situation when onStart is already called and then the Observation is cancelled. I think it is not safe to assume that this is always possible, here are a few problematic examples:

mhalbritter commented 1 year ago

Hey Jonatan,

I'm not sure it matters a lot but I think the issue on the Boot tracker is slightly different: it is not only about not having metrics for the actuator endpoints but not having Observations.

Yes, you're right. It's not only about metrics.

I guess this means that the logic you mentioned can only detect that an actuator endpoint was hit after the Observation is started (after the filter). I assume it is not possible to move that logic to the filter and skipping the creation of the Observation in the first place.

Yes, you're right. The thing which creates the Observations is the ServerHttpObservationFilter from the Spring Framework, which runs for all endpoints and it creates the Observation before we know that it's an actuator endpoint.

For the cancelling propagation: If there would be such a feature, I think the API should provide methods for both only cancelling the current observation and one for also propagating the cancellation to the parent. In our actuator case, we would need a propagating cancellation: it should not only cancel the current observation, but also the one maybe created by spring security in their filter, etc. etc.

I totally agree that cancellation is not always possible without effects. For example if I start an observation, call a remote service and then cancel the Observation. It's not possible to cancel the propagation of the trace to the remote service and there's no way to notify the remote service that it has been cancelled afterwards.

Maybe the cancellation can only be provided on a best-effort basis.

I didn't realize that the DefaultMeterObservationHandler creates LongTaskTimer which already report when onStart is called - this could make the whole cancellation idea fail for our actuator problem.

ttddyy commented 1 year ago

Hi :)

How about rather ServerHttpObservationFilter to take a predicate to determine whether to create an observation? Then, Boot auto-configuration can add a predicate not to create observations for actuator endpoints.

Adding cancellation as best effort sounds fragile and may introduce a hard to debug intermediate state.

wilkinsona commented 1 year ago

As @mhalbritter described above, using a predicate has already been considered but we feel that it's too brittle. Observations are created before any routing has been performed by DispatcherServlet. As a result, any predicate that wants to prevent creation based on that routing has to try to reproduce the routing logic. It's this that's brittle as the potential for subtle differences in the logic is high.

jonatan-ivanov commented 1 year ago

@wilkinsona Is it possible to somehow make that routing logic (or part of it) available to the Filter so that it can find out if an actuator endpoint was hit? (I assume not really but I'm not familiar with that routing at all.)

There is one more thing I was missing from the list above; I think this could not really work for spans either (ignoring the parent-child issue): the span must be started in the onStart method, if a span is created it must be ended, if it is ended, it will be reported. I'm not aware of any way a span can be "cancelled" on the fly in any tracing library (except through a similar "filtering" mechanism that the ObservationPredicate provides).

I think if we introduce a "best effort cancel" functionality, we will still have some metrics and spans even if the Observation is cancelled (+ all the things that users can create and not able to rollback).

ttddyy commented 1 year ago

@wilkinsona

using a predicate has already been considered but we feel that it's too brittle.

Do you mean ObservationPredicate? Sorry, I meant a generic predicate that works for ServerHttpObservationFilter which is a servlet filter.

Here is the sudo code of what I meant:

@Bean
@ConditionalOnMissingFilterBean
public FilterRegistrationBean<ServerHttpObservationFilter> webMvcObservationFilter(
    // if we could get the HandlerMappings for actuator or get all and filter
    ...,  List<HandlerMapping> actuatorHandlerMappings) {

  ...

  Predicate<HttpServletRequest> predicate = (request) -> {

    HandlerExecutionChain chain = getFirstMatchingHandlerMapping(actuatorHandlerMappings, request);

    // If DispatcherServlet could reuse the matched actuator mapping
    // to avoid look up the mappings again.
    request.setAttribute("MATCHED_HANDLER", chain);

    return chain == null;
  };

  ServerHttpObservationFilter filter = new ServerHttpObservationFilter(registry, convention, predicate);
  FilterRegistrationBean<ServerHttpObservationFilter> registration = new FilterRegistrationBean<>(filter);
  ...
  return registration;
}

As a result, any predicate that wants to prevent creation based on that routing has to try to reproduce the routing logic.

So, yes, this idea is also the same, shifting the handler mapping matching logic to the predicate implementation performed by the ServerHttpObservationFilter. Since the creation of the observation happened in the servlet filter, the decision to create the observation needs to happen before that if we want to control the observation creation...

mhalbritter commented 1 year ago

The problem is not only the ServerHttpObservationFilter but every component which creates observations until it hits the actuator endpoint. The way the PR tries solves this issue with an ObservationPredicate works for ServerHttpObservationFilter, but it won't work for custom filters which users install (or even maybe some filters like the various Spring Security ones).

If we want to suppress observations on certain conditions, I think we need a more general solution. Right now, we can suppress them when they get started, but sometimes you don't have enough information at that point to decide that.

wilkinsona commented 1 year ago

Do you mean ObservationPredicate? Sorry, I meant a generic predicate that works for ServerHttpObservationFilter which is a servlet filter.

The exact type of the predicate doesn't matter as the problem's the same.

Since the creation of the observation happened in the servlet filter, the decision to create the observation needs to happen before that if we want to control the observation creation...

Exactly, and therein lies the problem.

this idea is also the same, shifting the handler mapping matching logic to the predicate implementation performed by the ServerHttpObservationFilter

This is what we don't want to do. It's brittle as Spring MVC hasn't really been designed with this in mind. It's slow as it results in the mapping being performed twice. It can also potentially yield the wrong answer as a subsequent filter may complete already the routing. For example, Jersey can be configured as a filter that runs in the filter chain with MVC's dispatcher servlet at the end. Jersey's filter may intercept the request and prevent it from ever reaching MVC.

@mhalbritter summarises the problem nicely:

Right now, we can suppress them when they get started, but sometimes you don't have enough information at that point to decide that.

While I'm sure it's not easy, I think Micrometer needs to solve this problem. Without such a general purpose solution, problems like https://github.com/spring-projects/spring-boot/issues/34801 either won't be addressed or they will be addressed in a variety of different ways that are brittle and not completely reliable.

marcingrzejszczak commented 10 months ago

While I'm sure it's not easy, I think Micrometer needs to solve this problem. Without such a general purpose solution, problems like https://github.com/spring-projects/spring-boot/issues/34801 either won't be addressed or they will be addressed in a variety of different ways that are brittle and not completely reliable.

I see a bigger problem cause metrics can't be cancelled, nor can be an OpenTelemetry span. You can cancel only a Brave span.

Why can't we ask the users to opt-in for URLs that they want to observe? That would be much easier, the Observation filter would be registered only for the provided URL patterns.

wilkinsona commented 10 months ago

Why can't we ask the users to opt-in for URLs that they want to observe?

Various reasons are described above. I'll try to summarise.

If you try to make the decision up front, you either have to write a predicate that matches what the underlying web framework will do or you have to run the web framework's matching logic twice. Both are complicated by the possibility that you don't know which Framework will handle the request if you're using two (such as Spring MVC and Jersey) in the same app.

The only option that we've thought of thus far to avoid these problems is to allow the opt-out to be done retrospectively.

jonatan-ivanov commented 10 months ago

Isn't the same thing true for logging? Once one emits a log event, it is out of the hands of the "instrumentation". Similarly for metrics, once one recorded a value, it's out of the hands of the instrumentation.

We can add a cancel method to the Observation interface but I have no idea how to implement it so that it will not cause more harm than good: data will be recorded and reported once start is called.

wilkinsona commented 10 months ago

I don't think logging's similar. A log event occurs at a specific point in time and once it has been emitted it doesn't change. By contrast, observations and some metrics cover a period of time. During that time more can be learned about the nature of the things that's being observed, such as precisely where an HTTP request will be routed. There's no equivalent with logging as everything's captured when the event's created so the need to cancel it as the nature of the event becomes clear doesn't arise.

jonatan-ivanov commented 9 months ago

Hi Andy,

Sorry for the late reply. I think if you are talking about the Observation as a whole, you are right. From the ObservationHandler perspective though, events are when you start an observation or stop it or signal an error on it. These events (unlike an Observation) are instantaneous (~like log events?). We also have an .event signal where you can attach arbitrary and instantaneous events to an Observation. None of these can be taken back once you signaled them.

I get your point about learning more about Observations as we go that's why you can attach key-values and events after you started an Observation, though we recommend attaching the key-values you know as early as possible but as you cannot "cancel" log.info, observation.start is not "cancellable" either.

Let me show two examples that might demo the issue better than I explained above:

ObservationTextPublisher is a handler that emits log messages when you start/stop/signal an error on an observation. If the observation would be cancelled and stop never called, we would end-up with incomplete logs (I admit, this might not be a huge problem alone).

DefaultMeterObservationHandler starts a LongTaskTimer in the onStop method which is meant to track operations that are in progress. If the observation will be cancelled and stop will be never called that means the operation is still in progress which is not the case since the operation in reality did not even start. It is also a memory leak since LongTaskTimer keeps the record as long as you stop the task.