Publish health check results as events

open-telemetry / opentelemetry-dotnet-contrib

This repository contains set of components extending functionality of the OpenTelemetry .NET SDK. Instrumentation libraries, exporters, and other components can find their home here.

https://opentelemetry.io

Apache License 2.0

478 stars 285 forks source link

Publish health check results as events #1855

Open thompson-tomo opened 6 months ago

thompson-tomo commented 6 months ago

Component

None

Is your feature request related to a problem?

Yes. I have implemented health checks within my application and I would like to have the ability to see/visualise the result of these periodic health checks.

What is the expected behavior?

When the periodic health checks complete the results should be able to be captured as OTEL events so that analysis can occur. These events should be distinct results and not be connected to the other results in the batch

Which alternative solutions or features have you considered?

Do a local library

Additional context

No response

martinjt commented 6 months ago

I'm not sure what you're asking here.

ActivityEvent (the equivalent of a SpanEvent) is attached to an Activity, so the health check would appear as an Activity, I'm not sure what you'd be wanting to see as the Event?

Activity is the equivalent of a Span, and since a health check is something that needs to be measured by time, I'm not sure what would be better to represent it.

thompson-tomo commented 6 months ago

What I am thinking is to have a OTEL health check publisher which for each health report received generates a new activity & associates with that activity an event which is linked to the span/activity. This way we can see the events happening ie health checks. It ties into the other issue I have created which looks for conventions covering things like outcome.

martinjt commented 6 months ago

I'm not sure what you mean by "publisher".

Right now, http health checks result in an activity being created, and any of the checks that are part of that health check (third party http, database, etc.) Will generate their own spans that are part of the parent healthcheck span.

Can you give a more concrete example of how that should change? I'm not seeing where events are part of it.

thompson-tomo commented 6 months ago

Maybe this helps: https://learn.microsoft.com/en-us/aspnet/core/host-and-deploy/health-checks?view=aspnetcore-8.0#health-check-publisher

In this case, I want to have my health checks being performed periodically which happens automatically on registration of a publisher. The publisher processes a batch of report entry's for all health checks. Each report entry is processed & results in a new activity being generated (unfortunately can't be connected to the original at this stage). Within this activity an event is added to capture the result of the health check.

I have also logged (https://github.com/dotnet/aspnetcore/issues/56028) to make it possible for the report to also contain information to create that link back to the original activity.

martinjt commented 6 months ago

This is interesting and new to me.

What I would expect here is for the publish to be a noop, and the actual act of doing the healthcheck to be individual traces (with the checks part of each trace).

The publish itself is irrelevant since it should be the trace you're looking for to see if the healthcheck succeeded.

So they'd present the same as a HTTP health check, but instead of the root being the HTTP request, it would be an internal span instead.

I don't think a publisher is actually the right way to model it.

thompson-tomo commented 6 months ago

Agree with you that the actual task of doing the health check would be 1 trace but for me the publisher is key. As Having the publisher produce events would mean that absolutely all dotnet health checks automatically gain the the functionality of events being created with the result being the attributes without the need for any changes within the health check. Hence low barrier to entry. The reason I would like the results to be recorded as events is that ideally we could look at alerting based on failed health checks & it ties nicely into the requests I have to port across more functionality from ECS to open telemetry.

martinjt commented 6 months ago

What I'm saying is that the Publish is the Trace, you don't need the PublishAsync part at all. Your healthcheck trace will be the thing you use for alerting on failures etc.

I'll have to look at the background healthcheck thing to see how it's triggering, I'd hope there's a point that you can use to either extend it, or it's already producing an activity there that you can just listen to.

The publish "is" happening in that scenario, because there's a trace for each execution of the healthcheck, and that is being published by the OpenTelemetry SDK as a trace.

Events don't have durations, and are semantically "part of the execution of a span", so even if we were to think that doing something in PublishAsync were the right route, it wouldn't be events, at best we'd be looking at adding Span Links to the Span created by PublishAsync, but I'm not sure how that would help.

thompson-tomo commented 6 months ago

I fully agree about having the trace of the health check for which we can find out the duration etc. In fact I even mentioned that in the aspnet core issue I created & linked above.

I also agree that a span would need to be created for each record being published with links created to the parent hence why I also mentioned that I the aspnet core I created.

In an ideal world, dotnet it self would create an event on the activity itself and internalise what I am proposing here & a user would only need to register a no-op publisher to gain the desired functionality. However I don't forsee it being done anytime soon.

For me events in OTEL are under developed compared to the other concepts and alot could be gained from looking at ECS which is what was prompted a number of the requests I have logged. I feel that events is the best spot to record the result of a health check as theoretically a trace could end up with multiple events ie one for doing a db query & one for the result of the health check.

martinjt commented 6 months ago

You should absolutely have multiple spans for the healthcheck not events on a single span, that wouldn't make sense..

I'm just not seeing how Events on the publish activity would make any sense here. Each of the checks that are performed as part of the Healthcheck has a duration, the time it took to do that check, therefore using something that doesn't have a duration (a span event) wouldn't make sense here.

I can see a way we could workaround the lack of a span using a logging looking for the specific eventIds, but that feels a little hacky.

thompson-tomo commented 6 months ago

Agree not interested in a hack solution. Let me provide an example of what I am thinking of based upon https://opentelemetry.io/docs/concepts/signals/traces/


{
  "name": "database health check",
  "context": {
    "trace_id": "0x5b8aa5a2d2c872e8321cf37308d69df2",
    "span_id": "0x051581bf3cb55c13"
  },
  "parent_id": null,
  "start_time": "2022-04-29T18:52:58.114201Z",
  "end_time": "2022-04-29T18:52:58.114687Z",
  "attributes": {
    "http.route": "some_route1"
  },
  "events": [
    {
      "name": "database.healthcheck",
      "timestamp": "2022-04-29T18:52:58.114561Z",
      "attributes": {
        "db.system": "sqlserver"
      },
      "outcome":"success"//request exists
   }.
   {
      {
      "name": "database.query",
      "timestamp": "2022-04-29T18:52:58.114561Z",
      "attributes": {
        "db.system": "sqlserver"
      },
      "outcome":"success"//request exists
    }
  ]
}

With this approach it is clear that thr result of the health check has been published as an event alongside any other events ie querying the db and is part of the same transaction/span.

bunkrur commented 6 months ago

I feel like there are two things going on here that may complicate an approach.

1. The Publisher: I don't think there should be a publisher for this. If the HealthChecks library could simply have an ActivitySource that an instrumentation or existing publisher could listen to, that would drastically simplify things. If there was an easy decoration or activity extension for a HealthCheck, that would make the activity available to lots of consumers without having to wrap the HealthCheck in how a publisher wants to interpret a HealthCheck's run.

2. A HealthCheck Report vs a Trace: Given the JSON example above, I believe you're more interested in a report than an actual trace.

The difference to me is that a trace :

Illustrates the steps (spans) taken during a healthcheck's execution.
The spans show the duration of the operations within the healthcheck.
Spans have a timestamp
Instrumentation will (in most cases) automatically create the peer.service/db.name context when an AsyncLocal interaction calls a dependency.
Spans already have an "outcome" per span with the success attribute (ok,unset,error).
Spans have the ability to showcase an event, which is relative to the span - events usually illustrate steps within a span (enqueue, sent, responsereceived, etc) that do not have a status by default.

Whereas, a HealthCheck Report very close to what you have in your JSON example, where:

There is a total duration
An overall success binary
The events are essentially single tests (entries)
Entries have
- Status
- Duration
- Tags
- Data (Keyval pair)
- etc

Example below:

  "Version": "1.0.2943.0",
  "Status": "Healthy",
  "TotalDuration": "00:00:00.1385633",
  "Pod": "iridebikes-954c5788bb-jtkkl",
  "HealthChecks": {
    "ClusterHealthCheck": {
      "Status": "Healthy",
      "Description": null,
      "Duration": "00:00:00.0323811"
    },
    "DependencyHealthCheck": {
      "Status": "Healthy",
      "Description": "All function app resources are Healthy. Count: 1",
      "Duration": "00:00:00.0008600"
    },
    "AuthenticationHealthCheck": {
      "Status": "Healthy",
      "Description": null,
      "Duration": "00:00:00.1381652"
    },
    "MqttListenerHealthCheck": {
      "Status": "Healthy",
      "Description": null,
      "Duration": "00:00:00.0038396"
    },
    "MemoryHealthCheck": {
      "Status": "Healthy",
      "Description": "97.01% is MORE than the threshold 25% of available memory.",
      "Duration": "00:00:00.0000814"
    },
    "azurekeyvault": {
      "Status": "Healthy",
      "Description": null,
      "Duration": "00:00:00.0088268"
    },
    "DoucmentThing": {
      "Status": "Healthy",
      "Description": null,
      "Duration": "00:00:00.0056534"
    },
    "OtherThing": {
      "Status": "Healthy",
      "Description": null,
      "Duration": "00:00:00.0053188"
    }
  }
}

And the reports dont illustrate when specific cascading activities happened within the test run, like your example.

Usually, you can use a solution (Application Insights Webtests, some other Synthetic HealthCheck monitoring, etc) to store the body of the test results when it calls into the /health/report endpoint or wherever you map your healthcheck report. The Application Insights Publisher is an example where the report is consumed and mapped to an Application Insights event (table entry) and published from the report itself: https://github.com/Xabaril/AspNetCore.Diagnostics.HealthChecks/blob/2f6a10f89e3c6d97f232f4157a80e3a9e1470dc5/src/HealthChecks.Publisher.ApplicationInsights/ApplicationInsightsPublisher.cs#L56

If I was to make a similar request to increase the observability for HealthChecks in dotnet, it would to be to:

Decorate a HealthCheck as a Trace, and let the diagnostic listeners for my application's insturmentation work as intended to generate spans based on the run of the Healthcheck
Have a toggle to disable generating Traces from HealthCheck runs
Have a similar activity listener name to how the Meter Source is done for HealthCheck Metrics
Have no opinions about how the trace is formatted and let me add my own attributes/events as needed.

thompson-tomo commented 5 months ago

I think a key difference is that i am not seeing a need for health checks to be grouped together in a single as your report showcases but much more powerful for me to have them associated with the trace as an event that way you can see all the spans which resulted in the health check returning the result which it did. If was to have a report like illustrated i would be requesting the following:

  "Version": "1.0.2943.0",
  "Status": "Healthy",
  "TotalDuration": "00:00:00.1385633",
  "Pod": "iridebikes-954c5788bb-jtkkl",
  "HealthChecks": {
    "ClusterHealthCheck": {
      "Status": "Healthy",
      "Description": null,
      "Duration": "00:00:00.0323811",
      "Context":{
         "Trace_id": "0x5b8aa5a2d2c872e8321cf37308d69df0",
       }
    },
    "DependencyHealthCheck": {
      "Status": "Healthy",
      "Description": "All function app resources are Healthy. Count: 1",
      "Duration": "00:00:00.0008600",
      "Context":{
         "Trace_id": "0x5b8aa5a2d2c872e8321cf37308d69df2",
       }
    },
    "AuthenticationHealthCheck": {
      "Status": "Healthy",
      "Description": null,
      "Duration": "00:00:00.1381652",
      "Context":{
         "Trace_id": "0x5b8aa5a2d2c872e8321cf37308d69df1",
       }
    },
    "MqttListenerHealthCheck": {
      "Status": "Healthy",
      "Description": null,
      "Duration": "00:00:00.0038396",
      "Context":{
         "Trace_id": "0x5b8aa5a2d2c872e8321cf37308d69df4",
       }
    },
    "MemoryHealthCheck": {
      "Status": "Healthy",
      "Description": "97.01% is MORE than the threshold 25% of available memory.",
      "Duration": "00:00:00.0000814",
      "Context":{
         "Trace_id": "0x5b8aa5a2d2c872e8321cf37308d69df5",
       }
    },
    "azurekeyvault": {
      "Status": "Healthy",
      "Description": null,
      "Duration": "00:00:00.0088268",
      "Context":{
         "Trace_id": "0x5b8aa5a2d2c872e8321cf37308d69df7",
       }
    },
    "DoucmentThing": {
      "Status": "Healthy",
      "Description": null,
      "Duration": "00:00:00.0056534",
      "Context":{
         "Trace_id": "0x5b8aa5a2d2c872e8321cf37308d69df6",
       }
    },
    "OtherThing": {
      "Status": "Healthy",
      "Description": null,
      "Duration": "00:00:00.0053188",
      "Context":{
         "Trace_id": "0x5b8aa5a2d2c872e8321cf37308d69df9",
       }
    }
  }
}

However could easily be constructed by the reporting/visualisation tool using the trace to get the context, environment, duration & the remaining health check info comes from the event.

This approach also enable visualisation/reporting tools to create alerts based on the events, events could be aggregated to form Merics.

I do agree if we could do-away with enabling a publisher however the question then becomes how can a developer mantain control if the health checks are being completed periodically at the same time, i believe Microsoft making changes to the core are quite a while away as it is hard to shine a spotlight on suggestions like this.

One thing which is missing from your example is the data property which for me are a one to one mapping of the semantic attributes in open telemetry as you search for all sql database events and then filter to see the health checks or alternatively see the db command events.

johncrim commented 4 months ago

I generally agree with most of both of your comments @thompson-tomo and @bunkrur, but I disagree with the title of this issue. I think health checks should be spans, not events; but events within the health check span makes sense.

I've modelled health checks as 1 span/activity for all health checks in a "run internal checks" cycle, and 1 span/activity for each health check, using the .NET health check framework.

I have had to write my own exporter to app insights AvailabilityResults, since that table isn't supported in the Azure Monitor otel exporter. App insights does have some decent reporting on availability using this table, so having the data there is useful. It seems this schema was originally intended for external availability checks (eg fetching a URL from global servers), but it is also the best match in the app insights schemas for health check telemetry.

In general I want to do a modified version of SuppressInstrumentation for telemetry within the health check - so when the health check passes inner telemetry isn't reported, but when the health check fails all inner telemetry is included. (I haven't implemented this yet.)

I do agree that health check telemetry should be standardized and supported as a platform capability, it doesn't make sense for so much duplicate effort to be expended.

wertzui commented 4 months ago

I think this might give out false "healthy" statuses. Imagine the following scenario:

Health check request goes out to the service via http.
Health check request reaches the service.
Health check executes successfully.
The service sends back the http response.
The service reports the OTEL Span to the collector without any error.
The response of the health check request got somehow lost on its way back (maybe a firewall).
The originator of the health check requests reports the SPAN with an exception to the collector
On the way to the collector the span a. successfully reaches the collector b. gets lost

In scenario 8.a., if the user now only searches for the health check events and not the overarching trace, they might get the false impression that the resource is healthy. In scenario 8.b., even if the user checks the overarching trace, they cannot see the error easily.

But if only the originator of the health check reports back the outcome of a health check, the user can easily see if the trace is there or not and what the outcome of the total trace is.

johncrim commented 4 months ago

I've implemented something similar to this. In short, I wanted our health checks to be span/otel/Activity aware, so that they all are in a tree, and any other activities within each health check are tracked in the tree. Here's an example:

In order to implement this within the .NET health check framework, and with Application Insights, I had to implement 2 main classes:

HealthCheckService The new health check service wraps each health check in an Activity, and also provides a top-level Activity for all the health checks. It replaces the DefaultHealthCheckService, which is ILogger-oriented and doesn't provide any OpenTelemetry support. So the logic is equivalent, but the telemetry is completely different.
HealthCheckPublisher The health check publisher is both an IHealthCheckPublisher, and listener to the Activities exposed by the HealthCheckService. It combines the info from both, and maps the health check and activity info to AvailabilityTelemetry that is sent to app insights (classic - the Azure Monitor exporter doesn't support this AI telemetry type).

The HealthCheckPublisher was more difficult/more fragile than it needed to be due to the HealthCheck framework preventing attaching other info to HealthReport and HealthReportEntry.

To re-iterate my point above, I think the correct approach is to make HealthChecks spans, not events. I can share or publish my HealthCheckService as a replacement for the DefaultHealthCheckService - I think that would make sense as part of this project, because it OpenTelemetry-ifies part of the .NET framework. Let me know if there's interest in me contributing that.

The HealthCheckPublisher should really be part of an Exporter. The issue I see in getting this fully integrated is that we would need to map health check properties to defined otel conventions/schema.

thompson-tomo commented 3 months ago

I think this might give out false "healthy" statuses.

Imagine the following scenario:

Health check request goes out to the service via http.

Health check request reaches the service.

Health check executes successfully.

The service sends back the http response.

The service reports the OTEL Span to the collector without any error.

The response of the health check request got somehow lost on its way back (maybe a firewall).

The originator of the health check requests reports the SPAN with an exception to the collector

On the way to the collector the span

a. successfully reaches the collector

b. gets lost

In scenario 8.a., if the user now only searches for the health check events and not the overarching trace, they might get the false impression that the resource is healthy.

In scenario 8.b., even if the user checks the overarching trace, they cannot see the error easily.

But if only the originator of the health check reports back the outcome of a health check, the user can easily see if the trace is there or not and what the outcome of the total trace is.

I feel the big difference is that for me the health check results published as events should not be initiated via a http request but instead is handled via periodic publisher which dotnet supports natively. This would allow the data to be graphed over time.

thompson-tomo commented 3 months ago

To re-iterate my point above, I think the correct approach is to make HealthChecks spans, not events.

Yes I do agree a span should be produced for each individual health-check. This should be provided out of the box by dotnet. I think I even created an issue for it.

Where I do differ is this span should have an event which encompasses the health check result. That way we can react to an unhealthy event

martinjt commented 3 months ago

What would be on the event, that wouldn't be attributes on the span? Would there be more than one event on the span? And if so, what would they contain that child spans wouldn't contain as attributes?

The span would either be in response to a http request, so it would be a HTTP span, or it would be trigger when the background healthcheck triggered.

This would allow the data to be graphed over time.

I think this part is more about leaking backend capability into the application layer. There are a lot of backends that don't have the capability to graph span data, but do allow that for log data.

The defaults out of an applications shouldn't cater for those backends. Potentially they could allow extension points that allow you convert them if you want.

thompson-tomo commented 3 months ago

For me the event should resemble the health report entity which is generated as part of the health check. That way the event bring captured is the health check result, all transport etc is managed via the span