open-telemetry / opentelemetry-dotnet-contrib

This repository contains set of components extending functionality of the OpenTelemetry .NET SDK. Instrumentation libraries, exporters, and other components can find their home here.
https://opentelemetry.io
Apache License 2.0
421 stars 251 forks source link

Publish health check results as events #1855

Open thompson-tomo opened 3 weeks ago

thompson-tomo commented 3 weeks ago

Component

None

Is your feature request related to a problem?

Yes. I have implemented health checks within my application and I would like to have the ability to see/visualise the result of these periodic health checks.

What is the expected behavior?

When the periodic health checks complete the results should be able to be captured as OTEL events so that analysis can occur. These events should be distinct results and not be connected to the other results in the batch

Which alternative solutions or features have you considered?

Do a local library

Additional context

No response

martinjt commented 3 weeks ago

I'm not sure what you're asking here.

ActivityEvent (the equivalent of a SpanEvent) is attached to an Activity, so the health check would appear as an Activity, I'm not sure what you'd be wanting to see as the Event?

Activity is the equivalent of a Span, and since a health check is something that needs to be measured by time, I'm not sure what would be better to represent it.

thompson-tomo commented 3 weeks ago

What I am thinking is to have a OTEL health check publisher which for each health report received generates a new activity & associates with that activity an event which is linked to the span/activity. This way we can see the events happening ie health checks. It ties into the other issue I have created which looks for conventions covering things like outcome.

martinjt commented 3 weeks ago

I'm not sure what you mean by "publisher".

Right now, http health checks result in an activity being created, and any of the checks that are part of that health check (third party http, database, etc.) Will generate their own spans that are part of the parent healthcheck span.

Can you give a more concrete example of how that should change? I'm not seeing where events are part of it.

thompson-tomo commented 3 weeks ago

Maybe this helps: https://learn.microsoft.com/en-us/aspnet/core/host-and-deploy/health-checks?view=aspnetcore-8.0#health-check-publisher

In this case, I want to have my health checks being performed periodically which happens automatically on registration of a publisher. The publisher processes a batch of report entry's for all health checks. Each report entry is processed & results in a new activity being generated (unfortunately can't be connected to the original at this stage). Within this activity an event is added to capture the result of the health check.

I have also logged (https://github.com/dotnet/aspnetcore/issues/56028) to make it possible for the report to also contain information to create that link back to the original activity.

martinjt commented 3 weeks ago

This is interesting and new to me.

What I would expect here is for the publish to be a noop, and the actual act of doing the healthcheck to be individual traces (with the checks part of each trace).

The publish itself is irrelevant since it should be the trace you're looking for to see if the healthcheck succeeded.

So they'd present the same as a HTTP health check, but instead of the root being the HTTP request, it would be an internal span instead.

I don't think a publisher is actually the right way to model it.

thompson-tomo commented 3 weeks ago

Agree with you that the actual task of doing the health check would be 1 trace but for me the publisher is key. As Having the publisher produce events would mean that absolutely all dotnet health checks automatically gain the the functionality of events being created with the result being the attributes without the need for any changes within the health check. Hence low barrier to entry. The reason I would like the results to be recorded as events is that ideally we could look at alerting based on failed health checks & it ties nicely into the requests I have to port across more functionality from ECS to open telemetry.

martinjt commented 3 weeks ago

What I'm saying is that the Publish is the Trace, you don't need the PublishAsync part at all. Your healthcheck trace will be the thing you use for alerting on failures etc.

I'll have to look at the background healthcheck thing to see how it's triggering, I'd hope there's a point that you can use to either extend it, or it's already producing an activity there that you can just listen to.

The publish "is" happening in that scenario, because there's a trace for each execution of the healthcheck, and that is being published by the OpenTelemetry SDK as a trace.

Events don't have durations, and are semantically "part of the execution of a span", so even if we were to think that doing something in PublishAsync were the right route, it wouldn't be events, at best we'd be looking at adding Span Links to the Span created by PublishAsync, but I'm not sure how that would help.

thompson-tomo commented 3 weeks ago

I fully agree about having the trace of the health check for which we can find out the duration etc. In fact I even mentioned that in the aspnet core issue I created & linked above.

I also agree that a span would need to be created for each record being published with links created to the parent hence why I also mentioned that I the aspnet core I created.

In an ideal world, dotnet it self would create an event on the activity itself and internalise what I am proposing here & a user would only need to register a no-op publisher to gain the desired functionality. However I don't forsee it being done anytime soon.

For me events in OTEL are under developed compared to the other concepts and alot could be gained from looking at ECS which is what was prompted a number of the requests I have logged. I feel that events is the best spot to record the result of a health check as theoretically a trace could end up with multiple events ie one for doing a db query & one for the result of the health check.

martinjt commented 3 weeks ago

You should absolutely have multiple spans for the healthcheck not events on a single span, that wouldn't make sense..

I'm just not seeing how Events on the publish activity would make any sense here. Each of the checks that are performed as part of the Healthcheck has a duration, the time it took to do that check, therefore using something that doesn't have a duration (a span event) wouldn't make sense here.

I can see a way we could workaround the lack of a span using a logging looking for the specific eventIds, but that feels a little hacky.

thompson-tomo commented 3 weeks ago

Agree not interested in a hack solution. Let me provide an example of what I am thinking of based upon https://opentelemetry.io/docs/concepts/signals/traces/


{
  "name": "database health check",
  "context": {
    "trace_id": "0x5b8aa5a2d2c872e8321cf37308d69df2",
    "span_id": "0x051581bf3cb55c13"
  },
  "parent_id": null,
  "start_time": "2022-04-29T18:52:58.114201Z",
  "end_time": "2022-04-29T18:52:58.114687Z",
  "attributes": {
    "http.route": "some_route1"
  },
  "events": [
    {
      "name": "database.healthcheck",
      "timestamp": "2022-04-29T18:52:58.114561Z",
      "attributes": {
        "db.system": "sqlserver"
      },
      "outcome":"success"//request exists
   }.
   {
      {
      "name": "database.query",
      "timestamp": "2022-04-29T18:52:58.114561Z",
      "attributes": {
        "db.system": "sqlserver"
      },
      "outcome":"success"//request exists
    }
  ]
}

With this approach it is clear that thr result of the health check has been published as an event alongside any other events ie querying the db and is part of the same transaction/span.

bunkrur commented 2 weeks ago

I feel like there are two things going on here that may complicate an approach.

1. The Publisher: I don't think there should be a publisher for this. If the HealthChecks library could simply have an ActivitySource that an instrumentation or existing publisher could listen to, that would drastically simplify things. If there was an easy decoration or activity extension for a HealthCheck, that would make the activity available to lots of consumers without having to wrap the HealthCheck in how a publisher wants to interpret a HealthCheck's run.

2. A HealthCheck Report vs a Trace: Given the JSON example above, I believe you're more interested in a report than an actual trace.

The difference to me is that a trace :

image

Whereas, a HealthCheck Report very close to what you have in your JSON example, where:

Example below:

  "Version": "1.0.2943.0",
  "Status": "Healthy",
  "TotalDuration": "00:00:00.1385633",
  "Pod": "iridebikes-954c5788bb-jtkkl",
  "HealthChecks": {
    "ClusterHealthCheck": {
      "Status": "Healthy",
      "Description": null,
      "Duration": "00:00:00.0323811"
    },
    "DependencyHealthCheck": {
      "Status": "Healthy",
      "Description": "All function app resources are Healthy. Count: 1",
      "Duration": "00:00:00.0008600"
    },
    "AuthenticationHealthCheck": {
      "Status": "Healthy",
      "Description": null,
      "Duration": "00:00:00.1381652"
    },
    "MqttListenerHealthCheck": {
      "Status": "Healthy",
      "Description": null,
      "Duration": "00:00:00.0038396"
    },
    "MemoryHealthCheck": {
      "Status": "Healthy",
      "Description": "97.01% is MORE than the threshold 25% of available memory.",
      "Duration": "00:00:00.0000814"
    },
    "azurekeyvault": {
      "Status": "Healthy",
      "Description": null,
      "Duration": "00:00:00.0088268"
    },
    "DoucmentThing": {
      "Status": "Healthy",
      "Description": null,
      "Duration": "00:00:00.0056534"
    },
    "OtherThing": {
      "Status": "Healthy",
      "Description": null,
      "Duration": "00:00:00.0053188"
    }
  }
}

And the reports dont illustrate when specific cascading activities happened within the test run, like your example.

Usually, you can use a solution (Application Insights Webtests, some other Synthetic HealthCheck monitoring, etc) to store the body of the test results when it calls into the /health/report endpoint or wherever you map your healthcheck report. The Application Insights Publisher is an example where the report is consumed and mapped to an Application Insights event (table entry) and published from the report itself: https://github.com/Xabaril/AspNetCore.Diagnostics.HealthChecks/blob/2f6a10f89e3c6d97f232f4157a80e3a9e1470dc5/src/HealthChecks.Publisher.ApplicationInsights/ApplicationInsightsPublisher.cs#L56

If I was to make a similar request to increase the observability for HealthChecks in dotnet, it would to be to:

thompson-tomo commented 2 weeks ago

I think a key difference is that i am not seeing a need for health checks to be grouped together in a single as your report showcases but much more powerful for me to have them associated with the trace as an event that way you can see all the spans which resulted in the health check returning the result which it did. If was to have a report like illustrated i would be requesting the following:

  "Version": "1.0.2943.0",
  "Status": "Healthy",
  "TotalDuration": "00:00:00.1385633",
  "Pod": "iridebikes-954c5788bb-jtkkl",
  "HealthChecks": {
    "ClusterHealthCheck": {
      "Status": "Healthy",
      "Description": null,
      "Duration": "00:00:00.0323811",
      "Context":{
         "Trace_id": "0x5b8aa5a2d2c872e8321cf37308d69df0",
       }
    },
    "DependencyHealthCheck": {
      "Status": "Healthy",
      "Description": "All function app resources are Healthy. Count: 1",
      "Duration": "00:00:00.0008600",
      "Context":{
         "Trace_id": "0x5b8aa5a2d2c872e8321cf37308d69df2",
       }
    },
    "AuthenticationHealthCheck": {
      "Status": "Healthy",
      "Description": null,
      "Duration": "00:00:00.1381652",
      "Context":{
         "Trace_id": "0x5b8aa5a2d2c872e8321cf37308d69df1",
       }
    },
    "MqttListenerHealthCheck": {
      "Status": "Healthy",
      "Description": null,
      "Duration": "00:00:00.0038396",
      "Context":{
         "Trace_id": "0x5b8aa5a2d2c872e8321cf37308d69df4",
       }
    },
    "MemoryHealthCheck": {
      "Status": "Healthy",
      "Description": "97.01% is MORE than the threshold 25% of available memory.",
      "Duration": "00:00:00.0000814",
      "Context":{
         "Trace_id": "0x5b8aa5a2d2c872e8321cf37308d69df5",
       }
    },
    "azurekeyvault": {
      "Status": "Healthy",
      "Description": null,
      "Duration": "00:00:00.0088268",
      "Context":{
         "Trace_id": "0x5b8aa5a2d2c872e8321cf37308d69df7",
       }
    },
    "DoucmentThing": {
      "Status": "Healthy",
      "Description": null,
      "Duration": "00:00:00.0056534",
      "Context":{
         "Trace_id": "0x5b8aa5a2d2c872e8321cf37308d69df6",
       }
    },
    "OtherThing": {
      "Status": "Healthy",
      "Description": null,
      "Duration": "00:00:00.0053188",
      "Context":{
         "Trace_id": "0x5b8aa5a2d2c872e8321cf37308d69df9",
       }
    }
  }
}

However could easily be constructed by the reporting/visualisation tool using the trace to get the context, environment, duration & the remaining health check info comes from the event.

This approach also enable visualisation/reporting tools to create alerts based on the events, events could be aggregated to form Merics.

I do agree if we could do-away with enabling a publisher however the question then becomes how can a developer mantain control if the health checks are being completed periodically at the same time, i believe Microsoft making changes to the core are quite a while away as it is hard to shine a spotlight on suggestions like this.

One thing which is missing from your example is the data property which for me are a one to one mapping of the semantic attributes in open telemetry as you search for all sql database events and then filter to see the health checks or alternatively see the db command events.