kubernetes-sigs / controller-runtime

Repo for the controller-runtime subproject of kubebuilder (sig-apimachinery)
Apache License 2.0
2.55k stars 1.15k forks source link

Consider using OpenTelemetry for metrics instead of Prometheus #305

Open grantr opened 5 years ago

grantr commented 5 years ago

Using the Prometheus library to collect metrics works fine mostly, but has some limitations: #258 wants to change the way metrics are aggregated, and #297 wants to add additional handlers to the manager's HTTP endpoint.

Maybe this is a far-out idea, but I wonder if switching to OpenCensus for measurement instead of Prometheus client at this early stage would be a good idea. Tl;dr: OpenCensus is a collection of libraries in multiple languages that facilitates the measurement and aggregation of metrics in-process and is agnostic to the export format used. It doesn't replace Prometheus the service, it just replaces Prometheus the Go library. OpenCensus can export to Prometheus servers, so this is strictly an in-process change.

The OpenCensus Go library is similar to the Prometheus client, but separates the collection of metrics from their aggregation and export. This theoretically allows libraries to be instrumented without dictating how users will aggregate metrics (solving #258) and export metrics (solving #297), though default solutions can be provided for both (likely the same as today's default bucketing and Prometheus HTTP exporter).

Here's an example from knative/pkg of defining measures and views (aggregations): https://github.com/knative/pkg/blob/53b1235c2a85e1309825bc467b3bd54243c879e6/controller/stats_reporter.go. The view is defined separate from the measure, giving the library user the ability to define their own views with library-defined metrics.

And an example of exporting metrics to either stackdriver or prometheus: https://github.com/knative/pkg/blob/225d11cc1a40c0549701fb037d0eba48ee87dfe4/metrics/exporter.go. The user of the library can export views in whatever format they wish, independent of the measures and views that are defined.

It additionally has support for exporting traces, which IMO would be a useful debugging tool and a good use for the context arguments in the client interface (mentioned in #265). Threading the trace id into that context would give the controller author a nice overview of the entire reconcile, with spans for each request, cached or not.

DirectXMan12 commented 5 years ago

/kind feature

I do kind-of like this idea. Want to put together a PoC?

DirectXMan12 commented 5 years ago

/priority backlog

DirectXMan12 commented 5 years ago

/good-first-issue

grantr commented 5 years ago

Working on this now! /assign

fejta-bot commented 5 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

DirectXMan12 commented 5 years ago

/remove-lifecycle stale

fejta-bot commented 5 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

DirectXMan12 commented 5 years ago

/lifecycle frozen

negz commented 5 years ago

I was just exploring controller-runtime's observability stance, noticed the kubebuilder book mentioned Prometheus metrics, thought "oh man I wish they used Opencensus instead", came here to thumbs-up this issue, then found past me already did that. ;)

Throwing my two cents as a controller-tools user who has instrumented systems in the past with both Prometheus's SDK and Opencensus's I vastly prefer the latter for basically all the reasons @grantr outlines in this issue.

DirectXMan12 commented 5 years ago

I'm going to add

/help

here. I'm thinking we might just want to push ahead with this. At this point, it's probably worth trying to make use of OpenTelemetry (the merged version of OpenCensus & OpenTracing), since that's the future of these efforts (https://github.com/open-telemetry/opentelemetry-go).

If anybody's willing to draw up a rough design of how things'd look and put forward a prototype, I'd be happy to review.

vincepri commented 4 years ago

@grantr are you still interested in working on this?

/lifecycle frozen

grantr commented 4 years ago

I think it's still valuable, but I don't have bandwidth to work on it. https://github.com/kubernetes-sigs/controller-runtime/pull/368 is the final state of my attempt.

hasheddan commented 4 years ago

I would love to help out here and will get started on implementation :)

/assign

ncdc commented 4 years ago

@hasheddan is this something you started on?

hasheddan commented 4 years ago

@ncdc I unfortunately have not gotten around to it yet, but would still love to help out. Don't want to block anyone else who is already getting started though :+1:

ncdc commented 4 years ago

No worries, just checking so folks don't step on your toes! Thanks for the update.

vincepri commented 4 years ago

@bboreham This group is interested in getting OpenTracing in Controller Runtime

cc @DirectXMan12

bboreham commented 4 years ago

@vincepri presume you meant OpenTelemetry. I opened a PR to make it easier to see what I've done: #1211

evankanderson commented 4 years ago

A few comments from experience in Knative:

If you're just starting to add instrumentation, try to use the same data in the context for both (structured) logging, tracing, and metrics. This has three benefits:

Design your Resource schema ahead of time. For many applications, the application acts a single tenant and only a single Resource is needed. Kubernetes controllers are not the typical application; I'd suggest defaulting to a separate Resource for each Kubernetes Resource, on the theory that k8s Resources probably represent an "entity which produces telemetry". In some cases for short-lived resources (for example, a TaskRun in Tekton), it may make sense to associate the resource with the parent (i.e. Task in Tekton).

Note that using Metrics and Traces in a multi-resource way requires passing provider options to NewAccumulator/NewTracerProvider... you probably want to cache these (on the context?), rather than creating a new one every time.

DirectXMan12 commented 4 years ago

We've got some work done for us in that regard from @dashpole's work on the apiserver tracing KEP, so we should consider that as well

alvaroaleman commented 3 years ago

Removing good first issue label, as it is not quite clear if we want this and if so, how it would look. /remove-good-first-issue