grantr opened this issue 5 years ago
/kind feature
I do kind-of like this idea. Want to put together a PoC?
/priority backlog
/good-first-issue
Working on this now! /assign
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale
/lifecycle frozen
I was just exploring controller-runtime's observability stance, noticed the kubebuilder book mentioned Prometheus metrics, thought "oh man I wish they used Opencensus instead", came here to thumbs-up this issue, then found past me already did that. ;)
Throwing in my two cents as a controller-tools user who has instrumented systems in the past with both Prometheus's SDK and OpenCensus's: I vastly prefer the latter, for basically all the reasons @grantr outlines in this issue.
I'm going to add /help here. I'm thinking we might just want to push ahead with this. At this point, it's probably worth trying to make use of OpenTelemetry (the merged version of OpenCensus & OpenTracing), since that's the future of these efforts (https://github.com/open-telemetry/opentelemetry-go).
If anybody's willing to draw up a rough design of how things'd look and put forward a prototype, I'd be happy to review.
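For reference, a rough sketch of what the tracing side of such a prototype might look like with opentelemetry-go (this is just an illustration, not a proposed controller-runtime design; the tracer name is arbitrary):

```go
package sketch

import (
	"context"

	"go.opentelemetry.io/otel"
)

// Arbitrary instrumentation name, for illustration only.
var tracer = otel.Tracer("sigs.k8s.io/controller-runtime")

func reconcile(ctx context.Context) error {
	// Each reconcile becomes a span; anything downstream that receives
	// this ctx can create child spans under it.
	ctx, span := tracer.Start(ctx, "Reconcile")
	defer span.End()

	_ = ctx // pass ctx into client calls, logging, metrics, etc.
	return nil
}
```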
@grantr are you still interested in working on this?
/lifecycle frozen
I think it's still valuable, but I don't have bandwidth to work on it. https://github.com/kubernetes-sigs/controller-runtime/pull/368 is the final state of my attempt.
I would love to help out here and will get started on implementation :)
/assign
@hasheddan is this something you started on?
@ncdc I unfortunately have not gotten around to it yet, but would still love to help out. Don't want to block anyone else who is already getting started though :+1:
No worries, just checking so folks don't step on your toes! Thanks for the update.
@bboreham This group is interested in getting OpenTracing in Controller Runtime
cc @DirectXMan12
@vincepri presume you meant OpenTelemetry. I opened a PR to make it easier to see what I've done: #1211
A few comments from experience in Knative:
If you're just starting to add instrumentation, try to use the same data in the context for all of (structured) logging, tracing, and metrics, so that one enriched ctx feeds Record(ctx, measurement), Info(ctx, "message"), and Start(ctx, "spanname") alike (see the sketch below).
Design your Resource schema ahead of time. For many applications, the application acts as a single tenant and only a single Resource is needed. Kubernetes controllers are not the typical application; I'd suggest defaulting to a separate Resource for each Kubernetes Resource, on the theory that k8s Resources probably represent an "entity which produces telemetry". In some cases, for short-lived resources (for example, a TaskRun in Tekton), it may make sense to associate the resource with the parent (i.e. the Task in Tekton).
Note that using Metrics and Traces in a multi-resource way requires passing provider options to NewAccumulator/NewTracerProvider ... you probably want to cache these (on the context?), rather than creating a new one every time.
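To make the first point concrete, here is a minimal sketch of the "one enriched ctx, three signals" idea, using OpenCensus stats/trace and logr purely as stand-ins; the measure and tag key are invented for illustration and this is not existing controller-runtime code:

```go
package sketch

import (
	"context"

	"github.com/go-logr/logr"
	"go.opencensus.io/stats"
	"go.opencensus.io/tag"
	"go.opencensus.io/trace"
)

var (
	// Illustrative measure and tag key only.
	reconcileCount = stats.Int64("reconcile_total", "Reconciles handled", stats.UnitDimensionless)
	keyController  = tag.MustNewKey("controller")
)

func reconcile(ctx context.Context, controller string) error {
	// Enrich the context once; all three signals pick it up from there.
	ctx, _ = tag.New(ctx, tag.Upsert(keyController, controller))

	ctx, span := trace.StartSpan(ctx, "Reconcile") // Start(ctx, "spanname")
	defer span.End()

	logr.FromContextOrDiscard(ctx).Info("reconciling") // Info(ctx, "message")
	stats.Record(ctx, reconcileCount.M(1))             // Record(ctx, measurement)
	return nil
}
```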
Some work in that regard has already been done as part of @dashpole's apiserver tracing KEP, so we should consider that as well.
Removing the good-first-issue label, as it is not quite clear whether we want this and, if so, how it would look. /remove-good-first-issue
Using the Prometheus library to collect metrics mostly works fine, but it has some limitations: #258 wants to change the way metrics are aggregated, and #297 wants to add additional handlers to the manager's HTTP endpoint.
Maybe this is a far-out idea, but I wonder if switching to OpenCensus for measurement instead of Prometheus client at this early stage would be a good idea. Tl;dr: OpenCensus is a collection of libraries in multiple languages that facilitates the measurement and aggregation of metrics in-process and is agnostic to the export format used. It doesn't replace Prometheus the service, it just replaces Prometheus the Go library. OpenCensus can export to Prometheus servers, so this is strictly an in-process change.
The OpenCensus Go library is similar to the Prometheus client, but separates the collection of metrics from their aggregation and export. This theoretically allows libraries to be instrumented without dictating how users will aggregate metrics (solving #258) and export metrics (solving #297), though default solutions can be provided for both (likely the same as today's default bucketing and Prometheus HTTP exporter).
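As a rough sketch of that separation (the measure name, buckets, and package layout here are invented for illustration, not an actual API proposal):

```go
package sketch

import (
	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

// Defined by the library: what gets measured.
var ReconcileLatencyMs = stats.Float64(
	"controller_runtime/reconcile_latency",
	"Latency of reconcile calls",
	stats.UnitMilliseconds,
)

// Defined (or overridden) by the user: how it is aggregated.
var ReconcileLatencyView = &view.View{
	Name:        "reconcile_latency",
	Measure:     ReconcileLatencyMs,
	Description: "Distribution of reconcile latencies",
	// Custom buckets instead of a library-imposed default (cf. #258).
	Aggregation: view.Distribution(1, 5, 10, 50, 100, 500, 1000),
}

func init() {
	// The user registers only the views they actually want exported.
	if err := view.Register(ReconcileLatencyView); err != nil {
		panic(err)
	}
}
```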
Here's an example from knative/pkg of defining measures and views (aggregations): https://github.com/knative/pkg/blob/53b1235c2a85e1309825bc467b3bd54243c879e6/controller/stats_reporter.go. The view is defined separately from the measure, giving the library user the ability to define their own views for library-defined metrics.
And here's an example of exporting metrics to either Stackdriver or Prometheus: https://github.com/knative/pkg/blob/225d11cc1a40c0549701fb037d0eba48ee87dfe4/metrics/exporter.go. The user of the library can export views in whatever format they wish, independently of the measures and views that are defined.
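Sketching the exporter side (namespace, project ID, and address are placeholders; the two OpenCensus contrib exporters are just examples of the swap):

```go
package sketch

import (
	"log"
	"net/http"

	ocprom "contrib.go.opencensus.io/exporter/prometheus"
	"contrib.go.opencensus.io/exporter/stackdriver"
	"go.opencensus.io/stats/view"
)

// Option A: serve the registered views to a scraping Prometheus server.
func servePrometheus() {
	pe, err := ocprom.NewExporter(ocprom.Options{Namespace: "controller_runtime"})
	if err != nil {
		log.Fatal(err)
	}
	http.Handle("/metrics", pe) // the exporter doubles as an http.Handler
	log.Fatal(http.ListenAndServe(":8080", nil))
}

// Option B: push the same views to Stackdriver, with no instrumentation changes.
func exportToStackdriver() {
	sd, err := stackdriver.NewExporter(stackdriver.Options{ProjectID: "my-project"})
	if err != nil {
		log.Fatal(err)
	}
	view.RegisterExporter(sd)
	// Remember to sd.Flush() before shutdown.
}
```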
It additionally has support for exporting traces, which IMO would be a useful debugging tool and a good use for the context arguments in the client interface (mentioned in #265). Threading the trace id into that context would give the controller author a nice overview of the entire reconcile, with spans for each request, cached or not.
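To illustrate that last point, a sketch with a made-up client interface (not controller-runtime's actual one): threading the trace through the context means every client call shows up as a child span of the reconcile.

```go
package sketch

import (
	"context"

	"go.opencensus.io/trace"
)

// Stand-in for the client interface discussed in #265.
type client interface {
	Get(ctx context.Context, key string) error
	Update(ctx context.Context, key string) error
}

func reconcile(ctx context.Context, c client, key string) error {
	ctx, span := trace.StartSpan(ctx, "Reconcile")
	defer span.End()

	// Each request becomes its own child span, whether it hits the cache
	// or the apiserver.
	getCtx, getSpan := trace.StartSpan(ctx, "client.Get")
	err := c.Get(getCtx, key)
	getSpan.End()
	if err != nil {
		return err
	}

	updCtx, updSpan := trace.StartSpan(ctx, "client.Update")
	defer updSpan.End()
	return c.Update(updCtx, key)
}
```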