libp2p / go-libp2p

libp2p implementation in Go
MIT License
6.05k stars 1.07k forks source link

Metrics instrumentation #522

Closed vyzo closed 1 year ago

vyzo commented 5 years ago

We need to instrument the various subsystems for metrics collection. This issue is here to track progress towards instrumentation.

Subsystems that need to be instrumented:

cc @Stebalien @raulk @bigs @anacrolix

anacrolix commented 5 years ago

What are the metrics interfaces to be supported? Where is this recorded?

Stebalien commented 5 years ago

We use prometheus in ipfs and it works pretty well. We use the interfaces in go-metrics-interface and the implementation in go-metrics-prometheus so we can swap out metrics implementations.

bigs commented 5 years ago

yeah, let's continue down the prometheus/grafana train. as a note, i'd love to see our metrics interface extended to support "labels", arbitrary KV pairs. these labels will allow us to get deeper insight into our relatively abstract gauges/counters/timers. two things i'd love to see instrumented off the bat:

anacrolix commented 5 years ago

See also https://github.com/libp2p/go-libp2p-kad-dht/issues/304, https://github.com/libp2p/go-libp2p-kad-dht/pull/297, https://github.com/libp2p/go-libp2p-kad-dht/issues/256, https://github.com/libp2p/go-libp2p-daemon/pull/92 and a small survey of metrics package choices by @bigs.

anacrolix commented 5 years ago

I was initially in favour of OpenCensus, however after experimentation, here are my observations on it:

raulk commented 5 years ago

OC includes tracing: I'm not sure that's actually a good thing, and perhaps we can ignore it. I'm not sure all this stuff needs to be in a single implementation.

One advantage of using OC for both metrics and traces is that it automatically saves "exemplar" traces for values in each histogram bucket.

I also asked for a clarification of the direction they're taking with the whole metrics and stats duality: https://github.com/census-instrumentation/opencensus-specs/issues/253. You didn't mention this, but the opacity there is unnerving, and is definitely a drawback for me.

No support yet for Gauges, or GaugeFunc in particular. It's coming,

It's not obvious from the issue that they understood you were asking about gauge functions. I've explicitly asked in a follow-up question.


Representing {Phantom}Drift (libp2p visualiser), we can make the metrics export for the introspection protocol work with whichever system.

I do think that lazily computed gauge functions are important to us, and that doesn't even appear to be on the roadmap or sensical for OC because they aspire to be primarily push-based (encouraging use of OpenCensus Agent/Service). Not sure if I'd tag them as a dealbreaker, but certainly close to it.

Also, the maturity and stability of Prometheus should not go undervalued. OpenCensus has to mature a great deal, I'm sure there will be significant API changes (as evidenced by the stats vs. metrics thing now), and if we choose it, it would be more of a risky bet on the grounds of seeing potential that we're willing to support with time and effort, versus choosing the proven, robust option.

OTOH, if we want to incorporate tracing, automatic "examplars" are a nicety we only get with OC, but not the panacea either.

raulk commented 5 years ago

OC does support lazy metrics calculation via thunks: https://github.com/census-instrumentation/opencensus-go/issues/1080#issuecomment-477407665

raulk commented 5 years ago

OC Metrics/Stats difference here: https://github.com/census-instrumentation/opencensus-specs/issues/253#issuecomment-477400658

lanzafame commented 5 years ago

No support for histograms. Maybe that's not such a problem, they're sort of not recommended, but it's a thing.

An OpenCensus Distribution provides you with the data buckets to create a histogram, example prometheus query histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

Prometheus has much better community penetration. There are various packages that expose things for you through Prometheus, which would require starting all over with OC, or juggling multiple metrics systems to try to get everything exported correctly.

Please link to something that would take no more than 2 hours to port from Prometheus to OpenCensus, as evidence of this point. On top of this, you can still register these tools when using OC. Snippet from cluster:

    // setup Prometheus
    registry := prom.NewRegistry()
    goCollector := prom.NewGoCollector()
    procCollector := prom.NewProcessCollector(prom.ProcessCollectorOpts{})
    registry.MustRegister(goCollector, procCollector, prommod.NewCollector("ipfs-cluster"))
    pe, err := prometheus.NewExporter(prometheus.Options{
        Namespace: "cluster",
        Registry:  registry,
    })
    if err != nil {
        return err
    }

The OC form of things is much longer winded, and favours exposing a lot more detail. (To justify separate metrics from views, we felt we needed a separate package so as not to "waste" them. At least all views needed to be exposed, we couldn't just provide a bundle, or automatically register them like in Prometheus. There are also interactions with contexts, various forms of tag mutators, multiple packages etc.)

This may be a pain for you as developers of an upstream package, but that is what makes the metrics of your package documented and accessible to your downstream users.

primarily push-based

Regardless of the OC Agent/Service aspect, which is secondary to this discussion, OC is neither push nor pull as that is based on the backend that you configure it to use. If you configure prometheus as the metrics backend, you still end up with the metrics endpoint that prometheus then scrapes.

it would be more of a risky bet on the grounds of seeing potential that we're willing to support with time and effort

I don't think OC is going to need your support, its come out of Google and now has Microsoft support. Not a guarantee of success by any means but this isn't some vapourware project that we are talking about.

choosing the proven, robust option

Yes, it is proven, and robust hence work like OpenMetrics, a Cloud Native Foundation Sandbox project.

Edit: removed points addressed between refreshes of issue. Added code snippet.

raulk commented 5 years ago

@lanzafame

I don't think OC is going to need your support, its come out of Google and now has Microsoft support.

This sounds like appeal to authority, and it's a poor interpretation of my argument. They won't need us, but it's undeniable that we are investing more time and effort in wrapping our heads around OC than around Prometheus. The takeaway for several of us is that it's immature and heavily in evolution, and the communication around ongoing changes isn't great (e.g. stats vs. metrics thing). So yes, it's an investment from our side to keep up with their evolution, much more than just going with stable Prometheus.

Yes, it is proven, and robust hence work like OpenMetrics, a Cloud Native Foundation Sandbox project.

Can you point me to a project that relies – or builds upon – OpenMetrics?

Just to put some magnitude on this. Number of packages importing each:

raulk commented 5 years ago

So far, these are my conclusions. With a few details clarified (e.g. lazy gauges), they appear to be at feature parity. The devil might be in the details, though.

Note that this evaluation is for the METRICS functionality. For tracing, I think we're onboard with using OpenCensus.

Prometheus:

👍 Safe bet, mature. Widely used in Go. Mature instrumentation for Node, Rust, Java, Python, C++.

👎 Ties us to a specific metrics backend (even if there are converters out there), export format is textual.

OpenCensus:

👍 Open standard not tied to any particular solution (good for libp2p; as a library we should not impose a world view onto our users); segways synergistically into tracing; push support.

👎 In heavy evolution, little traction since 2018, heavily WIP, configuring to export to Prometheus in a specific way proved debatably challenging, verbose configuration, not great DX due to lack of docs and opacity around changes (e.g. stats vs. metrics thing), will require active participation and involvement upstream. Multi-language support still in the works (many libraries are alpha; not good for libp2p). Seems like a leap of faith (which we might want to take); all backends supported by OpenCensus already support Prometheus scraping; key features we want to use are still experimental and with big fat warnings of API instability.


Let's hear another round of feedback, then propose a way forward by Friday. Whichever one it is, let's set a checkpoint in 1 month to evaluate how it's playing out. We will mark this feature as experimental to convey to our users things are subject to change under the hood.

If in 1 month we find that we're subjectively unhappy with the choice, the system gets in our way, certain features don't work as advertised, DX is bad, we feel frustrated, etc. We can either get involved upstream, or make a switch if we lack the capacity. I doubt we'll have any of these issues with Prometheus.


EDIT: added new pros/cons; clarified this evaluation is for the metrics aspect.

lanzafame commented 5 years ago

This sounds like appeal to authority, and it's a poor interpretation of my argument.

I clearly added a caveat to clarify this wasn't an appeal to authority but a rebuttal to the implication that OC was some small-time project.

Yes, it is proven, and robust hence work like OpenMetrics, a Cloud Native Foundation Sandbox project.

Can you point me to a project that relies – or builds upon – OpenMetrics?

You misunderstood, I am agreeing that Prometheus is robust hence OpenMetrics taking and extending their data model, with OC implementing the more generalised standard.

Just to put some magnitude on this. Number of packages importing each:

OpenCensus stats package: 137 importers. Prometheus: 8047 importers.

To add context to your numbers:

OpenCensus released: Janurary 2018 Prometheus release: November 2012

configuring to export to Prometheus in a specific way proved debatably challenging

Evidence of this please? I had no issue doing so in Cluster...

(e.g. stats vs. metrics thing)

Can you please explain what the issue is here? I don't have any trouble understanding the difference.

raulk commented 5 years ago

Just to put some magnitude on this. Number of packages importing each:

OpenCensus stats package: 137 importers. Prometheus: 8047 importers.

To add context to your numbers:

OpenCensus released: Janurary 2018 Prometheus release: November 2012

Can't say I'm impressed by the traction. EDIT: Not implying it's not a useful investment, I'm just pointing out we'd be very early adopters, and we don't have a seat at the table.

configuring to export to Prometheus in a specific way proved debatably challenging

Evidence of this please? I had no issue doing so in Cluster...

@anacrolix, could you point to concrete code samples? I was summarising the first bullet point from your feedback.

(e.g. stats vs. metrics thing)

Can you please explain what the issue is here? I don't have any trouble understanding the difference.

Contextualising the citation:

not great DX due to lack of docs and opacity around changes (e.g. stats vs. metrics thing)

What I'm saying is that these are two disjoint ways of generating metrics, and if I were an OC user that had invested into complying with their stats API, I'd be blind-sided. I understand the difference now after asking and digging into source: their docs aren't clear at all, and this area is not spec'ed. That's what I mean with my remark.

anacrolix commented 5 years ago

No support for histograms. Maybe that's not such a problem, they're sort of not recommended, but it's a thing.

I meant summaries, which was in the linked document.

Please link to something that would take no more than 2 hours to port from Prometheus to OpenCensus, as evidence of this point. On top of this, you can still register these tools when using OC. Snippet from cluster:

That's not a good argument, as much as I love implementing stuff.

Regardless of the OC Agent/Service aspect, which is secondary to this discussion, OC is neither push nor pull as that is based on the backend that you configure it to use. If you configure prometheus as the metrics backend, you still end up with the metrics endpoint that prometheus then scrapes.

https://godoc.org/go.opencensus.io/stats/view#SetReportingPeriod and https://github.com/anacrolix/go-libp2p-dht-tool/blob/feat/oc-metrics/main.go#L374

anacrolix commented 5 years ago

@raulk there are a lot of code links further down in my summary. Regarding the overhead, it took me most of a day to convert a small number of metrics, from OC to Prometheus. Here are some conversions from OC metrics to Prometheus for go-libp2p-dht-tool, and for go-libp2p-kad-dht. Note that the Prometheus side of things has a few more features in the DHT, as I've continued adding metrics there. Some of the file diffs are collapsed.

raulk commented 5 years ago

After carefully weighing pros and cons, and speaking to a number of stakeholders (incl. @frrist and @scout today) for alignment, I recommend to go with OpenCensus for both metrics and tracing, and really strive to make it work.

We talked about a 1-month evaluation period, but I want us to try as hard as possible to make this a success. For me, that pathway is an ultimate backstop/escape hatch, and we'll need to justify it clearly and quantitively if we use that card.

Here are my closing notes:


Thank you all for the constructive input! Let's instrument all the things! 💪

bigs commented 5 years ago

thanks for collecting all of this, @raulk! sounds good to me!

lanzafame commented 5 years ago

@raulk on the k8s front, searching the org does turn up results, but only on the tracing front, granted k8s just rebuilt their metrics collection tooling, heapster -> metrics-server.

Also, I have mentioned this elsewhere but to reiterate, I am more than happy to help out on any rough edges and getting this implemented smoothly across all the orgs.

Thanks for diving deep on this, sorry if the discussion got too passionate at times 🙏

marten-seemann commented 1 year ago

Closing. The current metrics effort is tracked in #1356.