interledger / rafiki

An open-source, comprehensive Interledger service for wallet providers, enabling them to provide Interledger functionality to their users.
https://rafiki.dev/
Apache License 2.0

Investigate `backend` GraphQL resolvers histogram #2802

Closed: mkurapov closed this 2 months ago

mkurapov commented 3 months ago

(Blocked by #2808)

Context

We need to determine how long each of our GraphQL APIs takes, so we can estimate how performant each operation is and get a relative idea of how long a payment might take.

There are a few options for doing this: either manually instrumenting a histogram per resolver, or, now that traces are being added in #2808, potentially generating metrics automatically from Tempo via https://grafana.com/docs/tempo/latest/metrics-generator/

Tempo's metrics generation should be able to produce histograms.

This ticket is to investigate how we can do that and which approach is easier.
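For comparison, the first option (manually instrumenting a histogram per resolver) could look roughly like the sketch below. It is a dependency-free TypeScript illustration, not Rafiki's telemetry service API: `LatencyHistogram`, `getHistogram`, `withTiming`, and the bucket boundaries are all hypothetical. In a real setup the recording would likely go through the telemetry service's metrics pipeline rather than an in-memory map.

```typescript
// Hypothetical sketch: one in-memory latency histogram per GraphQL resolver.
// Names and bucket boundaries are illustrative, not Rafiki's actual API.

// Upper bounds (ms) of the finite buckets; an extra slot catches everything above.
const BOUNDARIES_MS = [5, 10, 25, 50, 100, 250, 500, 1000]

class LatencyHistogram {
  // counts[i] = observations in bucket i (non-cumulative);
  // the last slot is the overflow (+Inf) bucket.
  readonly counts: number[] = new Array(BOUNDARIES_MS.length + 1).fill(0)
  sum = 0
  count = 0

  record(valueMs: number): void {
    // First bucket whose upper bound is >= the value ("le" semantics).
    let i = BOUNDARIES_MS.findIndex((bound) => valueMs <= bound)
    if (i === -1) i = BOUNDARIES_MS.length // overflow bucket
    this.counts[i]++
    this.sum += valueMs
    this.count++
  }
}

// One histogram per resolver name, e.g. "mutation CreateQuote".
const histograms = new Map<string, LatencyHistogram>()

function getHistogram(name: string): LatencyHistogram {
  let h = histograms.get(name)
  if (!h) {
    h = new LatencyHistogram()
    histograms.set(name, h)
  }
  return h
}

// Wraps an async resolver so every call records its wall-clock duration,
// including calls that throw (the `finally` block runs in both cases).
function withTiming<Args extends unknown[], R>(
  name: string,
  resolver: (...args: Args) => Promise<R>
): (...args: Args) => Promise<R> {
  return async (...args: Args): Promise<R> => {
    const start = Date.now()
    try {
      return await resolver(...args)
    } finally {
      getHistogram(name).record(Date.now() - start)
    }
  }
}
```

The cost of this approach is that every resolver has to be wrapped explicitly, which is what deriving metrics from existing traces would avoid.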

BlairCurrey commented 2 months ago

When running with telemetry and viewing the local dashboard after firing off some GQL requests, I see metrics like http_client_duration_milliseconds_bucket. We could use a metric like this to build histograms or get performance at different percentiles.


However, I don't see these metrics for the GraphQL resolvers, and I'm not sure why; that's what I'm currently investigating. I tried changing the config for the GraphQL instrumentation, but that didn't change anything. I'd expect the instrumentation to produce similar metrics, in which case we wouldn't need to add explicit histogram metrics using our telemetry service.
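The `_bucket` series mentioned above are cumulative Prometheus-style histogram buckets (each bucket counts observations less than or equal to its `le` bound), so percentiles can only be estimated from them. The sketch below shows the usual estimate: linear interpolation inside the bucket that contains the target rank, similar in spirit to PromQL's `histogram_quantile()`. The `Bucket` shape and `histogramQuantile` name are hypothetical, and it assumes non-negative latencies so the first bucket starts at 0.

```typescript
// Cumulative bucket as exposed by a Prometheus-style `_bucket` series:
// `count` is the number of observations <= `le` (the bucket's upper bound).
interface Bucket {
  le: number // use Infinity for the "+Inf" bucket
  count: number
}

// Estimates the q-quantile (0 <= q <= 1) by linear interpolation within the
// bucket containing rank q * total. Assumes values are non-negative.
function histogramQuantile(q: number, buckets: Bucket[]): number {
  if (q < 0 || q > 1) throw new RangeError("q must be in [0, 1]")
  const sorted = [...buckets].sort((a, b) => a.le - b.le)
  if (sorted.length < 2 || sorted[sorted.length - 1].le !== Infinity) {
    throw new Error("need at least one finite bucket plus a +Inf bucket")
  }
  const total = sorted[sorted.length - 1].count
  if (total === 0) return NaN // no observations yet

  const rank = q * total
  // First bucket whose cumulative count reaches the rank; never -1, because
  // the +Inf bucket's count equals `total`.
  const i = sorted.findIndex((b) => b.count >= rank)
  if (i === sorted.length - 1) {
    // Rank falls in the +Inf bucket: all we know is "above the highest
    // finite bound", so return that bound.
    return sorted[sorted.length - 2].le
  }
  const lower = i === 0 ? 0 : sorted[i - 1].le
  const countBelow = i === 0 ? 0 : sorted[i - 1].count
  const inBucket = sorted[i].count - countBelow
  // Only possible when q = 0 and the first bucket is empty.
  if (inBucket === 0) return lower
  return lower + (sorted[i].le - lower) * ((rank - countBelow) / inBucket)
}
```

For example, with cumulative counts of 50, 80, 95, and 100 at `le` = 10, 25, 50, and 100 ms (and 100 at +Inf), this estimates the median as 10 ms and the 65th percentile as about 17.5 ms. The estimate is only as precise as the bucket boundaries the instrumentation chose.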

BlairCurrey commented 2 months ago

OK, metric generation from spans is working here: https://github.com/interledger/rafiki/tree/bc/2802/investigate-span-metrics-generator

We can use the traces_spanmetrics_latency_bucket metric and span_label (mutation CreateQuote, for example) to visualize the resolver durations as histograms.

To follow up on my previous comment: for whatever reason, HTTPInstrumentation seems to collect these metrics by default, whereas the GraphQL instrumentation does not.
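To plot a percentile per resolver instead of the raw buckets, a Grafana panel would typically run a PromQL query over `traces_spanmetrics_latency_bucket`. The helper below is a hypothetical sketch that builds such a query string. It assumes the span name is exposed under the `span_name` label (the default dimension in Tempo's span metrics, which may be what "span_label" refers to above) and a 5-minute rate window; both are parameters, so adjust them to the actual setup.

```typescript
// Hypothetical helper: builds a PromQL query estimating a latency percentile
// for one GraphQL operation from Tempo's span-metrics histogram.
// Assumption: the span name lives in the `span_name` label by default.

function escapeLabelValue(value: string): string {
  // PromQL double-quoted strings need backslashes and double quotes escaped.
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"')
}

function resolverLatencyQuery(
  quantile: number,
  spanName: string,
  window = "5m",
  labelName = "span_name"
): string {
  if (quantile < 0 || quantile > 1) {
    throw new RangeError("quantile must be in [0, 1]")
  }
  const selector =
    `traces_spanmetrics_latency_bucket{${labelName}="${escapeLabelValue(spanName)}"}`
  // rate() turns the cumulative bucket counters into per-second increases;
  // summing by `le` (and the span label) merges series from different
  // instances before histogram_quantile() interpolates within the buckets.
  return `histogram_quantile(${quantile}, sum by (le, ${labelName}) (rate(${selector}[${window}])))`
}
```

For example, `resolverLatencyQuery(0.95, "mutation CreateQuote")` returns `histogram_quantile(0.95, sum by (le, span_name) (rate(traces_spanmetrics_latency_bucket{span_name="mutation CreateQuote"}[5m])))`.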