Add metric def and metrics query endpoints

jaegertracing / jaeger-idl

A set of shared data model definitions used by Jaeger components.

http://jaegertracing.io/

Apache License 2.0

82 stars 72 forks source link

Add metric def and metrics query endpoints #73

Closed albertteoh closed 3 years ago

albertteoh commented 3 years ago

Signed-off-by: albertteoh albert.teoh@logz.io

Which problem is this PR solving?

The first of multiple PRs addressing: https://github.com/jaegertracing/jaeger/issues/2736

Short description of the changes

The API specification for the jaeger-query /metrics endpoint, containing sub paths leading to specific R.E.D metrics aggregations.
Design goals:
- Align the API to the requirements for the use case which are to support retrieval of R.E.D metrics aggregated on a per-service level and on a per-operation level (when a service_name is selected).
- Aim for a simplicity and intuitiveness.
Sample HTTP requests; for each of these examples, optional parameters can be added such as endTs=<posix millis>&lookback=<duration millis>&step=<duration millis>&rate=<duration millis>:
- curl -i http://localhost:16686/api/metrics/latencies?quantile=0.95
- curl -i http://localhost:16686/api/metrics/calls
- curl -i http://localhost:16686/api/metrics/errors
- curl -i http://localhost:16686/api/metrics/latencies/myservice?quantile=0.95
- curl -i http://localhost:16686/api/metrics/calls/myservice
- curl -i http://localhost:16686/api/metrics/errors/myservice
Import OTEL's metrics wire format, with an addition of BucketUpperBound; motivation added as comment in protobuf.
- Rationale: Considered using OpenMetrics' protobuf wire format. However, given OpenTelemetry's goal of supporting both Prometheus and StatsD the hope is that adopting OpenTelemetry's data model will enable us to more easily support non-Prometheus compatible metrics backends in future with less effort and lower likelihood of breaking changes; albeit at the cost of less stability in the short-term. Having said that, OpenTelemetry's spec is more than adequate for our relatively simple requirements of using just 2 well-supported metrics types: Counters and Histograms. As such, I don't foresee a dire need to maintain parity with OpenTelemetry over the coming months as it will no doubt continue evolving towards a more stable API and data model.
- Taking the import instead of copy approach of referencing otel's metrics proto to reduce maintenance overhead and chances for diverging versions.

albertteoh commented 3 years ago

Thanks for the quick feedback, @yurishkuro.

I don't think it's necessary to mix metrics stuff with the primary model and QueryService, we can do them as separate files.

To confirm, you'd also like the metrics rpc methods in a separate service outside of QueryService (i.e. a new "MetricsQueryService")?

Placing data model in this repo signals long term support, I think it's premature. We can host the files directly in Jaeger repo, as we used to do with api_v2. Especially because afaik the OTEL metrics model is not fixed yet.

Okay, I can move the service/rpc methods into the Jaeger repo. Would model/proto be a good home for these?

Yes, that's correct, OTEL metrics spec is in the works with a stable version around September 2021, if I'm not mistaken.

Could we import OTEL model somehow instead of copying it?

Yes, I think we can import it. Is the best approach to add otel metrics proto in https://github.com/jaegertracing/docker-protobuf, which make proto (from jaegertracing/jaeger) references for generating Go code?

fwiw, the main reason for copying was to workaround the Inf marshalling to JSON, but I think there's a better way to do this without needing to modify the OTEL metrics data model.

Open-ended prom query string makes me anxious, as it exposes a very wide api. Can we not do something more specific and narrow?

One thing I could do is remove the custom type, which is a true open-ended prom query.

However, with calls, errors and latency, we'd want to give the caller the ability to search for one or more timeseries within the restricted set of known R.E.D. metric names (specifically, calls_total, latency_count, latency_sum and latency_bucket), which could contain any number of labels. Callers will not be able to query any other metric name.

Another possible restriction is to enforce strict equality matching of labels instead of allowing any promQL label filter expression within the range of available operators.

So instead of allowing something like calls_total{service_name=~".*foo.*"}, callers would only be able to do calls_total{service_name="foo"}

yurishkuro commented 3 years ago

To confirm, you'd also like the metrics rpc methods in a separate service outside of QueryService (i.e. a new "MetricsQueryService")?

yes.

yurishkuro commented 3 years ago

Okay, I can move the service/rpc methods into the Jaeger repo. Would model/proto be a good home for these?

yeah, it's fine. When it's internal in the repo it's much easier to move around if we need to.

Yes, I think we can import it. Is the best approach to add otel metrics proto in https://github.com/jaegertracing/docker-protobuf, which make proto (from jaegertracing/jaeger) references for generating Go code?

Another option is to create an otel-idl submodule, similar to idl that links to jaeger-idl, and arrange the build to import from there. Submodule still gives us versioning against future changes in OTEL. Adding to docker-protobuf is a good idea (as submodules are easy to add, hard to remove), it's just a bit more friction on getting upgrades.

Re the query, I would suggest trying to formulate the exact requirements and define an IDL just for that, without resorting to free-style query strings like calls_total{service_name="foo"}. I think this service has very specific requirements about which time series it should support.

About using the enum - please check that enums are extensible in Protobuf. In Thrift they are not extensible without sync-upgrade of client and server.

albertteoh commented 3 years ago

Adding to docker-protobuf is a good idea (as submodules are easy to add, hard to remove), it's just a bit more friction on getting upgrades.

I don't imagine we would need to upgrade frequently, if at all. The existing data model offered by OpenTelemetry satisfies all our current requirements, unless of course we wish to introduce new requirements that depend on a new data model.

Re the query, I would suggest trying to formulate the exact requirements and define an IDL just for that, without resorting to free-style query strings like calls_total{service_name="foo"}. I think this service has very specific requirements about which time series it should support.

There are specific requirements relating to the metric name restricted to just "calls" (in which "errors" belong) and "latencies". However there are no requirements/constraints surrounding supported time series within these two metric groups. For example, one time series could be calls_total{service_name=foo}, another calls_total{service_name=bar} and another calls_total{service_name=foo, operation=/foo}, and all three of these are unique and separate time series. The caller's query should be able to fetch any of these, returning 0 to many time series along with their data points.

However, I agree that we should remove the promQL query syntax, and I propose that we adopt the same API as jaeger-query currently exposes for FindTraces tags filter that uses a map<string, string> (e.g. {"service_name": "foo", "operation": "/foo"}) which achieves the following goals:

Still satisfies our functional requirement of allowing users to filter by metric dimensions/labels.
Constrains API.
Makes the API explicit and easier to infer what values to pass rather than just a vague "string".
Consistent experience with an existing jaeger-query API, which is tried and tested.
Agnostic to a particular tech/query language (prometheus/promQL) and defer the construction of the actual query to the implementing reader layer.

About using the enum - please check that enums are extensible in Protobuf. In Thrift they are not extensible without sync-upgrade of client and server.

I tested this out and it appears to be extensible. Either client or server side can be upgraded (new fields added to an enum). If the new field/s are used by either side while the other hasn't been upgraded, the enum int would still be received without error, but not available in code and so we'll need to handle this gracefully.

yurishkuro commented 3 years ago

For example, one time series could be calls_total{service_name=foo}, another calls_total{service_name=bar} and another calls_total{service_name=foo, operation=/foo}, and all three of these are unique and separate time series.

What I would suggest is have a logical description of these metrics in the service API, similar to that enum (CALLS vs. ERRORS vs. LATENCY, etc.) Then in the prom-based implementation of the API map them to specific prom metrics as needed. You can even have the implementation take a configuration file that contains the mapping from logical API terms into exact metric queries.

albertteoh commented 3 years ago

What I would suggest is have a logical description of these metrics in the service API, similar to that enum (CALLS vs. ERRORS vs. LATENCY, etc.) Then in the prom-based implementation of the API map them to specific prom metrics as needed. You can even have the implementation take a configuration file that contains the mapping from logical API terms into exact metric queries.

Yes, if I understand you correctly, this is precisely the plan with the enum types. Some examples of mappings from logical API to promQL queries (documented here):

CALLS http://localhost:16686/api/metrics?type='calls'&labels='service_name="currencyservice",operation="/GetConversion"' -> calls_total{service_name="currencyservice", operation="/GetConversion"}

ERRORS http://localhost:16686/api/metrics?type='errors'&labels='service_name="currencyservice",operation="/GetConversion"' -> calls_total{service_name="currencyservice", operation="/GetConversion", status_code="ERROR"}

LATENCY http://localhost:16686/api/metrics?type='latency'&labels='service_name="currencyservice",operation="/GetConversion"' -> latency_buckets{service_name="currencyservice", operation="/GetConversion"} -> latency_count{service_name="currencyservice", operation="/GetConversion"} -> latency_sum{service_name="currencyservice", operation="/GetConversion"}

Some additional comments:

The key-value labels are synonymous with how we do tag-based searches in Jaeger.
It's not a simple 1:1 mapping between enum and metrics and so I'm not sure about making it configurable. The goal is to make the API as logically simple as possible and defer the more complex nuances of fetching the correct metrics to the underlying implementation.

albertteoh commented 3 years ago

Please ignore the last couple of pushes, they were accidental.

albertteoh commented 3 years ago

@yurishkuro I've removed the the copied otel metrics wire format and imported it from the newly created MetricsQueryService as discussed.

As per the recommendation in https://github.com/open-telemetry/opentelemetry-proto:

The proto files can be consumed as GIT submodule or copied over and built directly in the consumer project.

Since we prefer to import rather than copy the data model, I've tested the approach of creating a submodule referencing opentelemetry-proto and managed to get it working; albeit with a workaround required for the import references, which I'll show and discuss in a subsequent PR.

albertteoh commented 3 years ago

Setting this PR to draft as I'll need to think about how users can take advantage of promQL's functions such as quantile calculations and rates, which I think would be key requirements to a potential future Jaeger UI.

albertteoh commented 3 years ago

What I would suggest is have a logical description of these metrics in the service API, similar to that enum (CALLS vs. ERRORS vs. LATENCY, etc.)

@yurishkuro I've applied your suggestion to align the API more closely to the requirements as part of the aggregated views on R.E.D. metrics.

The API now provides "pre-packaged" promQL queries for specific use cases via dedicated rpc functions/URL paths, instead of placing the onus on the caller to build the queries themselves; so it looks more explicit (while being configurable through optional parameters) and intuitive, and has the added benefit of decoupling it from prometheus; thank you for the advice!

Please let me know what you think and if it could be improved further.

albertteoh commented 3 years ago

Placing data model in this repo signals long term support, I think it's premature. We can host the files directly in Jaeger repo, as we used to do with api_v2. Especially because afaik the OTEL metrics model is not fixed yet.

I may have misunderstood you here, thinking that if we were to copy the OTEL metrics model, that you'd prefer just the copied data model in the Jaeger repo, and leave the MetricsQueryService here in the jaeger-idl repo given this is the home of the existing jaeger-query API definitions.

However, if you prefer the changes in this PR to be in the Jaeger repo as well as it isn't yet stable, please let me know and I'll look into moving it there.