Closed: albertteoh closed this 3 years ago
Thanks for the quick feedback, @yurishkuro.
I don't think it's necessary to mix metrics stuff with the primary model and QueryService, we can do them as separate files.
To confirm, you'd also like the metrics rpc methods in a separate service outside of QueryService (i.e. a new "MetricsQueryService")?
Placing data model in this repo signals long term support, I think it's premature. We can host the files directly in Jaeger repo, as we used to do with api_v2. Especially because afaik the OTEL metrics model is not fixed yet.
Okay, I can move the service/rpc methods into the Jaeger repo. Would model/proto be a good home for these?
Yes, that's correct, OTEL metrics spec is in the works with a stable version around September 2021, if I'm not mistaken.
Could we import OTEL model somehow instead of copying it?
Yes, I think we can import it. Is the best approach to add the OTEL metrics proto to https://github.com/jaegertracing/docker-protobuf, which `make proto` (from jaegertracing/jaeger) references for generating Go code?
fwiw, the main reason for copying was to work around the `Inf` marshalling to JSON, but I think there's a better way to do this without needing to modify the OTEL metrics data model.
Open-ended prom query string makes me anxious, as it exposes a very wide api. Can we not do something more specific and narrow?
One thing I could do is remove the `custom` type, which is a true open-ended prom query.
However, with `calls`, `errors` and `latency`, we'd want to give the caller the ability to search for one or more time series within the restricted set of known R.E.D. metric names (specifically, `calls_total`, `latency_count`, `latency_sum` and `latency_bucket`), which could contain any number of labels. Callers will not be able to query any other metric name.
Another possible restriction is to enforce strict equality matching of labels instead of allowing any PromQL label filter expression within the range of available operators. So instead of allowing something like `calls_total{service_name=~".*foo.*"}`, callers would only be able to do `calls_total{service_name="foo"}`.
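A sketch of what such a restricted query builder could look like in Go; the `buildSelector` helper and the metric whitelist are illustrative assumptions, not the actual implementation:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// allowedMetrics is the restricted set of R.E.D. metric names callers
// may query, mirroring the names discussed above (illustrative only).
var allowedMetrics = map[string]bool{
	"calls_total":    true,
	"latency_count":  true,
	"latency_sum":    true,
	"latency_bucket": true,
}

// buildSelector renders a PromQL vector selector using only strict
// equality matchers (=), never regex matchers (=~), for the given labels.
func buildSelector(metric string, labels map[string]string) (string, error) {
	if !allowedMetrics[metric] {
		return "", fmt.Errorf("metric %q is not queryable", metric)
	}
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic matcher order
	matchers := make([]string, 0, len(keys))
	for _, k := range keys {
		matchers = append(matchers, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return metric + "{" + strings.Join(matchers, ", ") + "}", nil
}

func main() {
	q, _ := buildSelector("calls_total", map[string]string{"service_name": "foo"})
	fmt.Println(q) // calls_total{service_name="foo"}
}
```

Because the caller supplies only a label map, regex filters like `=~".*foo.*"` simply cannot be expressed.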
To confirm, you'd also like the metrics rpc methods in a separate service outside of QueryService (i.e. a new "MetricsQueryService")?
yes.
Okay, I can move the service/rpc methods into the Jaeger repo. Would model/proto be a good home for these?
yeah, it's fine. When it's internal in the repo it's much easier to move around if we need to.
Yes, I think we can import it. Is the best approach to add otel metrics proto in https://github.com/jaegertracing/docker-protobuf, which make proto (from jaegertracing/jaeger) references for generating Go code?
Another option is to create an `otel-idl` submodule, similar to `idl` that links to jaeger-idl, and arrange the build to import from there. A submodule still gives us versioning against future changes in OTEL. Adding to docker-protobuf is a good idea (as submodules are easy to add, hard to remove), it's just a bit more friction on getting upgrades.
Re the query, I would suggest trying to formulate the exact requirements and define an IDL just for that, without resorting to free-style query strings like `calls_total{service_name="foo"}`. I think this service has very specific requirements about which time series it should support.
About using the enum - please check that enums are extensible in Protobuf. In Thrift they are not extensible without sync-upgrade of client and server.
Adding to docker-protobuf is a good idea (as submodules are easy to add, hard to remove), it's just a bit more friction on getting upgrades.
I don't imagine we would need to upgrade frequently, if at all. The existing data model offered by OpenTelemetry satisfies all our current requirements, unless of course we wish to introduce new requirements that depend on a new data model.
Re the query, I would suggest trying to formulate the exact requirements and define an IDL just for that, without resorting to free-style query strings like `calls_total{service_name="foo"}`. I think this service has very specific requirements about which time series it should support.
There are specific requirements relating to the metric name, which is restricted to just "calls" (in which "errors" belong) and "latencies".
However, there are no requirements/constraints surrounding the supported time series within these two metric groups. For example, one time series could be `calls_total{service_name=foo}`, another `calls_total{service_name=bar}` and another `calls_total{service_name=foo, operation=/foo}`; all three of these are unique and separate time series. The caller's query should be able to fetch any of these, returning 0 to many time series along with their data points.
However, I agree that we should remove the PromQL query syntax, and I propose that we adopt the same API that jaeger-query currently exposes for the FindTraces tags filter, which uses a `map<string, string>` (e.g. `{"service_name": "foo", "operation": "/foo"}`); this achieves the following goals:
About using the enum - please check that enums are extensible in Protobuf. In Thrift they are not extensible without sync-upgrade of client and server.
I tested this out and it appears to be extensible. Either the client or server side can be upgraded (new values added to an enum). If the new value(s) are used by one side while the other hasn't been upgraded, the enum int would still be received without error, but it won't be available in code, so we'll need to handle this gracefully.
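That graceful handling can be sketched as follows; `MetricType` here is a stand-in mimicking a generated protobuf enum, not actual Jaeger code:

```go
package main

import "fmt"

// MetricType mimics a generated protobuf enum (illustrative only).
// Generated Go enums are plain int32 types, so an unknown value from a
// newer schema version arrives intact as its raw integer.
type MetricType int32

const (
	MetricType_CALLS   MetricType = 0
	MetricType_ERRORS  MetricType = 1
	MetricType_LATENCY MetricType = 2
)

// handle degrades gracefully when the peer sends an enum value added in
// a newer schema version: instead of misinterpreting the unknown int,
// it reports it explicitly.
func handle(t MetricType) string {
	switch t {
	case MetricType_CALLS:
		return "calls"
	case MetricType_ERRORS:
		return "errors"
	case MetricType_LATENCY:
		return "latency"
	default:
		return fmt.Sprintf("unsupported metric type %d", int32(t))
	}
}

func main() {
	fmt.Println(handle(MetricType_CALLS))  // calls
	fmt.Println(handle(MetricType(7)))     // unsupported metric type 7
}
```

The key point is the `default` branch: without it, an un-upgraded side would silently fall through on values it does not know.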
For example, one time series could be calls_total{service_name=foo}, another calls_total{service_name=bar} and another calls_total{service_name=foo, operation=/foo}, and all three of these are unique and separate time series.
What I would suggest is have a logical description of these metrics in the service API, similar to that enum (CALLS vs. ERRORS vs. LATENCY, etc.) Then in the prom-based implementation of the API map them to specific prom metrics as needed. You can even have the implementation take a configuration file that contains the mapping from logical API terms into exact metric queries.
What I would suggest is have a logical description of these metrics in the service API, similar to that enum (CALLS vs. ERRORS vs. LATENCY, etc.) Then in the prom-based implementation of the API map them to specific prom metrics as needed. You can even have the implementation take a configuration file that contains the mapping from logical API terms into exact metric queries.
Yes, if I understand you correctly, this is precisely the plan with the enum types. Some examples of mappings from the logical API to PromQL queries (documented here):
CALLS
`http://localhost:16686/api/metrics?type='calls'&labels='service_name="currencyservice",operation="/GetConversion"'`
-> `calls_total{service_name="currencyservice", operation="/GetConversion"}`
ERRORS
`http://localhost:16686/api/metrics?type='errors'&labels='service_name="currencyservice",operation="/GetConversion"'`
-> `calls_total{service_name="currencyservice", operation="/GetConversion", status_code="ERROR"}`
LATENCY
`http://localhost:16686/api/metrics?type='latency'&labels='service_name="currencyservice",operation="/GetConversion"'`
-> `latency_bucket{service_name="currencyservice", operation="/GetConversion"}`
-> `latency_count{service_name="currencyservice", operation="/GetConversion"}`
-> `latency_sum{service_name="currencyservice", operation="/GetConversion"}`
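The logical-to-PromQL mapping above could be coded roughly as follows; the function and parameter names are illustrative assumptions, not taken from the actual PR:

```go
package main

import "fmt"

// queriesFor maps a logical metric type to the PromQL selectors shown in
// the examples above. labelMatchers is assumed to be a pre-built,
// strict-equality matcher list (built by a helper not shown here).
func queriesFor(metricType, labelMatchers string) []string {
	switch metricType {
	case "calls":
		return []string{"calls_total{" + labelMatchers + "}"}
	case "errors":
		// Errors are a subset of calls, filtered by status code.
		return []string{`calls_total{` + labelMatchers + `, status_code="ERROR"}`}
	case "latency":
		// Latency needs all three histogram series.
		return []string{
			"latency_bucket{" + labelMatchers + "}",
			"latency_count{" + labelMatchers + "}",
			"latency_sum{" + labelMatchers + "}",
		}
	default:
		return nil
	}
}

func main() {
	for _, q := range queriesFor("latency", `service_name="currencyservice"`) {
		fmt.Println(q)
	}
}
```

A configuration file, as suggested, would replace the hard-coded strings in each branch with user-supplied templates.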
Some additional comments:
`labels` are synonymous with how we do tag-based searches in Jaeger.
Please ignore the last couple of pushes, they were accidental.
@yurishkuro I've removed the copied OTEL metrics wire format and imported it from the newly created `MetricsQueryService` as discussed.
As per the recommendation in https://github.com/open-telemetry/opentelemetry-proto:
The proto files can be consumed as GIT submodule or copied over and built directly in the consumer project.
Since we prefer to import rather than copy the data model, I've tested the approach of creating a submodule referencing opentelemetry-proto and managed to get it working; albeit with a workaround required for the import references, which I'll show and discuss in a subsequent PR.
Setting this PR to draft as I'll need to think about how users can take advantage of promQL's functions such as quantile calculations and rates, which I think would be key requirements for a potential future Jaeger UI.
What I would suggest is have a logical description of these metrics in the service API, similar to that enum (CALLS vs. ERRORS vs. LATENCY, etc.)
@yurishkuro I've applied your suggestion to align the API more closely to the requirements as part of the aggregated views on R.E.D. metrics.
The API now provides "pre-packaged" promQL queries for specific use cases via dedicated rpc functions/URL paths, instead of placing the onus on the caller to build the queries themselves; so it looks more explicit (while being configurable through optional parameters) and intuitive, and has the added benefit of decoupling it from prometheus; thank you for the advice!
Please let me know what you think and if it could be improved further.
Placing data model in this repo signals long term support, I think it's premature. We can host the files directly in Jaeger repo, as we used to do with api_v2. Especially because afaik the OTEL metrics model is not fixed yet.
I may have misunderstood you here, thinking that if we were to copy the OTEL metrics model, that you'd prefer just the copied data model in the Jaeger repo, and leave the MetricsQueryService here in the jaeger-idl repo given this is the home of the existing jaeger-query API definitions.
However, if you prefer the changes in this PR to be in the Jaeger repo as well as it isn't yet stable, please let me know and I'll look into moving it there.
Signed-off-by: albertteoh albert.teoh@logz.io
Which problem is this PR solving?
Short description of the changes
The API specification for the `jaeger-query` `/metrics` endpoint, containing sub-paths leading to specific R.E.D. metrics aggregations.
Design goals:
Sample HTTP requests; for each of these examples, optional parameters can be added, such as `endTs=<posix millis>&lookback=<duration millis>&step=<duration millis>&rate=<duration millis>`:
curl -i http://localhost:16686/api/metrics/latencies?quantile=0.95
curl -i http://localhost:16686/api/metrics/calls
curl -i http://localhost:16686/api/metrics/errors
curl -i http://localhost:16686/api/metrics/latencies/myservice?quantile=0.95
curl -i http://localhost:16686/api/metrics/calls/myservice
curl -i http://localhost:16686/api/metrics/errors/myservice
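For illustration, a small Go helper that assembles request URLs like the curl examples above; the `metricsURL` function itself is hypothetical, though the paths and parameter names follow the examples:

```go
package main

import (
	"fmt"
	"net/url"
)

// metricsURL builds a /api/metrics request URL. metricType is one of
// "calls", "errors" or "latencies"; service is optional; params holds
// optional query parameters such as quantile, endTs, lookback, step, rate.
func metricsURL(metricType, service string, params map[string]string) string {
	p := "/api/metrics/" + metricType
	if service != "" {
		p += "/" + url.PathEscape(service)
	}
	q := url.Values{}
	for k, v := range params {
		q.Set(k, v)
	}
	u := url.URL{
		Scheme:   "http",
		Host:     "localhost:16686",
		Path:     p,
		RawQuery: q.Encode(),
	}
	return u.String()
}

func main() {
	fmt.Println(metricsURL("latencies", "myservice", map[string]string{"quantile": "0.95"}))
	// http://localhost:16686/api/metrics/latencies/myservice?quantile=0.95
}
```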
Import OTEL's metrics wire format, with the addition of `BucketUpperBound`; motivation added as a comment in the protobuf.