vercel / otel

OTEL tracing for Vercel
https://vercel.com/docs/observability/otel-overview
28 stars 8 forks source link

Instrumentation causes high cardinality when converted to metrics #120

Open cedricziel opened 3 weeks ago

cedricziel commented 3 weeks ago

What

This package uses span names such as GET /_next/static/webpack/4330e83422f91a86.webpack.hot-update.json

OpenTelemetry explicitly specifies this here:

HTTP span names SHOULD be {method} {target} if there is a (low-cardinality) target available. If there is no (low-cardinality) {target} available, HTTP span names SHOULD be {method}.

(see below for the exact definition of the [{method}](https://opentelemetry.io/docs/specs/semconv/http/http-spans/#method-placeholder) and [{target}](https://opentelemetry.io/docs/specs/semconv/http/http-spans/#target-placeholder) placeholders).

The {method} MUST be {http.request.method} if the method represents the original method known to the instrumentation. In other cases (when {http.request.method} is set to _OTHER), {method} MUST be HTTP.

The {target} SHOULD be one of the following:

[http.route](https://opentelemetry.io/docs/specs/semconv/attributes-registry/http/) for HTTP Server spans
[url.template](https://opentelemetry.io/docs/specs/semconv/attributes-registry/url/) for HTTP Client spans if enabled and available (Experimental)
Other value MAY be provided through custom hooks at span start time or later.
Instrumentation MUST NOT default to using URI path as a {target}.

Why

Any downstream consumer that generates metrics from traces will experience a metrics spike if the span names are high-cardinality. This is, as metric generators use the span.name and transform it into a label on the metric.

This is a severe issue and potentially has massive cost implications for many users

Suggested solution

It is common practice in the Otel instrumentation libraries to work with placeholders and templatized URLs.

Examples for next.js:

GET /_next/static/webpack/4330e83422f91a86.webpack.hot-update.json

could become

GET /_next/static/{resource}

A templated route with parameters:

GET /api/repos/grafana/faro-web-sdk

could become

GET /api/repos/[org]/[repo]

A common practice is for instrumentations to collect a http.route attribute and on span end, update the span name with the contents of http.route, if set. Documentation

glifchits commented 1 week ago

I believe we're encountering a similar issue. After playing around with the sample app in this repo, I'm finding that this issue of high cardinality operations may be particular to NextJS 13.

In the sample app I downgraded to Next ^13, set up some example routes in the pages router, and created some <Link> components to obtain some operations from NextJS client side app rendering. I have the vercel/opentelemetry-collector-dev-setup running and am looking at Jaeger to see how the instrumentation works for the sample app. I've pushed my app changes here: https://github.com/glifchits/otel/pull/1 (to recap, only src/pages and package.json are net new)

When the app is using Next 14, the operations are grouped under the expected route pages-router/[slug] and pages-router but when the app is using Next 13, I'm seeing operations including _next/data.

Next 13 Next 14
image image