Add client spans around calls to upstream

huggsboson commented 11 months ago

Is your feature request related to a problem? Please describe

Services usually have a server span upon request entry and client span (span.kind = client) around calls to upstream.

This helps with a few things:

You can diagnose whether time is being spent or errors in the service itself or its upstreams.
Tracing products use this information for useful visualisations e.g. "inferred services" for services that haven't adopted tracing it still shows up as a dependency.

Describe the solution you'd like

Output a new span on each calls to upstream
Allow adding attributes/baggage only to that client span

Describe alternatives you've considered

Putting a proxy between nginx and its upstreams.

vladimirkokshenev commented 11 months ago

Nginx has a set of variables associated used to measure time: $request_time, $session_time, $connection_time, $upstream_connect_time, $upstream_first_byte_time, $upstream_header_time, $upstream_response_time, $upstream_queue_time, $upstream_session_time. One can use these variables in spans to get visibility into that.

The $request_time variable is a non-cacheable variable. This means that nginx computes it every time you call it. And this can be used in a very powerful way. If you call the $request_time variable at the right moment during the request processing (and you can do it during different moments), you can get additional measurements.

For example, consider the following configuration:

server {
...
    map $request_time $proxy_delay {
        default $request_time;
    }

    map $proxy_delay $stub {
        default '';
    }
...
    location / {
        proxy_pass http://u/;
        proxy_set_header x-stub $stub;
    }
}

In the nginx configuration provided, we combine the non-cacheable variable $request_time with cacheable variables $proxy_delay and $stub from the map statement. We calculate $request_time only once during the calculation of $stub variable for proxy_set_header directive when we form a request to upstream. As we need to cache $request_time value to reuse during the log phase (if you refer it nginx-otel span configuration), we wrap it with the map from $proxy_delay. map variables are cacheable, so the value of $proxy_delay is cached when we call it the first time (in our example, we do it when we reached the point where we prepare a request to be proxied to upstream). In order to avoid sending unnecessary header “x-stub”, we use another map wrapper that always maps to the empty value (‘’). Proxy_set_header directive does not add an empty header to a proxied request.

Using such patter one can get a rich set of time measurements. There are some internals though that useful to know. For example, nginx computes time once per even loop pass. So this is not an exact precision. But in 99.99% cases it should be good enough.

Will this solve the use cases your shared?

huggsboson commented 11 months ago

It sounds like a possible workaround, but creating client spans around your upstream calls is the standard way of doing these things and like I mentioned a ton of observability tools have built useful visualizations around that. Is there context/a reason creating extra spans at the appropriate moments (especially client spans) isn't a good fit for nginx?

p-pautov commented 8 months ago

This is reasonable request, but Nginx doesn't provide clear "hook" points around upstream calls in all cases. The only option I can think of so far is to introduce new otel_trace_upstream directive in upstream block, but such directive will be required in every upstream block and it won't cover cases without explicit upstream... PR linked above seems to be adding client spans inside server spans, not around calls to upstream, so both of these spans will have the same timings.

pyohannes commented 8 months ago

I agree that the ability to create client spans would be a very useful feature.

Many experience around OpenTelemetry traces are built on the assumption that one service calling another service synchronously is modeled by a parent client span and a child server span. HTTP client instrumentation creates a client span, HTTP server framework instrumentations create a server span (client -> server).

If you now have nginx between the instrumented HTTP client and the instrumented HTTP server framework, you end up with a span structure client -> server -> server, which for most tools and backends doesn't make sense, and can break experiences like service graphs. A span structure like client -> server (nginx) -> client (nginx) -> server allows for a much better experience, and cleanly integrate nginx as an intermediate service.

PR linked above seems to be adding client spans inside server spans, not around calls to upstream, so both of these spans will have the same timings.

It would be great to have correct timings, however, given the status quo having client and server spans with the same timings are the lesser evil (as opposed to the client -> server -> server span structure).

bmerigan commented 8 months ago

I found this while searching for information about adding spans to subrequests.

In my situation: frontend -> nginx -> 2 subrequests -> proxy to backend

p-pautov commented 7 months ago

@bmerigan, subrequests tracing might deserve its own issue. Also, can you share config example so it's clearer what those subrequests are used for (e.g. authentication)?

bmerigan commented 7 months ago

@p-pautov yes it is indeed for authentication (oauth2) and authorization (opa). I'd like those 2 sub_requests to get recorded as well.

dkrizic commented 7 months ago

Just to demonstrate this problem, here is an example of how Azure Monitor is connecting ingress-nginx with my backend:

As you can see, frontend (yasm-frontend) is talking to ingress-nginx, but the connection from ingress-nginx to backend (yasm-backend) is not being drawn.

Same for Grafana/Tempo:

and backend separately:

cedricziel commented 1 month ago

@dkrizic has a very good illustration why multiple spans are needed for traditional service graph metrics.

I think we can assess that the current behavior of the extension is not spec-conformant and that it is not implementation specific. I think the minimum this extension should do is to adhere to the spec so downstream consumers can rely on the spec. Deviating from the spec is currently not an option for Grafana Labs and I would wonder if the service graph connector in the otel collector, the canonical servicegraph generator in Otel, would consider deviating - probably not.

We [Grafana Labs] see a couple of use-cases in our user-base that are directly related to this ticket and nginx technology:

with the current semantics in the nginx module, the service graph metrics are basically not usable as the way that nginx uses "just" server spans, breaks the calculation that's rooted in the otel specification
our users would love to understand the central role that nginx plays in their landscape and the performance when using Server Side Includes (SSI). A possible solution to this problem includes generating a kind=client span when calling a 3rd party system for the SSI documents and propagating context in these calls
generating additional client spans when reaching out to a proxy upstream so there is a spec conformant trace they can inspect

Any help with these would be greatly appreciated - there are proprietary extensions to nginx that can do all of the above and it would change the game for open-source rooted observability with nginx if that was possible by default.

Why is this issue so important?

Let me spend a few characters on why we think this should be a priority: nginx is a strategic and central component in many infrastructure settings. It is, because the technology is versatile, reliable and efficient. - However with how the otel implementation is set up for now, it is not observable as it will break any trace and service graph - our recommendation can only be to not use the extension as long as the extension does not conform to the spec. - "Fixing" this, would make this the go-to recommendation for any Otel user, a viable option and probably increase adoption.

nginxinc / nginx-otel