Allowance for using Stored Operation IDs for Semantic conventions for GraphQL Server

magicmark commented 1 year ago

What are you trying to achieve?

As an alternative to using the GraphQL query name as the span name, I'd like to use its hash (stored operation ID) instead.

Context

We're looking at aligning our GraphQL infra with the following convention:

https://opentelemetry.io/docs/specs/otel/trace/semantic_conventions/instrumentation/graphql/

However, we use document IDs (a sha256 of the query) to uniquely identify queries across our infra. Our org does not allow arbitrary queries at all - every graphql request is known ahead of time, and the query will live somewhere in our codebase.

Why does this matter in the context of the span name?

Query names are not unique (different queries can be defined with the same name across different platforms, or even older/multiple versions of the same query) leading to confusion when trying to find the query in source code.

My concern is this:

Imagine a situation where one of our oncall engineers (who is unfamiliar with the GQL stack) is trying to diagnose a slow or failing operation - they might reasonably use the operation name (query name) as the lookup key when trying to find the query in our codebase to debug further. As mentioned above, this is very susceptible to finding the wrong thing.

"Just add the operation ID as a tag/attribute"

While it is possible to do this, there's still a footgun here in that folks have to know to click into the sapn to copy the operation ID instead of the span name.

Generally we try and build a path for folks to guide them to do the right thing by default - and the "right thing by default" here would be putting the operation name as the span name imo.

The ask

For GraphQL Servers that respect operation IDs only, the span name guidance could allow for using this instead?

FAQs

What about batching

This is actually a different but related point: the first span to our GQL server will be an HTTP request containing multiple queries

https://www.apollographql.com/blog/apollo-client/performance/query-batching

In a case where a server supports batching, I would propose this (where the span name is inside the square brackets)

-[ POST /graphql ]
----[ some-long-sha-256-hash-foo ]
----[ some-long-sha-256-hash-bar ]
----[ some-long-sha-256-hash-baz ]

What about Automatic Persisted Queries?

Apollo support Automatic Persisted Queries (APQ) which is a sort of hybrid. The server supports arbitrary queries, with the query name being the correct and only way to identify them. But subsequent requests have their raw query bodies substituted for IDs instead (which the server has stored in a lookup table). My understanding is that in this setup, IDs are not intended to be used as globally unique identifiers - and more similar to a cache key instead.

tl;dr servers using Apollo APQ should keep the status quo.

maraisr commented 1 year ago

What a great start to the discussion, thanks! ❤️

I'd like to offer you my perspective and say yeah, I think it is entirely reasonable to include the query hash in the span name, though with a preference to the OperationName.

Much like with REST, and including the resource name as the span name, eg PUT /my/resource. The equivalent for GraphQL could either be the OperationName or the query identifier.

Both could work, as in to put the query name in the span name, with its id as an attribute or the inverse.

[ 123ms ] |--------------|    * POST /graphql
[ 123ms ]     |----|          * 79c2f6ee6f496da71e2262cf786b6c13
[ 123ms ]        |----|       * Query.ProductCard

Though can say, I do prefer the query name as the span name, so a quick glance I know what is going on.

benjie commented 12 months ago

Just a thought, but why not both? OperationName.hash has the advantage of being human readable (mostly) whilst also fixing the uniqueness problem.

trask commented 11 months ago

@open-telemetry/technical-committee can you transfer this issue to https://github.com/open-telemetry/semantic-conventions? thx

Oberon00 commented 11 months ago

I am not sure this needs to be moved, as it seems to be a general question about the span name concept. The span name is not supposed to uniquely identify a code location (although in principle, it can). If you want that, there are code.* attributes

https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#span

The span name concisely identifies the work represented by the Span, for example, an RPC method name, a function name, or the name of a subtask or stage within a larger computation. The span name SHOULD be the most general string that identifies a (statistically) interesting class of Spans, rather than individual Span instances while still being human-readable. That is, "get_user" is a reasonable name, while "get_user/314159", where "314159" is a user ID, is not a good name due to its high cardinality. Generality SHOULD be prioritized over human-readability.

SonjaChevre commented 11 months ago

hey @magicmark! just fyi I have created an issue to suggest working on better GraphQL semantic conventions - and linked your interesting use case: https://github.com/open-telemetry/semantic-conventions/issues/182

Feel free to add feedback in there as well!

open-telemetry / semantic-conventions

Allowance for using Stored Operation IDs for Semantic conventions for GraphQL Server #1011

The ask

FAQs