Improve GraphQL semantic conventions

SonjaChevre commented 1 year ago

As GraphQL is gaining popularity as a query language for APIs, we (Tyk Technologies, maintainer of the Tyk open source API Gateway) would like to work on enhancing the existing semantic conventions for GraphQL instrumentation.

What is GraphQL?

GraphQL was created by Facebook in 2012 and was publicly released in 2015; it gained popularity due to its ability to solve data fetching challenges by providing a more efficient and declarative approach to API data querying and manipulation.

More about GraphQL:

GraphQL | A query language for your API
GraphQL Landscape

What are specific observability challenges with GraphQL?

Here is a non-exhaustive list:

1. Error detection

In GraphQL, errors are returned as part of the response payload with a 200 HTTP status code, even for partially successful queries. When monitoring a GraphQL request with OpenTelemetry, this means that the distributed trace usually looks fine (because of the 200 HTTP status code) even when GraphQL is returning errors.
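A minimal Python sketch of the point above: the span status is derived from the `errors` array in the response body instead of from the HTTP status code alone. The function name and the returned keys are hypothetical; a real instrumentation would set the status through the OTel API rather than return a dict.

```python
import json


def classify_graphql_response(http_status: int, body: str) -> dict:
    """Derive a span status from a GraphQL response, not just the HTTP code.

    Returns a dict with hypothetical keys: 'status' ("OK" or "ERROR"),
    'error_count', and 'partial' (True when data and errors coexist).
    """
    if http_status >= 400:
        # Transport-level failure: the HTTP code alone is informative here.
        return {"status": "ERROR", "error_count": 0, "partial": False}
    payload = json.loads(body)
    errors = payload.get("errors") or []
    data = payload.get("data")
    if not errors:
        return {"status": "OK", "error_count": 0, "partial": False}
    # A 200 response carrying an "errors" array is still a failed (or
    # partially failed) operation from the caller's point of view.
    return {
        "status": "ERROR",
        "error_count": len(errors),
        "partial": data is not None,
    }
```

With this approach, a 200 response that carries errors, including a partial result with `data` alongside `errors`, is no longer reported as a success.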

What is the current support for GraphQL in OTel?

Currently, the semantic conventions for GraphQL contain three attributes:

graphql.operation.name: The name of the operation being executed.
graphql.operation.type: The type of the operation being executed (query, mutation, or subscription).
graphql.document: The GraphQL document being executed.
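As an illustration, the Python sketch below builds these three attributes from a request document, using the fully qualified `graphql.*` attribute keys. It extracts the operation type and name with a naive regular expression, which is an assumption made for brevity: a real instrumentation would take them from the parsed document and handle multi-operation documents, where the executed operation is selected by name.

```python
import re

# Matches the first operation definition, e.g. "query GetUser(...) {" or "mutation {".
# Naive regex for illustration only, not a GraphQL parser: it ignores comments,
# leading fragment definitions, and documents containing several operations.
_OPERATION_RE = re.compile(
    r"^\s*(query|mutation|subscription)\b\s*([_A-Za-z][_0-9A-Za-z]*)?"
)


def graphql_span_attributes(document: str) -> dict:
    """Build the three existing GraphQL semantic-convention attributes."""
    attributes = {"graphql.document": document}
    match = _OPERATION_RE.match(document)
    if match:
        attributes["graphql.operation.type"] = match.group(1)
        if match.group(2):
            attributes["graphql.operation.name"] = match.group(2)
    elif document.lstrip().startswith("{"):
        # The shorthand form "{ ... }" is always an anonymous query.
        attributes["graphql.operation.type"] = "query"
    return attributes
```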

There are currently 5 instrumentation libraries for GraphQL:

We have only tried the Node.js instrumentation so far, and noticed that this library does not follow the semantic conventions but does contain much more valuable information that could be standardised.

What are we missing in the semantic conventions?

Non-exhaustive list:

GraphQL errors (error location, error message, …; see the GraphQL specification)
GraphQL query lifecycle (parse, validate, execute, resolve)
Fields (name, path, alias, …)
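To make the lifecycle point concrete, here is a Python sketch that records one span-like entry per phase (parse, validate, execute), including which phase failed. `PhaseRecorder`, the `graphql.<phase>` span names, and the placeholder parse/execute steps are hypothetical stand-ins for a real tracer and GraphQL engine; resolver spans would nest under the execute phase.

```python
import time
from contextlib import contextmanager


class PhaseRecorder:
    """Hypothetical in-memory stand-in for a tracer: one entry per phase."""

    def __init__(self):
        self.spans = []

    @contextmanager
    def phase(self, name: str):
        start = time.perf_counter()
        error = None
        try:
            yield
        except Exception as exc:
            # Remember which phase failed, then let the exception propagate.
            error = type(exc).__name__
            raise
        finally:
            self.spans.append({
                "name": f"graphql.{name}",
                "duration_s": time.perf_counter() - start,
                "error": error,
            })


def run_request(recorder: PhaseRecorder, document: str) -> str:
    # Each lifecycle step gets its own span, so a slow or failing phase is visible.
    with recorder.phase("parse"):
        parsed = document.strip()  # placeholder for a real parser
    with recorder.phase("validate"):
        if not parsed:
            raise ValueError("empty document")
    with recorder.phase("execute"):
        result = f"executed {parsed}"  # placeholder for real execution
    return result
```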

What is the suggested approach?

We are actively working on adding this information to our own GraphQL engine (Universal Data Graph) and would welcome other members of the observability and GraphQL community to join us in improving the semantic conventions.

Looking forward to seeing if this proposal gets any interest!

Sonja

Note: we are also working on another proposal to introduce semantic conventions for API Gateways: https://github.com/open-telemetry/semantic-conventions/issues/183

michaelstaib commented 1 year ago

Hi @SonjaChevre,

I am a GraphQL TSC member and the author of the .NET GraphQL server HotChocolate.

In .NET, we have a far more extensive implementation of OTel for GraphQL that allows for resolver-level instrumentation and also covers the request pipeline. I agree that the currently proposed definitions are not enough.

For resolvers, we only report relevant resolvers; basically, we create spans for resolvers that cause I/O. Since HotChocolate uses execution plans, we also go beyond traditional GraphQL concerns. Interested in connecting?
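The idea of only reporting relevant resolvers can be sketched in Python as a decorator that creates a span only for resolvers flagged as doing I/O. This is not how HotChocolate implements it; `traced_resolver`, the `performs_io` flag, and the in-memory `RECORDED_SPANS` sink are hypothetical.

```python
import functools
import time

RECORDED_SPANS = []  # hypothetical sink standing in for a real span exporter


def traced_resolver(field_path: str, performs_io: bool):
    """Wrap a resolver; only resolvers flagged as doing I/O get a span."""
    def decorator(resolver):
        if not performs_io:
            # Trivial resolvers (e.g. reading a property) are returned unwrapped,
            # which avoids one span per field on large result sets.
            return resolver

        @functools.wraps(resolver)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return resolver(*args, **kwargs)
            finally:
                RECORDED_SPANS.append({
                    "name": f"resolve {field_path}",
                    "duration_s": time.perf_counter() - start,
                })
        return wrapper
    return decorator


@traced_resolver("Query.user", performs_io=True)
def resolve_user(user_id):
    return {"id": user_id, "name": "Ada"}  # pretend this hits a database


@traced_resolver("User.name", performs_io=False)
def resolve_name(user):
    return user["name"]
```

Skipping spans for trivial field resolvers keeps trace size proportional to the I/O performed rather than to the number of fields returned.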

SonjaChevre commented 1 year ago

Hi @michaelstaib - yes, please! What's the best way to connect (CNCF Slack, e-mail, ...)?

benjie commented 1 year ago

"But deprecating fields in GraphQL can become challenging. Deprecating a field can be considered a breaking change, potentially disrupting the functionality of client applications that rely on the deprecated field. Being able to observe which fields are being requested by API clients can help understand the impact of deprecating fields."

Just a note that deprecating fields should not cause any issues for existing queries/clients (i.e. the statement "Deprecating a field can be considered a breaking change" should not be true). Removing deprecated fields could cause issues, which is why monitoring of which fields are actually used is important should you wish to do so. Some GraphQL APIs will deprecate fields but never remove them - essentially telling new clients not to use those fields, but still supporting them for old clients.

Suggested edit:

Removing fields from GraphQL schemas can become challenging. Removing a field is a breaking change that would disrupt the functionality of client applications that rely on the field. To address this, GraphQL allows deprecating fields without removing them. Being able to observe which fields are being requested by API clients can help understand the impact of removing a deprecated field.
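A Python sketch of the kind of analysis that field-level telemetry would enable: given the schema coordinates requested by each operation (as an instrumentation might record them) and the set of deprecated fields, count how often each deprecated field is still used. The input format and the function name are assumptions.

```python
from collections import Counter


def deprecated_field_usage(requested_fields_per_operation, deprecated_fields):
    """Count how many operations requested each deprecated field.

    requested_fields_per_operation: iterable of sets of schema coordinates
        (e.g. {"User.name", "User.email"}) observed per operation.
    deprecated_fields: set of schema coordinates marked @deprecated.
    """
    usage = Counter()
    for fields in requested_fields_per_operation:
        # Only deprecated fields are counted; each operation counts once per field.
        usage.update(fields & deprecated_fields)
    # Deprecated fields never observed in the data are removal candidates.
    unused = sorted(deprecated_fields - set(usage))
    return usage, unused
```

A field with no observed usage is only a removal candidate for the time window and the clients that the telemetry actually covers.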

SonjaChevre commented 1 year ago

Thanks a lot for spotting this, @benjie. I have updated the description.

arielvalentin commented 1 year ago

Inlining the contents of a related discussion here:

We have a use case where we want to capture complex JSON objects in span event attributes. The solution proposed is to serialize the value into a JSON string, but that can be challenging to use effectively in some back end systems.
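For reference, the JSON-string approach mentioned above could look like this in Python. The event name `graphql.error` and the attribute key `graphql.error.json` are hypothetical. The serialization is needed because, in the OTel attribute model at the time of this discussion, attribute values are limited to primitives and arrays of primitives.

```python
import json


def error_event_as_json_string(error: dict) -> dict:
    """Encode a whole GraphQL error object into one string attribute."""
    return {
        "name": "graphql.error",  # hypothetical event name
        "attributes": {
            # sort_keys and compact separators keep the string stable across runs
            "graphql.error.json": json.dumps(
                error, sort_keys=True, separators=(",", ":")
            ),
        },
    }
```

The drawback is the one described above: a back end has to parse the string before it can filter or aggregate on, for example, the error location.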

In the example I linked above, we are trying to represent validation errors (https://graphql.org/learn/validation/) in span events:
{
  "errors": [
    {
      "message": "Field \"name\" must not have a selection since type \"String!\" has no subfields.",
      "locations": [
        {
          "line": 4,
          "column": 10
        }
      ]
    }
  ]
}
Looking at the object, I can see some direct mappings to OTel Trace SemConv attributes, while others may be a little more ambiguous:
{
  "events": [
    {
      "name": "Field \"name\" must not have a selection since type \"String!\" has no subfields.",
      "attributes": {
        "graphql.validation.errors": [
          {
            "code.lineno": 4,
            "code.column": 10
          }
        ]
      }
    }
  ]
}
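One alternative to the mapping above, sketched in Python: keep the message as a string attribute and split the locations into two parallel integer arrays, since attribute values cannot be arrays of objects. All `graphql.error.*` keys and the fixed event name are hypothetical. The sketch avoids `code.lineno`, because the `code.*` attributes describe the instrumented program's source code rather than positions in the GraphQL document.

```python
def flatten_graphql_error(error: dict) -> dict:
    """Map one GraphQL error to a span event using only primitive arrays.

    Locations become two parallel int arrays: index i of each array
    describes the i-th location of the error.
    """
    locations = error.get("locations", [])
    attributes = {
        "graphql.error.message": error["message"],
        "graphql.error.locations.line": [loc["line"] for loc in locations],
        "graphql.error.locations.column": [loc["column"] for loc in locations],
    }
    if "path" in error:
        # Paths mix field names and list indices; stringify to keep the array homogeneous.
        attributes["graphql.error.path"] = [str(p) for p in error["path"]]
    return {"name": "graphql.error", "attributes": attributes}
```

Using a fixed event name instead of the error message keeps event names low-cardinality, while the message stays available as an attribute.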
All that being said, what is the best way for us to represent these validation errors?

Should we include them in our instrumentation at all?

What are other language SIGs doing for these use cases?

arielvalentin commented 1 year ago

👋🏼 @becco @bearcherian @dinonuggies1 @rmosolgo It would be great to get your feedback and input on this.

SonjaChevre commented 1 year ago

Another interesting use case that could be part of this initiative: https://github.com/open-telemetry/semantic-conventions/issues/1011

PascalSenn commented 1 month ago

Just talked to Budha from Tyk at GraphQLConf.

This pull request extends the GraphQL Semantic Conventions: https://github.com/open-telemetry/semantic-conventions/pull/562

I think this would go well together with this proposal.
