open-telemetry / semantic-conventions

Defines standards for generating consistent, accessible telemetry across a variety of domains
Apache License 2.0

Improve GraphQL semantic conventions #182

Open SonjaChevre opened 11 months ago

SonjaChevre commented 11 months ago

As GraphQL is gaining popularity as a query language for APIs, we (Tyk Technologies, maintainer of the Tyk open source API Gateway) would like to work on enhancing the existing semantic conventions for GraphQL instrumentation.

What is GraphQL?

GraphQL was created by Facebook in 2012 and was publicly released in 2015; it gained popularity due to its ability to solve data fetching challenges by providing a more efficient and declarative approach to API data querying and manipulation.

More about GraphQL:

GraphQL | A query language for your API
GraphQL Landscape

What are specific observability challenges with GraphQL?

Here is a non-exhaustive list:

1. Error detection

In GraphQL, errors are returned as part of the response body with a 200 HTTP status code, even in the case of partially successful queries. When monitoring a GraphQL request with OpenTelemetry, this means that the distributed trace usually looks OK (because of the 200 HTTP status code) even when GraphQL is returning errors.

See also: GraphQL error specification
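
To illustrate the point, here is a minimal sketch of how an instrumentation could look inside the response body instead of trusting the HTTP status alone. The helper `classify_graphql_response` and the keys it returns are hypothetical and not part of any existing semantic convention; the sketch also assumes a single (non-batched) JSON response.

```python
import json


def classify_graphql_response(http_status: int, body: str) -> dict:
    """Summarise a GraphQL HTTP response for span status purposes.

    Hypothetical helper: the returned keys are illustrative only and are
    not part of any OpenTelemetry semantic convention.
    """
    payload = json.loads(body)
    errors = payload.get("errors") or []
    # A response is "partial" when it carries both errors and some data.
    partial = bool(errors) and payload.get("data") is not None
    # Treat any GraphQL error as a failure, even when HTTP says 200.
    failed = http_status >= 400 or bool(errors)
    return {
        "status": "ERROR" if failed else "UNSET",
        "error_count": len(errors),
        "partial": partial,
    }


result = classify_graphql_response(
    200,
    '{"data": {"user": null}, "errors": [{"message": "User not found"}]}',
)
print(result)
```

Whether a partially successful response should mark the whole span as an error is a design choice that the conventions would need to settle; the sketch simply flags it and exposes the `partial` bit separately.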

2. Performance monitoring

Monitoring a GraphQL server is not straightforward as the performance depends highly on what queries customers are sending.

When using GraphQL for internal APIs that are only accessed by internal clients, the queries won’t change often and we could monitor the performance on a query level. But if our API is available externally, we can have hundreds of slightly different queries.

Performance issues could be related to the query lifecycle (parse, validate, execute, resolve) or to a specific resolver (a function that retrieves or mutates data for a specific field in a GraphQL schema during query execution), depending on the fields requested.
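
As a rough sketch of what per-phase measurement could look like, the snippet below times each lifecycle phase separately. `PhaseTimer` is a hypothetical helper written for this illustration; in a real instrumentation each phase would more likely be a child span of the request span rather than an entry in a dictionary.

```python
import time
from contextlib import contextmanager


class PhaseTimer:
    """Records how long each named phase of a request takes (hypothetical)."""

    def __init__(self) -> None:
        self.durations = {}

    @contextmanager
    def phase(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the phase raises.
            self.durations[name] = time.perf_counter() - start


timer = PhaseTimer()
for name in ("parse", "validate", "execute"):
    with timer.phase(name):
        pass  # placeholder for the real parse/validate/execute work

# The phase that dominated this request.
slowest = max(timer.durations, key=timer.durations.get)
```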

3. Removing deprecated fields

GraphQL is considered "version-free" because it eliminates the need for maintaining different API versions. The shape and structure of data returned are determined by the client's query, allowing seamless evolution and addition of features without breaking existing clients. This flexibility simplifies development and reduces compatibility issues.

Removing fields from GraphQL schemas can become challenging. Removing a field is a breaking change that would disrupt the functionality of client applications that rely on the field. To address this, GraphQL allows deprecating fields without removing them. Being able to observe which fields are being requested by API clients can help understand the impact of removing a deprecated field.
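
A minimal sketch of the kind of analysis this would enable, assuming the server can already report which `Type.field` coordinates each executed query touched. The field names, the `DEPRECATED_FIELDS` set and the `count_deprecated_usage` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical schema metadata: "Type.field" coordinates marked as deprecated.
DEPRECATED_FIELDS = {"User.fullName", "Query.allUsers"}


def count_deprecated_usage(requests):
    """Count how many requests touched each deprecated field.

    `requests` is an iterable of lists of "Type.field" coordinates, assumed
    to have been extracted from each executed query by the server.
    """
    usage = Counter()
    for fields in requests:
        # Count each deprecated field at most once per request.
        usage.update(set(fields) & DEPRECATED_FIELDS)
    return usage


usage = count_deprecated_usage([
    ["Query.user", "User.id", "User.fullName"],
    ["Query.user", "User.fullName", "User.fullName"],
    ["Query.user", "User.id"],
])
# Deprecated fields that no observed request used: candidates for removal.
never_used = DEPRECATED_FIELDS - set(usage)
```

Here `User.fullName` is still requested by two of the three sample requests, while `Query.allUsers` never appears, so only the latter looks safe to remove.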

What is the current support for GraphQL in OTel?

Currently, the semantic conventions for GraphQL contain three attributes:

There are currently 5 instrumentation libraries for GraphQL:

We have only tried the Node.js instrumentation so far, and noticed that this library doesn’t follow the semantic conventions but does contain much more valuable information that could be standardised.

What are we missing in the semantic conventions?

Non-exhaustive list:

What is the suggested approach?

We are actively working on adding this information to our own GraphQL engine (Universal Data Graph) and would welcome other members of the observability and GraphQL communities to join us in improving the semantic conventions.

Looking forward to seeing if this proposal gets any interest!

Sonja

Note: we are also working on another proposal to introduce semantic conventions for API Gateways: https://github.com/open-telemetry/semantic-conventions/issues/183

michaelstaib commented 11 months ago

Hi @SonjaChevre,

I am a GraphQL TSC member and the author of the .NET GraphQL server HotChocolate.

In .NET we have a far more extensive implementation of OTel for GraphQL that allows for resolver-level instrumentation and also covers the request pipeline. I agree that the currently proposed definitions are not enough.

For resolvers, we only report the relevant ones: basically, we create spans for resolvers that cause I/O. Since HotChocolate uses execution plans, we also go beyond traditional GraphQL concerns. Interested in connecting?

SonjaChevre commented 11 months ago

Hi @michaelstaib - yes please! What's the best way to connect (CNCF Slack, e-mail, ...)?

benjie commented 11 months ago

But deprecating fields in GraphQL can become challenging. Deprecating a field can be considered a breaking change, potentially disrupting the functionality of client applications that rely on the deprecated field. Being able to observe which fields are being requested by API clients can help understand the impact of deprecating fields.

Just a note that deprecating fields should not cause any issues for existing queries/clients (i.e. the statement "Deprecating a field can be considered a breaking change" should not be true). Removing deprecated fields could cause issues, which is why monitoring of which fields are actually used is important should you wish to do so. Some GraphQL APIs will deprecate fields but never remove them - essentially telling new clients not to use those fields, but still supporting them for old clients.

Suggested edit:

Removing fields from GraphQL schemas can become challenging. Removing a field is a breaking change that would disrupt the functionality of client applications that rely on the field. To address this, GraphQL allows deprecating fields without removing them. Being able to observe which fields are being requested by API clients can help understand the impact of removing a deprecated field.

SonjaChevre commented 11 months ago

Thanks a lot for spotting this @benjie, I have updated the description.

arielvalentin commented 11 months ago

Inlining the contents of a related discussion here:

We have a use case where we want to capture complex JSON objects in span event attributes. The solution proposed is to serialize the value into a JSON string, but that can be challenging to use effectively in some back end systems.

In the example I linked above, we are trying to represent validation errors (https://graphql.org/learn/validation/) in span events:

{
  "errors": [
    {
      "message": "Field \"name\" must not have a selection since type \"String!\" has no subfields.",
      "locations": [
        {
          "line": 4,
          "column": 10
        }
      ]
    }
  ]
}

Looking at the object I can see some direct mappings to OTel Trace SemConv attributes, while others may be a little more ambiguous:

{
  "events": [
    {
      "name": "Field \"name\" must not have a selection since type \"String!\" has no subfields.",
      "attributes": {
        "graphql.validation.errors": [
          {
            "code.lineno": 4,
            "code.column": 10
          }
        ]
      }
    }
  ]
}

All that being said, what is the best way for us to represent these validation errors?

Should we include them in our instrumentation at all?

What are other language SIGs doing for these use cases?
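
One possible shape, offered only as a sketch: as far as I understand, span and event attribute values are limited to primitives and homogeneous arrays of primitives, so nested location objects could be flattened into parallel arrays instead of being serialized as a JSON string. The event name and the `graphql.error.*` attribute names below are hypothetical, not agreed conventions.

```python
def error_to_event(error: dict) -> dict:
    """Turn one GraphQL error object into a span-event-shaped dict.

    The event and attribute names are hypothetical, not agreed conventions.
    """
    locations = error.get("locations") or []
    return {
        "name": "graphql.validation.error",
        "attributes": {
            "graphql.error.message": error.get("message", ""),
            # Parallel arrays: index i of each list describes location i.
            "graphql.error.locations.line": [loc["line"] for loc in locations],
            "graphql.error.locations.column": [loc["column"] for loc in locations],
        },
    }


response = {
    "errors": [
        {
            "message": 'Field "name" must not have a selection since type '
                       '"String!" has no subfields.',
            "locations": [{"line": 4, "column": 10}],
        }
    ]
}
# One event per GraphQL error, each with flat, primitive-only attributes.
events = [error_to_event(e) for e in response["errors"]]
```

Emitting one event per error keeps every attribute value flat, at the cost of losing the grouping that the single `errors` array provides.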

arielvalentin commented 11 months ago

👋🏼 @becco @bearcherian @dinonuggies1 @rmosolgo It would be great to get your feedback and input on this.

SonjaChevre commented 11 months ago

Another interesting use case that could be part of this initiative: https://github.com/open-telemetry/semantic-conventions/issues/1011