OpenTelemetry integration

meskill commented 8 months ago

Description

Provide integration with OpenTelemetry for tailcall, with support for different exporters and configurations depending on users' needs.

User perspective

If the user is not interested in OpenTelemetry, tailcall should work as before, and no additional action should be required from the user.

If the user wants to enable OpenTelemetry output from tailcall, they can use the new @opentelemetry directive on the schema, which specifies where to export the data and in which format.

Example of config:

schema
  @server(port: 8000, graphiql: true, hostname: "0.0.0.0")
  @upstream(baseURL: "http://jsonplaceholder.typicode.com", httpCache: true)
  @opentelemetry(
    export: {
      otlp: {
        url: "https://api.honeycomb.io:443"
        # gather api key from https://ui.honeycomb.io and set it as env when running tailcall
        headers: [{key: "x-honeycomb-team", value: "{{env.HONEYCOMB_API_KEY}}"}]
      }
    }
  ) {
  query: Query
}

In that case, OpenTelemetry data from tailcall will be exported to the provided service, and the responsibility for aggregating and processing that data lies with that external service.
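The header value in the config above uses a `{{env.HONEYCOMB_API_KEY}}` placeholder. As a rough illustration of what resolving such a placeholder involves, here is a minimal stdlib-only sketch. The function name `render_env` and the choice to leave unknown variables untouched are assumptions made for this example; it is not a description of how tailcall actually evaluates templates.

```rust
use std::collections::HashMap;

/// Hypothetical helper: replaces every `{{env.NAME}}` placeholder in
/// `template` with the value found in `env`. Unknown variables are left
/// untouched so that a missing key stays visible in the output.
fn render_env(template: &str, env: &HashMap<String, String>) -> String {
    const OPEN: &str = "{{env.";
    const CLOSE: &str = "}}";
    let mut out = String::with_capacity(template.len());
    let mut rest = template;

    while let Some(start) = rest.find(OPEN) {
        // Copy the literal text that precedes the placeholder.
        out.push_str(&rest[..start]);
        let after_open = &rest[start + OPEN.len()..];
        match after_open.find(CLOSE) {
            Some(end) => {
                let name = &after_open[..end];
                match env.get(name) {
                    Some(value) => out.push_str(value),
                    // Keep the placeholder verbatim if the variable is unset.
                    None => {
                        out.push_str(OPEN);
                        out.push_str(name);
                        out.push_str(CLOSE);
                    }
                }
                rest = &after_open[end + CLOSE.len()..];
            }
            None => {
                // Unterminated placeholder: emit the remainder as-is and stop.
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    // Copy whatever literal text is left after the last placeholder.
    out.push_str(rest);
    out
}
```

In a real setup the map would presumably be filled from the process environment at startup, so the API key never has to be written into the schema file.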

Development perspective

OpenTelemetry provides various Rust crates that implement different aspects of integration into an app.

Core

Core should be able to generate any OpenTelemetry data when needed, in a simple way, preferably without any feature flags inside the code.

For tracing and logs we can use the tracing crate instead of log. Its benefits are that tracing already manages both traces and logs, has built-in methods for creating different wrappers, and its data can be exported as OpenTelemetry data with the tracing-opentelemetry crate.

For metrics we can't use tracing and have to use the functionality of the opentelemetry crates explicitly. Core should send data through the opentelemetry core functionality that is not tied to specific exporters.

[x] migrate from log to tracing
[x] export tracing data to opentelemetry with the example of evaluating field resolver
[x] add metrics support with the example for resolver cache
[x] https://github.com/tailcallhq/tailcall/issues/1260
[x] test apollo studio opentelemetry extension
[x] https://github.com/tailcallhq/tailcall/issues/1261
[x] https://github.com/tailcallhq/tailcall/issues/1262
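The requirement that core records metrics without being tied to a specific exporter can be sketched with plain Rust. This is only a stdlib model of the idea, not the opentelemetry crate API: `Counter`, `MetricSink`, `MemorySink` and `flush` are all hypothetical names, and the metric name `cache.hit_count` is made up for the resolver cache example.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical exporter-agnostic counter: core code only increments it,
/// and whichever sink the environment configures reads the value later.
pub struct Counter {
    name: &'static str,
    value: AtomicU64,
}

impl Counter {
    pub const fn new(name: &'static str) -> Self {
        Self { name, value: AtomicU64::new(0) }
    }

    /// Adds `delta` to the counter; safe to call from many threads.
    pub fn add(&self, delta: u64) {
        self.value.fetch_add(delta, Ordering::Relaxed);
    }

    pub fn get(&self) -> u64 {
        self.value.load(Ordering::Relaxed)
    }

    pub fn name(&self) -> &'static str {
        self.name
    }
}

/// Anything that can receive a metric reading (stdout, OTLP, ...).
pub trait MetricSink {
    fn record(&mut self, name: &str, value: u64);
}

/// Sink that collects readings in memory, useful in tests.
#[derive(Default)]
pub struct MemorySink {
    pub lines: Vec<String>,
}

impl MetricSink for MemorySink {
    fn record(&mut self, name: &str, value: u64) {
        self.lines.push(format!("{}={}", name, value));
    }
}

/// Pushes the current counter value to the sink chosen by the environment.
pub fn flush(counter: &Counter, sink: &mut dyn MetricSink) {
    sink.record(counter.name(), counter.get());
}
```

The point of the split is that the resolver cache would only ever call `add`, while the CLI or WASM environment decides which `MetricSink` exists at all.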

CLI/Native app

The specific environment should define exporters based on the passed configuration. This is done mostly by specific crates for opentelemetry.

The first implementation should start with a couple of the available integrations and should be easily extensible with additional options in the future.

[x] integrate opentelemetry_stdout
[x] integrate opentelemetry_otlp

WASM

[x] https://github.com/tailcallhq/tailcall/issues/1259

Performance

Initial integration with 2 spans and 1 metric doesn't show significant changes in performance.

However, using async_graphql::extensions::OpenTelemetry reduces the overall RPS in the benchmark by 30%. It outputs a lot of spans, most of which are for basically no-op functions for fields with no resolvers. Those could probably be stripped or ignored in some way.
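The idea of ignoring spans for fields without resolvers could look roughly like the following. This is a stdlib-only illustration, not the async_graphql extension API: `Field`, `SpanLog` and `resolve_field` are hypothetical, and a real implementation would create actual spans instead of pushing strings.

```rust
/// Hypothetical description of a field being resolved.
pub struct Field {
    pub name: &'static str,
    /// True when the field has its own resolver and is not a plain
    /// property read from the parent value.
    pub has_resolver: bool,
}

/// Stand-in for a span collector; records span names in order.
#[derive(Default)]
pub struct SpanLog {
    pub spans: Vec<String>,
}

/// Runs `resolve` and records a span only for fields that have a resolver,
/// so no-op fields add no tracing overhead.
pub fn resolve_field<T>(
    log: &mut SpanLog,
    field: &Field,
    resolve: impl FnOnce() -> T,
) -> T {
    if field.has_resolver {
        log.spans.push(format!("field {}", field.name));
    }
    resolve()
}
```

For a list of posts, this would leave one span for the `posts` field and none for the scalar fields of each post.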

Testing

[x] implement integration test to verify that opentelemetry data is captured

meskill commented 7 months ago

Enabling async_graphql::extensions::OpenTelemetry generates a lot of redundant spans for every field of the entity. E.g. for a list of posts it looks like this: