performance: high latency when using otel libs inside common endpoints

danstarns commented 1 month ago

What happened?

I am writing this issue to express the performance issues I've encountered using your libraries within common Node.js APIs such as HTTP, Express, and GraphQL. I have some links, benchmarks and examples to share with you below.

We are building an observability platform leveraging OTEL JS on the users’ client side. It was reported to us that using our libraries with their APIs resulted in massive latency issues. We did some investigation broken down in full detail here:

https://github.com/rocket-connect/graphql-debugger/issues/322

Our implementation essentially wraps a GraphQL resolver, picking up on values in the tree info argument, capturing attributes, and starting a span. As described in our adjacent issue, we took several necessary steps to ensure it wasn’t an issue with our implementation. Additionally, we created a few benchmarks with and without GraphQL usage to compare performance impacts.

Hypothesis

We see that the code added in the following diff is causing a huge increase in latency to this endpoint.

import opentelemetry from "@opentelemetry/api";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { Resource } from "@opentelemetry/resources";
import {
  BasicTracerProvider,
  SimpleSpanProcessor,
} from "@opentelemetry/sdk-trace-base";
import { SemanticResourceAttributes } from "@opentelemetry/semantic-conventions";

const provider = new BasicTracerProvider({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: "basic-service",
  }),
});

const exporter = new OTLPTraceExporter({});
provider.addSpanProcessor(new SimpleSpanProcessor(exporter));
provider.register();

app.use("/hello", async (req, res) => {
+  const tracer = opentelemetry.trace.getTracer("hello-tracer");
+  const span = tracer.startSpan("hello");
+  span.setAttribute("value", "world");
+  span.end();

  res.json({ message: "Hello World" }).end();
});

Alongside our benchmarks, as detailed below, we saw the following as confirmation of our hypothesis:

Benchmarks

When we ran with our hypothesis, we found a bunch of benchmarks like opentelemetry-java/benchmarks/ and https://github.com/open-telemetry/opentelemetry-js/issues/4171. We found these benchmarks to be not that helpful, and we were unable to compare the two languages or understand how OTEL JS affects performance inside an endpoint.

From this, we created a few types of benchmarks to test our findings and improve our performance.

Isolated OTEL JS Benchmarks

https://github.com/rocket-connect/otel-js-server-benchmarks

These benchmarks demonstrate using OTEL inside a basic endpoint. Our results show that adding just a few lines of OTEL code to your handler will result in a significant reduction in the performance of your API. For example, a basic http endpoint operating at 6.26ms latency more than triples the average time to 22.03ms when OTEL is added, rendering it unusable for any production setting.

GraphQL Debugger Benchmarks

GraphQL Specific fork of isolated benchmarks:

rocket-connect/graphql-debugger/benchmarks

pichlermarc commented 1 month ago

Hi thanks for the detailed report!

I've had a quick look at the benchmark code, and it looks like the SimpleSpanProcessor is used there. It will export every ended span individually causing the exporter to send an HTTP request for each of these spans. The SimpleSpanProcessor is best used for debugging and local development with the ConsoleSpanExporter as it provides quick feedback.

Consider using the BatchSpanProcessor for production use, this one will actually batch spans together and only pass them on to the SpanExporter in a batched manner, which may improve performance by reducing the amount of export requests made significantly.

danstarns commented 1 month ago

Thank you for your response @pichlermarc

I added batch versions of the benchmarks:

Isolated OTEL JS Benchmarks

https://github.com/rocket-connect/otel-js-server-benchmarks

GraphQL Debugger Benchmarks

https://github.com/rocket-connect/graphql-debugger/tree/main/benchmarks

Summary

The benchmarks indicate that using the BatchProcessor is significantly more performant than the SimpleProcessor. Thank you again for pointing this out. I have some questions:

Our benchmarks only produce a single span. Given there is nothing to batch, how does changing this processor affect the performance? Does this mean the SimpleProcessor is exporting the spans inline synchronously? Is this even possible?

Are the differences between the two processors documented anywhere?

Could the logic be wrapped in a setTimeout to prevent blocking the immediate loop?

Would it be wise to add a log in the console when users use SimpleProcessor?

pichlermarc commented 1 month ago

The benchmarks indicate that using the BatchProcessor is significantly more performant than the SimpleProcessor. Thank you again for pointing this out. I have some questions:

Glad to hear that this improved things. :slightly_smiling_face: There's still some performance overhead, but we're hoping to gradually improve this in the future by optimizing the SDK and Exporters.

Our benchmarks only produce a single span. Given there is nothing to batch, how does changing this processor affect the performance?

It batches spans over multiple requests, once the batch is large enough it will then export the spans.

Does this mean the SimpleProcessor is exporting the spans inline synchronously? Is this even possible?

It does some work synchronously, namely serializing the OTLP message and starting the http request it seems to be enough to block the execution.

Are the differences between the two processors documented anywhere?

It is documented on opentelemetry.io, but I think it would also make sense to document this in the JSDoc for the processor.

Could the logic be wrapped in a setTimeout to prevent blocking the immediate loop?

setTimeout or setImmediate would be an option. However, AFAIK it was a conscious design choice that the SimpleSpanProcessor does not defer the execution of the export to a later time.

Would it be wise to add a log in the console when users use SimpleProcessor?

I'd go with a JSDoc addition first, then re-evaluate if we need the log. :slightly_smiling_face:

danstarns commented 1 month ago

Hi @pichlermarc.

There's still some performance overhead, but we're hoping to gradually improve this in the future by optimizing the SDK and Exporters.

I agree on the overhead. Even with using the Batch processor, in the screenshot, from our benchmarks, it shows that adding a few lines of otel js, using the batch processor, increases latency from 5.99ms to 8.58ms plus it impacts kps too.

Due to these findings, I made a PR to move the benchmarks I am using into this repo:

https://github.com/open-telemetry/opentelemetry-js/pull/4795

I have added these benchmarks to the source code of OpenTelemetry as they provide transparent insight into performance plus allow the community to iterate on performance improvements.

setTimeout or setImmediate would be an option. However, AFAIK it was a conscious design choice that the SimpleSpanProcessor does not defer the execution of the export to a later time.

Once we have some benchmarks in place we could experiment with things like this, as introducing a timeout and injecting in the current timestamp should not affect people's applications.

I'd go with a JSDoc addition first, then re-evaluate if we need the log.

Added here:

https://github.com/open-telemetry/opentelemetry-js/pull/4798

open-telemetry / opentelemetry-js