registerInstrumentations uses the global provider instead of the one passed in

goleary commented 4 months ago

What happened?

Steps to Reproduce

Create an open telemetry setup with 2 providers. Register one globally and pass the other as tracerProvider to registerInstrumentations

Then run a process which results in the instrumentation generating traces.

Expected Result

The instrumentation will only write to the provider passed as tracerProvider

Actual Result

The instrumentation writes to the global provider (defaultProvider

Additional Details

Basically I want to sample the traces coming from from Prisma, but I do not want to sample the traces that we manually instrument via wrapWithSpan etc. Not sure if there is some other way to achieve this, but it's very confusing that the provider I pass in to registerInstrumentaitons is not used.

OpenTelemetry Setup Code

import * as api from "@opentelemetry/api";
import { AsyncHooksContextManager } from "@opentelemetry/context-async-hooks";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { registerInstrumentations } from "@opentelemetry/instrumentation";
import { Resource } from "@opentelemetry/resources";
import {
  BasicTracerProvider,
  BatchSpanProcessor,
  ConsoleSpanExporter,
  SimpleSpanProcessor,
  TraceIdRatioBasedSampler,
} from "@opentelemetry/sdk-trace-base";
import {
  SEMRESATTRS_DEPLOYMENT_ENVIRONMENT,
  SEMRESATTRS_SERVICE_NAME,
  SEMRESATTRS_SERVICE_VERSION,
} from "@opentelemetry/semantic-conventions";
import { PrismaInstrumentation } from "@prisma/instrumentation";
import { ReploError } from "schemas/errors";

import { gitSha, reploEnv } from "../env";

export function openTelemetrySetup() {
  // NOTE (Gabe 2024-05-07): a context manager is required to propagate the
  // context for span nesting
  const contextManager = new AsyncHooksContextManager().enable();
  api.context.setGlobalContextManager(contextManager);

  // NOTE (Gabe 2024-05-07): This uses OTEL_EXPORTER_OTLP_XXXX env vars for
  // configuration.
  const otlpTraceExporter = new OTLPTraceExporter();

  const defaultProvider = new BasicTracerProvider({
    resource: new Resource({
      [SEMRESATTRS_SERVICE_NAME]: "replo-publisher",
      [SEMRESATTRS_DEPLOYMENT_ENVIRONMENT]: reploEnv,
      [SEMRESATTRS_SERVICE_VERSION]: gitSha,
    }),
  });
  const sampledProvider = new BasicTracerProvider({
    // NOTE (Gabe 2024-05-07): Sampling is enabled here to limit the performance
    // impact.
     sampler: new TraceIdRatioBasedSampler(0.1),
    resource: new Resource({
      [SEMRESATTRS_SERVICE_NAME]: "replo-publisher-prisma",
      [SEMRESATTRS_DEPLOYMENT_ENVIRONMENT]: reploEnv,
      [SEMRESATTRS_SERVICE_VERSION]: gitSha,
    }),
  });

  if (reploEnv === "production") {
    sampledProvider.addSpanProcessor(new BatchSpanProcessor(otlpTraceExporter));
    defaultProvider.addSpanProcessor(new BatchSpanProcessor(otlpTraceExporter));
  } else {
    // NOTE (Gabe 2024-05-07): this can be uncommented to write traces to the
    // console during development.
    sampledProvider.addSpanProcessor(
      new SimpleSpanProcessor(new ConsoleSpanExporter()),
    );
    defaultProvider.addSpanProcessor(
      new SimpleSpanProcessor(new ConsoleSpanExporter()),
    );
  }

  defaultProvider.register();
  api.trace.setGlobalTracerProvider(defaultProvider);

  registerInstrumentations({
    instrumentations: [new PrismaInstrumentation()],
    tracerProvider: sampledProvider,
  });
}
const tracer = api.trace.getTracer("replo-publisher");

export async function wrapWithSpan<R>(
  fn: () => Promise<R>,
  name: string,
  parentContext: api.Context,
): Promise<R> {
  const span = tracer.startSpan(name, {}, parentContext);
  try {
    const result = await fn();
    span.setStatus({ code: api.SpanStatusCode.OK });
    return result;
  } catch (error: unknown) {
    let message: string | undefined;
    if (error instanceof ReploError) {
      message = error.message;
    }
    span.setStatus({
      code: api.SpanStatusCode.ERROR,
      message,
    });
    throw error;
  } finally {
    span.end();
  }
}

package.json

{
  "name": "",
  "private": true,
  "version": "0.0.0",
  "workspaces": {
    "nohoist": [
      "react"
    ]
  },
  "scripts": {
    "dev": "dotenv -e ../../.env -- concurrently -k --names \"Esbuild,Nodemon\" -c \"bgBlue.bold,bgMagenta.bold,bgRed.bold\" \"yarn watch:build\" \"yarn watch:server\"",
    "watch:build": "tsx scripts/build.mts --watch=true",
    "watch:server": "doppler run -p publisher -c dev -- nodemon \"./dist/main.js\" --watch \"./dist\"",
    "build": "tsx scripts/build.mts",
    "docker-build": "docker build -t replo-publisher -f Dockerfile ../../ --build-arg=\"COMMIT=$(git rev-parse HEAD)\"",
    "docker-run": "./scripts/docker-run.sh",
    "start": "node dist/main.js",
    "start-ci": "IS_TESTING=true doppler run -p publisher -c dev -- yarn start",
    "lint": "eslint 'src/**/*.{js,jsx,ts,tsx}' --fix --max-warnings=0",
    "lint:report": "eslint --output-file eslint_report.json --format json 'src/**/*.{js,jsx,ts,tsx}' --max-warnings=0",
    "tsc": "tsc",
    "prettier": "prettier --write .",
    "test": "IS_TESTING=true doppler run -p publisher -c dev -- vitest --config=vite.unit.config.mts --run",
    "watch-test": "vitest --config=vite.unit.config.mts",
    "integration:dev": "IS_TESTING=true doppler run -p publisher -c dev -- vitest --config=vite.integration.config.mts",
    "integration": "IS_TESTING=true doppler run -p publisher -c dev -- vitest --config=vite.integration.config.mts --run",
    "start:test": "concurrently -k --names \"Esbuild,Nodemon\" \"yarn watch:build\" \"yarn watch:server\"",
    "clean": "rm -rf node_modules && rm -rf dist && rm -rf .turbo",
    "graphql:codegen": "doppler run -p publisher -c dev -- graphql-codegen --config graphql-codegen.ts",
    "patch": "patch-package"
  },
  "dependencies": {
    "@aws-sdk/client-s3": "^3.238.0",
    "@fastify/autoload": "5.8.0",
    "@fastify/cors": "9.0.1",
    "@fastify/helmet": "11.1.1",
    "@fastify/multipart": "8.2.0",
    "@fastify/rate-limit": "9.1.0",
    "@fastify/static": "7.0.3",
    "@immobiliarelabs/fastify-sentry": "8.0.1",
    "@logdna/logger": "^2.6.6",
    "@opentelemetry/api": "^1.8.0",
    "@opentelemetry/context-async-hooks": "^1.24.1",
    "@opentelemetry/exporter-trace-otlp-http": "^0.51.1",
    "@opentelemetry/instrumentation": "^0.51.1",
    "@opentelemetry/resources": "^1.24.1",
    "@opentelemetry/sdk-trace-base": "^1.24.1",
    "@opentelemetry/sdk-trace-node": "^1.24.1",
    "@opentelemetry/semantic-conventions": "^1.24.1",
    "@prisma/client": "^5.13.0",
    "@prisma/instrumentation": "^5.13.0",
    "@prisma/nextjs-monorepo-workaround-plugin": "^5.13.0",
    "@segment/analytics-node": "^0.0.1-beta.17",
    "@sentry/core": "7.112.1",
    "@sentry/integrations": "7.112.1",
    "@sentry/node": "7.112.1",
    "@trpc-limiter/memory": "0.0.3",
    "@trpc/server": "next",
    "@types/async-retry": "^1.4.8",
    "@types/safe-compare": "^1.1.2",
    "add": "^2.0.6",
    "async-retry": "^1.3.3",
    "autoprefixer": "^10.4.14",
    "customerio-node": "^4.1.1",
    "esbuild": "0.17.19",
    "esbuild-plugin-pino": "2.1.1",
    "esbuild-sass-plugin": "2.9.0",
    "fastify": "4.25.2",
    "fastify-plugin": "4.5.1",
    "fastify-raw-body": "4.3.0",
    "fastify-sse-v2": "^4.0.0",
    "fastify-type-provider-zod": "1.1.9",
    "fs-extra": "^11.1.0",
    "google-auth-library": "^9.6.2",
    "got": "^12.5.2",
    "graphql-request": "^6.1.0",
    "graphql-tag": "^2.11.0",
    "jsonrepair": "^3.8.0",
    "jsonwebtoken": "^9.0.0",
    "lodash-es": "^4.17.21",
    "node-fetch": "^3.3.1",
    "node-html-parser": "^6.1.5",
    "openai": "^4.32.1",
    "patch-package": "^8.0.0",
    "pino": "9.0.0",
    "pino-pretty": "11.0.0",
    "postcss": "^8.4.24",
    "postcss-minify": "^1.1.0",
    "posthog-node": "^3.1.3",
    "prom-client": "14.2.0",
    "query-string": "^8.1.0",
    "react": "^18.2.0",
    "react-dom": "^18.2.0",
    "replo-database": "*",
    "replo-logger": "*",
    "replo-runtime": "*",
    "replo-shared": "*",
    "replo-utils": "*",
    "retoolrpc": "^0.1.6",
    "safe-compare": "^1.1.4",
    "sass": "1.58.1",
    "schemas": "*",
    "stripe": "^14.19.0",
    "superjson": "2.2.1",
    "tinycolor2": "^1.6.0",
    "uuid": "9.0.1",
    "workerpool": "^9.1.1",
    "zod": "^3.19.1",
    "zod_utilz": "^0.7.3",
    "zx": "^7.2.3"
  },
  "devDependencies": {
    "@graphql-codegen/cli": "4.0.1",
    "@graphql-codegen/typescript": "^4.0.0",
    "@graphql-codegen/typescript-graphql-request": "^5.0.0",
    "@graphql-codegen/typescript-operations": "^4.0.0",
    "@sentry/esbuild-plugin": "2.16.1",
    "@types/fs-extra": "^9.0.13",
    "@types/jsonwebtoken": "^9.0.2",
    "@types/lodash-es": "^4.17.6",
    "@types/node": "^20.12.4",
    "concurrently": "^7.3.0",
    "css-loader": "^6.7.3",
    "dotenv": "^16.0.1",
    "dotenv-cli": "^7.3.0",
    "eslint": "^8.47.0",
    "eslint-config-replo": "*",
    "graphql": "^15.8.0",
    "nodemon": "^2.0.20",
    "prettier": "3.2.5",
    "tsconfig-replo": "*",
    "tsx": "4.10.5",
    "typescript": "^5.4.5",
    "vite": "5.0.12",
    "vitest": "1.2.2"
  },
  "engines": {
    "node": "20.x"
  }
}

Relevant log output

replo-publisher:dev: [Nodemon]   resource: {
replo-publisher:dev: [Nodemon]     attributes: {
replo-publisher:dev: [Nodemon]       'service.name': 'replo-publisher',
replo-publisher:dev: [Nodemon]       'telemetry.sdk.language': 'nodejs',
replo-publisher:dev: [Nodemon]       'telemetry.sdk.name': 'opentelemetry',
replo-publisher:dev: [Nodemon]       'telemetry.sdk.version': '1.24.1',
replo-publisher:dev: [Nodemon]       'deployment.environment': 'development',
replo-publisher:dev: [Nodemon]       'service.version': undefined
replo-publisher:dev: [Nodemon]     }
replo-publisher:dev: [Nodemon]   },
replo-publisher:dev: [Nodemon]   traceId: '12612a54851409041dcd25156a61bdd2',
replo-publisher:dev: [Nodemon]   parentId: '2d5668bd154cce8b',
replo-publisher:dev: [Nodemon]   traceState: undefined,
replo-publisher:dev: [Nodemon]   name: 'prisma:engine',
replo-publisher:dev: [Nodemon]   id: '059228e56d19b1e4',
replo-publisher:dev: [Nodemon]   kind: 0,
replo-publisher:dev: [Nodemon]   timestamp: 1718070662384988,
replo-publisher:dev: [Nodemon]   duration: 2000,
replo-publisher:dev: [Nodemon]   attributes: {},
replo-publisher:dev: [Nodemon]   status: { code: 0 },
replo-publisher:dev: [Nodemon]   events: [],
replo-publisher:dev: [Nodemon]   links: []
replo-publisher:dev: [Nodemon] }

pichlermarc commented 4 months ago

It looks like the primsa instrumentation does not use that tracer that's passed in, but gets the tracer directly from the API. This does not look like it's a bug in registerInstrumentation().

Is there a specific reason why you want to sample traces that are created by the Prisma instrumentation, but not your own? The sampled flag is part of the trace context, as it's not common to sample part of a trace (dropping prisma spans sometimes, keeping all manually instrumented spans, if I understand correctly) - that's also one of the reasons why Span ID is not part of the sampling decision. Telemetry will be inconsistent this way and it'll always look like there's stuff missing.

goleary commented 4 months ago

Got it, yeah looks like a bug on Prisma's implementation, guess we can close this one. Thanks for the quick response and investigation!

The reason I have it set up this way is that I want to use sampling on Prisma traces because we generate tons of them and I do not want to affect the performance of our application, nor do we actually need all of the. I do NOT want to sample the manually instrumented traces because we only use them for certain key processes where we want as much visibility as possible.

Not sure I understand what you're suggesting. My assumption is that because the sampling is configured on the provider and that if I have two separate providers then I should be able to sample certain traces and not others. Is there some other way I can achieve this behavior?

goleary commented 4 months ago

hey @pichlermarc I closed the issue because I don't believe it's a problem with this repo, but I would still love you input on the above when you get a second 🙏

The reason I have it set up this way is that I want to use sampling on Prisma traces because we generate tons of them and I do not want to affect the performance of our application, nor do we actually need all of the. I do NOT want to sample the manually instrumented traces because we only use them for certain key processes where we want as much visibility as possible. Not sure I understand what you're suggesting. My assumption is that because the sampling is configured on the provider and that if I have two separate providers then I should be able to sample certain traces and not others. Is there some other way I can achieve this behavior?

pichlermarc commented 4 months ago

The reason I have it set up this way is that I want to use sampling on Prisma traces because we generate tons of them and I do not want to affect the performance of our application, nor do we actually need all of the. I do NOT want to sample the manually instrumented traces because we only use them for certain key processes where we want as much visibility as possible.

Not sure I understand what you're suggesting. My assumption is that because the sampling is configured on the provider and that if I have two separate providers then I should be able to sample certain traces and not others. Is there some other way I can achieve this behavior?

Ah I see. The Sampler is a user-implementable interface for that reason. You could implement a sampler that in essence delegates to a AlwaysOnSampler for your own spans and a TraceIdBasedRatioSampler for the others (you could distingush them by their attributes, or by span name).

Then you can combine that with a ParentBasedSampler that will delegate the sampling decision to the your custom sampler if the span is a root span.

This way traces you started will always be kept, and traces you did not start will not be kept in n% of cases. This way you have one TracerProvider with a custom sampler that will make a sampling decision based on your criteria (and respect previous sampling decisions to avoid broken traces).

goleary commented 4 months ago

mmm that makes a lot of sense, I'll look into doing that.

The reason I didn't see my setup as being problematic is that the Prisma traces and our manual traces are entirely separate, neither is a parent of the other, so we wouldn't get into a situation where part of a trace disappears, but it sounds like y'all have better way to deal with this type of situation.

Thanks!!

goleary commented 4 months ago

I implemented a Sampler like you suggested and it works great, thanks again for the guidance!

pichlermarc commented 4 months ago

Glad I could help :)

open-telemetry / opentelemetry-js