jaegertracing / jaeger

CNCF Jaeger, a Distributed Tracing Platform
https://www.jaegertracing.io/
Apache License 2.0
20.51k stars 2.44k forks source link

[jaeger-v2] Dangerous use of tracer in plugin/storage/grpc/factory.go #5971

Closed yurishkuro closed 2 weeks ago

yurishkuro commented 2 months ago

The factory is passed a tracer that it then passes to OTEL's ToClientConn() function, which automatically adds tracing instrumentation to gRPC connection. That means we may start generating new traces as we're trying to save spans to storage, thus causing an infinite loop of trace generation. The write path must always be devoid of tracing (or use very low sampling rate) to avoid this.

Related upstream issue https://github.com/open-telemetry/opentelemetry-collector/issues/10663

dosubot[bot] commented 1 month ago

To prevent the infinite loop of trace generation caused by the tracer in plugin/storage/grpc/factory.go in Jaeger v2, you need to ensure that the tracer used for gRPC communication does not generate traces for its own operations. This can be achieved by using a TracerProvider that is configured to ignore certain operations or by using a SpanProcessor that filters out these traces.

Here is an example of how you can modify the newRemoteStorage function to achieve this:

func (f *Factory) newRemoteStorage(telset component.TelemetrySettings, newClient newClientFn) (*ClientPluginServices, error) {
    c := f.configV2
    // Create a new TracerProvider that excludes gRPC operations
    tracerProvider := otel.GetTracerProvider()
    // Add a resource to the TracerProvider to identify the gRPC client
    resource := resource.NewWithAttributes(
        semconv.SchemaURL,
        semconv.ServiceNameKey.String("grpc-client"),
    )
    tracerProvider = sdktrace.NewTracerProvider(
        sdktrace.WithResource(resource),
        sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.NeverSample())),
    )

    opts := []grpc.DialOption{
        grpc.WithStatsHandler(otelgrpc.NewClientHandler(otelgrpc.WithTracerProvider(tracerProvider))),
    }
    if c.Auth != nil {
        return nil, fmt.Errorf("authenticator is not supported")
    }

    tenancyMgr := tenancy.NewManager(&c.Tenancy)
    if tenancyMgr.Enabled {
        opts = append(opts, grpc.WithUnaryInterceptor(tenancy.NewClientUnaryInterceptor(tenancyMgr)))
        opts = append(opts, grpc.WithStreamInterceptor(tenancy.NewClientStreamInterceptor(tenancyMgr)))
    }

    remoteConn, err := newClient(opts...)
    if err != nil {
        return nil, fmt.Errorf("error creating remote storage client: %w", err)
    }
    f.remoteConn = remoteConn
    grpcClient := shared.NewGRPCClient(remoteConn)
    return &ClientPluginServices{
        PluginServices: shared.PluginServices{
            Store:               grpcClient,
            ArchiveStore:        grpcClient,
            StreamingSpanWriter: grpcClient,
        },
        Capabilities: grpcClient,
    }, nil
}

In this example, a new TracerProvider is created with a NeverSample sampler, which effectively disables tracing for the gRPC client operations. This prevents the infinite loop of trace generation [1].

To continue talking to Dosu, mention @dosu.

mahadzaryab1 commented 4 weeks ago

@yurishkuro is this blocked on the upstream issue? or is there anything we can do right now to fix this?

yurishkuro commented 3 weeks ago

@mahadzaryab1 there is unfinished PR 5979. It's a bit hacky solution.

I would've preferred for this to be fixed upstream, but that ticket did not see much traction. We could just propose a new hook in OTEL to be usable from binaries like Jaeger, which would allow us to customer the tracer somehow, e.g. by using it with Jaeger Remote Sampler SDK extension which would allow to configure undesired endpoints with 0 probability.

Another option we could try to use is to inject a noop span into the context when the execution comes to places that we do not want to be traced - this will at least prevent downstream tracing into storage writes, but will not help with say healthcheck endpoint tracing (tbh that one could be fixed as a configuration option in the extension itself, to say "disable tracing" - assuming it's an issue today).