Client Library Throws Exception If Agent Is Not Available

noamichael commented 5 years ago

Requirement - what kind of business use case are you trying to solve?

While testing our application, we only deploy the application itself without having to spin up an instance of a jaeger agent or collector.

Problem - what in Jaeger blocks you from solving the requirement?

During a deployment of our app, it's possible the agent/collector is not yet available or will never be available. It seems that when this is the case, the csharp library throws an exception instead of silently failing.

The tracer instance is configured through the environment variables:

JAEGER_SERVICE_NAME: 'dotnetcore-example'
JAEGER_AGENT_HOST: 'jaeger'

And provided via DI:

services.AddOpenTracing();       
services.AddSingleton<ITracer>(serviceProvider =>
{
    var tracer = Jaeger.Configuration.FromEnv(loggerFactory).GetTracer();
    // Allows code that can't use DI to also access the tracer.
    GlobalTracer.Register(tracer);

    return tracer;
});

When the application attempts to bootstrap, the following exception occurs:

Unhandled Exception: System.Net.Internals.SocketExceptionFactory+ExtendedSocketException
--
  | at System.Net.Dns.InternalGetHostByName(String hostName)
  | at System.Net.Dns.GetHostAddresses(String hostNameOrAddress)
  | at System.Net.Sockets.UdpClient.Connect(String hostname, Int32 port)
  | at Jaeger.Senders.UdpSender..ctor(String host, Int32 port, Int32 maxPacketSize)
  | at Jaeger.Configuration.SenderConfiguration.GetSender()
  | at Jaeger.Configuration.ReporterConfiguration.GetReporter(IMetrics metrics)
  | at Jaeger.Configuration.GetTracerBuilder()
  | at Jaeger.Configuration.GetTracer()
  | at app.Startup.<>c__DisplayClass4_0.<ConfigureServices>b__1(IServiceProvider serviceProvider) in /opt/app-root/src/app/Startup.cs:line 56
  | at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitScoped(ScopedCallSite scopedCallSite, ServiceProviderEngineScope scope)
  | at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitConstructor(ConstructorCallSite constructorCallSite, ServiceProviderEngineScope scope)
  | at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitScoped(ScopedCallSite scopedCallSite, ServiceProviderEngineScope scope)
  | at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitConstructor(ConstructorCallSite constructorCallSite, ServiceProviderEngineScope scope)
  | at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitScoped(ScopedCallSite scopedCallSite, ServiceProviderEngineScope scope)
  | at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitIEnumerable(IEnumerableCallSite enumerableCallSite, ServiceProviderEngineScope scope)
  | at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitConstructor(ConstructorCallSite constructorCallSite, ServiceProviderEngineScope scope)
  | at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitScoped(ScopedCallSite scopedCallSite, ServiceProviderEngineScope scope)
  | at Microsoft.Extensions.DependencyInjection.ServiceProviderServiceExtensions.GetRequiredService(IServiceProvider provider, Type serviceType)
  | at Microsoft.Extensions.DependencyInjection.ServiceProviderServiceExtensions.GetRequiredService[T](IServiceProvider provider)
  | at Microsoft.AspNetCore.Hosting.Internal.WebHost.StartAsync(CancellationToken cancellationToken)
  | at Microsoft.AspNetCore.Hosting.WebHostExtensions.RunAsync(IWebHost host, CancellationToken token, String shutdownMessage)
  | at Microsoft.AspNetCore.Hosting.WebHostExtensions.RunAsync(IWebHost host, CancellationToken token)
  | at Microsoft.AspNetCore.Hosting.WebHostExtensions.Run(IWebHost host)
  | at app.Program.Main(String[] args) in /opt/app-root/src/app/Program.cs:line 18

Based on this issue, it would seem that this is not the expected behavior.

If the agent is down and application is still emitting traces, they will be lost until the agent is back

It there anything wrong with my configuration, or is this consider a bug in the client?

Falco20019 commented 5 years ago

Hi, thanks for filing this issue. This is not a bug but defined behavior in Java and C#. There is a difference between the endpoint being down and the agent not being discoverable.

The issue you linked is about the agent being discoverable but not reachable. In this case, the Client will not throw. But if you create an UdpSender using a DNS name instead of an IP, the DNS will be resolved. If this fails, the endpoint will be rejected.

This exception is currently expected since we oriented our behavior after Java (see our unit test).

https://github.com/jaegertracing/jaeger-client-csharp/blob/cdefd23957d1ebaea30bd528b1c64974100f1765/test/Jaeger.Tests/ConfigurationTests.cs#L262-L268

This is only happening with UdpSender. When using an HttpSender (jaeger-collector), this will not happen.

@yurishkuro Is this only the case with Java and C#? Because then it might be better to delay DNS resolution to reporting time and retry on reports.

noamichael commented 5 years ago

@Falco20019

...if you create an UdpSender using a DNS name instead of an IP, the DNS will be resolved. If this fails, the endpoint will be rejected.

The reason we're using a host name is because we deploy within a Kubernetes environment and we don't have access to the pod/container IP. I don't think we can avoid using the host name.

This is only happening with UdpSender. When using an HttpSender (jaeger-collector), this will not happen.

Is this the recommended approach for now - to use the HttpSender (JAEGER_ENDPOINT)? I was hoping to sidecar the agents within the same pod as the application.

Because then it might be better to delay DNS resolution to reporting time and retry on reports.

I think it's better for you to retry whenever a report is made. Perhaps log that the host is unreachable and retry with every report.

Falco20019 commented 5 years ago

When using jaeger as sidecar, the DNS should be resolvable immediately, even when the service is not yet reachable. That’s the regular use case. I think it should only fail if you deploy it without jaeger and therefore when there is no DNS name.

I would make it configurable (though environmental flags) in your case to just disable jaeger completely for the cases when you don’t intend to sidecar it. OpenTracing has a NoopTracer registered to the GlobalTracer for this purpose. So I would suggest using ITracer. I don’t know your exact use case, so I can’t give you a proper suggestion.

I‘m pretty sure over at the java client should be a similar discussion because it existed much longer than the C# implementation. I will also wait for what @yurishkuro has to say on this topic.

noamichael commented 5 years ago

When using jaeger as sidecar, the DNS should be resolvable immediately, even when the service is not yet reachable.

I see, that makes sense. In this case it's the fact that our tests don't include the sidecar. It would still be nice if it didn't fail hard at that point.

Falco20019 commented 5 years ago

I understand your point. We also kept it do be in sync with Java since this is our reference. We use the behavior to check for misconfiguration. For your test environment, it could be enough to set JAEGER_ENDPOINT so that it‘s not failing until we decide if it should fail or not.

Another option (preferred by me) would look like this:

services.AddSingleton<ITracer>(serviceProvider =>
{
    try {
        // This will fail if sidecar is not used.
        var tracer = Jaeger.Configuration.FromEnv(loggerFactory).GetTracer();

        // Allows code that can't use DI to also access the tracer.
        GlobalTracer.Register(tracer);
    } catch {
        // You might want to log this.
        // We left before registering, so GlobalTracer refers to NoopTracer.
    }

    return GlobalTracer.Instance;
});

You could also limit the catch to SocketException In case that you want other exceptions to not be caught.

I have no strong feelings for keeping it or getting rid of it right now, that’s why I want to wait for some more knowledge from @yurishkuro first, as he designed the protocol. I think his go client is ignoring it and not throwing. I just want to avoid that someone else is also using this as wanted behavior. When going to 1.0.0, there will be breaking changes anyway, so we could also change this if decided.

yurishkuro commented 5 years ago

I think it's reasonable to not throw until reporting.

jaegertracing / jaeger-client-csharp

Client Library Throws Exception If Agent Is Not Available #153

Requirement - what kind of business use case are you trying to solve?

Problem - what in Jaeger blocks you from solving the requirement?