microsoft / ApplicationInsights-ServiceFabric

ApplicationInsights SDK for ServiceFabric projects
MIT License
63 stars 26 forks source link

Remoting Dependency Tracking Reporting Very Long Durations? #99

Open ththiem opened 5 years ago

ththiem commented 5 years ago

I'd like to better understand the numbers being reported by the dependency tracking for this package. After implementing it in our fabric cluster, it is working and we see all the requests in the cluster. What we're finding strange is the time to make remoting V2 requests between services that makes little sense. For example here's a snapshot of what that looks like for 1 request to our service.

image

From what I can tell, it is telling me that the remoting calls are taking ~1.2s each way for a request and its response to travel between 2 services in our cluster. Is this what it's telling me, because this seems extremely odd for remoting V2. Our cluster is running .NET Core 2.2 micro services on the latest version of service fabric.

brahmnes commented 5 years ago

Hi @ththiem,

When you said it's odd, do you mean 2.5 seconds round trip is unreasonable long and you believe it's not measuring correctly? As a quick experiment, you can log separate before and after the call and see if the duration agree.

ththiem commented 5 years ago

Sorry I haven't been able to return to this issue the last few days. I'm going through and trying a before & after log to check now.

ththiem commented 5 years ago

We've been running a lot of tests to poke at this, and there is clearly something that doesn't make sense with what we're seeing in the logs, or something we don't understand. As a simple example, we tracked an event with a start and end time before and after an await call that just makes a RPC request to another service in our fabric (all of which is awaited), and we see conflicting timestamps between our custom events and the automatic dependencies. Here's a (simplified) snippet of our code:

public async Task StartReadAsync()
{
    using (var dataOperation = telemetry.StartOperation<RequestTelemetry>($"ProcessingBufferData"))
    {
        try
        {
            var timeTelemetry = new EventTelemetry("ProcessTime");
            timeTelemetry.Properties["StartTime"] = DateTime.UtcNow.ToString("o");
            ...
            await ProcessBinaryData(streamData).ConfigureAwait(false);
            ...
            timeTelemetry.Properties["EndTime"] = DateTime.UtcNow.ToString("o");
            telemetry.TrackEvent(timeTelemetry);
        }
        catch (Exception ex)
        {
            dataOperation.Telemetry.ResponseCode = "500";
            telemetry.TrackException(ex);
        }
    }
}

public async Task ProcessBinaryData(CallStreamData streamData)
{
    var timedEventTelemetry = new EventTelemetry("BinaryCall");
    timedEventTelemetry.Properties["Start"] = DateTime.UtcNow.ToString("o");

    await SendBinaryDataByRemoting(streamData).ConfigureAwait(false);

    timedEventTelemetry.Properties["End"] = DateTime.UtcNow.ToString("o");
    telemetry.TrackEvent(timedEventTelemetry);
}

private async Task SendBinaryDataByRemoting(StreamData streamData)
{
      try
      {
           var service = ServiceProxy.Create<ISender>(RemoteUri);
           await service.SendStreamDataAsync(streamData).ConfigureAwait(false);
       }
       catch (Exception e)
       {
           telemetry.TrackException(e);
       }
}

And what the logs show: image

image

image

image

Somehow, somewhere, 3 seconds is just "lost." Start and end times are within ~40ms of each other, but the actual call to the other service doesn't start until 3s later, even though the calls are awaited

yantang-msft commented 5 years ago

@ththiem So which telemetry makes more sense to you, the automatic dependency with 3 sec or your manual telemetry with 40ms? I'm wondering whether the automatic/manual telemetry you're looking at are for the same transaction, e.g., it might be slow for the first call to initialize something (3s) but fast (40ms) afterwards.

ththiem commented 5 years ago

Yes they're the same transaction, all under the same operation Id. We have full correlation illustrating this problem. Right now, neither of the telemetry make any sense. It's not queuing the remoting call to run in a background thread and continuing without waiting. It is awaiting it, the code should run serially. The end timestamp should be recorded after the awaited remoting call completes. Instead it's timestamped before the call is even made? It makes no sense, and is making the metrics rather useless to look at since the numbers don't appear to be accurate nor reliable.

yantang-msft commented 5 years ago

This is weird. The simplified code you posted here just looks fine. If you can create a simple project that reproduce this issue, I can take a look and do some debugging.