jonwagner / EventSourceProxy

EventSourceProxy (ESP) is the easiest way to add scalable Event Tracing for Windows (ETW) logging to your .NET program
Other
97 stars 20 forks source link

EventSourceProxy causes invalid trace output #10

Closed mot256 closed 10 years ago

mot256 commented 10 years ago

We have found that when we we use ESP we get invalid trace output. I have uploaded a test solution to dropbox.

Before you run the test solution, use PerfView or (as in our case) use a custom ETW event consumer implemented with TraceEvent package to monitor/capture events on the "SSE-SCS" EventSource.

You will notice in Program.cs there are 4 events being logged each time you press a key. Once these events are logged we expect the following output to the trace log (simplified here):

ID=1, FormattedMessage=Server root starting.
ID=2, FormattedMessage=Server root started.
ID=3, FormattedMessage=Server root shutting down.
ID=4, FormattedMessage=Server root shut down.

But we get the following output:

ID=1, FormattedMessage=Server root starting.
ID=1034, FormattedMessage=
ID=3, FormattedMessage=Server root shutting down.
ID=4, FormattedMessage=Server root shut down.

Now here is the part that makes me suspect that it is the ESP causing the issue: Un-comment the following line(s) in `ScsEventSource.ServerRoot.cs' and you will notice that the issue disappears (at least for the test scenario):

    [Event(EventIdStart_ServerRoot + 4, Level = EventLevel.Critical, Message = "Failed to start server: {0}")]
    public abstract void ServerRootStartServerError(string error, [TraceAsData]Exception ex);

In our live system the output is even more scrambled (e.g. events appear in the trace event log that never even happened).

This is a real show stopper.

P.S. If you know of a better way to upload attachments, please let me know.

mot256 commented 10 years ago

Just to add another note: you will notice that the test app uses an "in proc" output of the trace event to the console window (using SemanticLogging's ObservableEventListener). You will notice that the output is as expected in the console, but not as expected in the "out of proc" trace consumers. My suspicion is that the manifest files supplied to the ETW framework is not correct.

jonwagner commented 10 years ago

I downloaded the code and ran it by itself. I got events 1-4 correctly. I'll take a look at perfview now.

jonwagner commented 10 years ago

PerfView is showing this in the log, so there is definitely something wrong.

Started: Opening Events Completed: Opening Events (Elapsed Time: 0.008 sec) Started: Scanning Events Completed: Scanning Events (Elapsed Time: 0.011 sec) Exception Occured: System.Exception: Reading past end of event at Diagnostics.Tracing.TraceEvent.GetUnicodeStringAt(Int32 offset) at Diagnostics.Tracing.Parsers.DynamicTraceEventData.PayloadValue(Int32 index) at PerfView.ETLEventSource.d3.MoveNext() at PerfView.EventWindow.<>cDisplayClass1c.b18() at PerfView.StatusBar.<>cDisplayClass8.b__6(Object param0) An exceptional condition occured, see log for details.

jonwagner commented 10 years ago

I was able to reproduce the PerfView issue with the following concrete EventSource implementation. There appears to be an issue related to Opcodes.Start/Stop. I was also able to fix your code in PerfView by changing all of the Start/Stop Opcodes to Opcodes.Info.

If PerfView is having an issue with a concrete (non-ESP) class, then I'm leaning towards a framework/ETW bug. Can you run the following code with your custom trace listener and see what happens? (You may need to add a few more events.)

namespace ESP_Bug { [EventSource(Guid = "fea36202-ef9a-5b43-9f3d-1099f9a29192")] public class ConcreteEventSource : EventSource { [Event(1)] public void A() { WriteEvent(1); }

    // FIX: remove the start opcode
    [Event(5, Opcode = EventOpcode.Start)]
    //[Event(5)]
    public void B(string error, string ex)
    {
        WriteEvent(5, error, ex);
    }
}

class Program
{
    static void Main(string[] args)
    {
        var log = new ConcreteEventSource();

        log.A();
        log.B("error", null);
    }
}

}

mot256 commented 10 years ago

We've tried breaking it with the concrete implementation as you mentioned above (that still include the Start/Stop opcodes), but all works fine. But we have noticed a difference between the manifest being generated by the concrete implementation and the ESP implementation.

With the following concrete implementation:

 [EventSource(Guid = "fea36202-ef9a-5b43-9f3d-1099f9a29192")]
public class ConcreteEventSource : EventSource
{
    public static ConcreteEventSource Log = new ConcreteEventSource();
    [Event(1)]
    public void A()
    {
        WriteEvent(1);
    }

    // FIX: remove the start opcode
    [Event(5, Opcode = EventOpcode.Start)]
    //[Event(5)]
    public void B(string error, string ex)
    {
        WriteEvent(5, error, ex);
    }
}

We get the following generated manifest:

<instrumentationManifest xmlns="http://schemas.microsoft.com/win/2004/08/events">
 <instrumentation xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:win="http://manifests.microsoft.com/win/2004/08/windows/events">
  <events xmlns="http://schemas.microsoft.com/win/2004/08/events">
<provider name="ConcreteEventSource" guid="{fea36202-ef9a-5b43-9f3d-1099f9a29192}" symbol="ConcreteEventSource">
 <tasks>
  <task name="A" value="65533"/>
 </tasks>
 <opcodes>
 </opcodes>
 <events>
  <event value="1" version="0" level="win:Informational" task="A"/>
  <event value="5" version="0" level="win:Informational" opcode="win:Start" template="BArgs"/>
 </events>
 <templates>
  <template tid="BArgs">
   <data name="error" inType="win:UnicodeString"/>
   <data name="ex" inType="win:UnicodeString"/>
  </template>
 </templates>
</provider>
</events>
</instrumentation>
<localization>
 <resources culture="en-US">
  <stringTable>
  </stringTable>
 </resources>
</localization>
</instrumentationManifest>

But with the following ESP implementation:

[EventSource(Guid = "fea36202-ef9a-5b43-9f3d-1099f9a29192")]
public abstract class ESPEventSource : EventSource
{
    public static ESPEventSource Log = EventSourceProxy.EventSourceImplementer.GetEventSourceAs<ESPEventSource>();
    [Event(1)]
    public abstract void A();

    // FIX: remove the start opcode
    [Event(5, Opcode = EventOpcode.Start)]
    //[Event(5)]
    public abstract void B(string error, string ex);
}

We get the following generated manifest:

<instrumentationManifest xmlns="http://schemas.microsoft.com/win/2004/08/events">
 <instrumentation xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:win="http://manifests.microsoft.com/win/2004/08/windows/events">
  <events xmlns="http://schemas.microsoft.com/win/2004/08/events">
<provider name="ESPEventSource" guid="{fea36202-ef9a-5b43-9f3d-1099f9a29192}" symbol="ESPEventSource">
 <tasks>
  <task name="A_Completed" value="65527"/>
  <task name="A_Faulted" value="65528"/>
  <task name="A" value="65533"/>
 </tasks>
 <opcodes>
 </opcodes>
 <keywords>
  <keyword name="A"  message="$(string.keyword_A)" mask="0x1"/>
  <keyword name="B"  message="$(string.keyword_B)" mask="0x2"/>
 </keywords>
 <events>
  <event value="1" version="0" level="win:Informational" keywords="A" task="A"/>
  <event value="6" version="0" level="win:Error" message="$(string.event_A_Faulted)" keywords="A" task="A_Faulted" template="A_FaultedArgs"/>
  <event value="7" version="0" level="win:Informational" message="$(string.event_A_Completed)" keywords="A" task="A_Completed"/>
  <event value="5" version="0" level="win:Informational" keywords="B" opcode="win:Start" template="BArgs"/>
  <event value="8" version="0" level="win:Error" message="$(string.event_B_Faulted)" keywords="B" opcode="win:Start" template="B_FaultedArgs"/>
  <event value="9" version="0" level="win:Informational" message="$(string.event_B_Completed)" keywords="B" opcode="win:Start"/>
 </events>
 <templates>
  <template tid="A_FaultedArgs">
   <data name="" inType="win:UnicodeString"/>
  </template>
  <template tid="BArgs">
   <data name="error" inType="win:UnicodeString"/>
   <data name="ex" inType="win:UnicodeString"/>
  </template>
  <template tid="B_FaultedArgs">
   <data name="" inType="win:UnicodeString"/>
  </template>
 </templates>
</provider>
</events>
</instrumentation>
<localization>
 <resources culture="en-US">
  <stringTable>
   <string id="event_A_Completed" value=""/>
   <string id="event_A_Faulted" value="%1"/>
   <string id="event_B_Completed" value=""/>
   <string id="event_B_Faulted" value="%1"/>
   <string id="keyword_A" value="A"/>
   <string id="keyword_B" value="B"/>
  </stringTable>
 </resources>
</localization>
</instrumentationManifest>

I do not know whether the difference in manifest is causing the issue, but I have a very big concern that extra event methods are generated by ESP (ending in "_Faulted" and "_Completed") that we firstly did not intend to be there and secondly (but even more of a problem) is that the generated event methods take event IDs that are not used in the abstract EventSource class. The problem with this is that, say we have events ending at event ID 5, now ESP generates the event methods ending in "_Faulted" and "_Completed" taking event ID 6 to what ever. In the future we add our own event ID 6, but now the manifest will not be backward compatible with the previous manifest (remember ID 6 previously had meta data of the auto generated "_Faulted" or "_Completed" method). This can particularly be a problem with DB logging event consumers that use the meta data in the manifest for the schema of log tables.

jonwagner commented 10 years ago

The extra methods are there for situations when you are creating a tracing proxy. I'm pretty sure I can get rid of them when you're not using that mode. I've opened Issue #11 for that.

The only thing odd that I see in the manifest is the templates for the faulted methods:

<template tid="A_FaultedArgs">
    <data name="" inType="win:UnicodeString"/>
</template>

It may be that parsers are getting confused by the nameless parameters.

It is also possible that the extra methods are currently generating ID conflicts, and that is what is confusing the parsers.

Next steps:

  1. If you can send me the code to your custom ETW event consumer, I will use that for testing so we can make sure that we are working on the same problem. Email to: jonwagner at hotmail
  2. Try getting the code for ESP, and commenting out TypeImplementer.cs, lines 261-262. That's where it creates the faulted/completed methods. It will still work without them (although it won't trace those events for proxies, but you're probably not using that feature).
  3. I'll work on Issue #11.
jonwagner commented 10 years ago

The latest code on github has 2 and 3 fixed.

jonwagner commented 10 years ago

Good news! (I think).

a) I can reproduce the problem with your ETW consumer. b) The code currently in github (not yet in nuget) appears to fix the problem.

Now to track down the issue so I understand exactly what the issue is.

jonwagner commented 10 years ago

Very interesting...I'm using Reflector to step through Microsoft.Diagnostics.Tracing.Parsers.ProviderManifest. It's the code that goes through the manifest and generates the runtime data. It appears that the manifest correctly gets over to the ETW client, but gets garbled there. More to come...

mot256 commented 10 years ago

Thank you very much for your efforts to try find the cause of this issue. In the meantime I need to work on other pressing parts of our project. Please let me know if you need anything else.

jonwagner commented 10 years ago

Bad news: it's a bug in Microsoft.Diagnostics.Tracing.TraceEvent (IMHO).

I will explain:

  1. An ETW manifest may omit defining Tasks for events. (Of course, somehow, default EventSources always sneak them in, so you would probably never see them.)
  2. When ProviderManifest.ParseProviderEvents parses the manifest, if an event does not have a Task assigned, the event's GUID gets set to the GUID of the manifest. (I'm not sure why, but perhaps to trigger step 3).
  3. When TraceEventDispatcher.RegisterEventTemplateImpl looks at the template, if taskGuid != Guid.Empty, then lookupAsClassic is set to true.
  4. In TraceEventDispatcher.Insert, if lookupAsClassic==true && lookupAsWPP=false, then the EventID is set to the Opcode of the event. In our case, event 134 (an ESP generated event), gets set to Opcode.Stop (2). BTW - you already have an event with ID=2!
  5. Then TraceEventDispatcher.Insert puts the templates into a hashtable with a linked list for collisions. 134-renamed-to-2 & 2 get put in the same bucket.
  6. In TraceEventDispatcher.Lookup, when an event is looked up, it assumes a bucket actually corresponds to a particular event, and pulls back the entire linked list of events.
  7. TraceEventDispatcher.Dispatch calls the Action method of each event in the linked list without filtering on it by ID.
  8. Then TraceEventDispatcher.Dispatch calls EveryEvent on the head of the chain, but EveryEvent does not walk the chain.

I think the actual bug is that step #3 reassigns an ID to an opcode, which is highly likely to collide, and the rest of the code assumes there is no collision.

jonwagner commented 10 years ago

And here's the answer:

  1. The framework does not generate Task info if you specify an Opcode.

Then see steps 1-8 above.

Proof code:

The following will emit two events with "B" as the description.

[EventSource(Name = "SSE-SCS")]
public class ScsEventSource : EventSource
{
    [Event(134, Message = "B", Opcode = EventOpcode.Stop)]
    public void B()
    {
        WriteEvent(134);
    }

    [Event(2, Message = "A")]
    public void A()
    {
        WriteEvent(2);
    }
}

class Program
{
    static void Main(string[] args)
    {
        var log = new ScsEventSource();
        log.A();
        log.B();
    }
}

So it seems to have nothing to do with ESP, other than the fact that it generated additional events that collided.

My recommendation:

  1. Get a fix from Microsoft. I know someone in the P&P group, so I'll send him an email and perhaps that will expedite it. (Or use a different ETW client, but they are a pain to write, so I doubt there is another good option.)
  2. If you are going to use EventOpcodes, move all of your event IDs above 240. The ETW client seems to register one record under the ID and another under the opcode. Events are still looked up by ID, not opcode, so you can ignore the collision data.

There is nothing to fix in ESP at this point, but v2.0.1 (next build, will not generate extra methods if you derive your class from EventSource, see issue #11.)

It's 2AM now, so it's time for bed...

jonwagner commented 10 years ago

I think I have your answer. See github for the details.

p.s. my brain hurts.

From: Francois Nel [mailto:notifications@github.com] Sent: Friday, November 22, 2013 1:17 AM To: jonwagner/EventSourceProxy Cc: Jon Wagner Subject: Re: [EventSourceProxy] EventSourceProxy causes invalid trace output (#10)

Thank you very much for your efforts to try find the cause of this issue. In the meantime I need to work on other pressing parts of our project. Please let me know if you need anything else.

— Reply to this email directly or view it on GitHub https://github.com/jonwagner/EventSourceProxy/issues/10#issuecomment-29051150 .