open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0
3.12k stars 2.39k forks source link

Windows event_data format is difficult to consume #32952

Open strawgate opened 6 months ago

strawgate commented 6 months ago

Component(s)

receiver/windowseventlog

Describe the issue you're reporting

Prior to https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/28587 the format for event_data on windows events was easy to consume as it was a map:

    <EventData>
        <Data Name="ProcessId">7924</Data>
        <Data Name="Application">\device\harddiskvolume2\users\test\desktop\netcat\nc.exe</Data>
        <Data Name="SourceAddress">0.0.0.0</Data>
        <Data Name="SourcePort">5555</Data>
        <Data Name="Protocol">6</Data>
        <Data Name="FilterRTID">84614</Data>
        <Data Name="LayerName">%%14608</Data>
        <Data Name="LayerRTID">36</Data>
    </EventData>

would produce

  "event_data": {
    "ProcessId": "7924",
    "Application": "\\device\\harddiskvolume2\\users\test\\desktop\\netcat\\nc.exe",
    "SourceAddress": "0.0.0.0",
    "SourcePort": "5555",
    "Protocol": "6",
    "FilterRTID": "84614",
    "LayerName": "%%14608",
    "LayerRTID": "36"
  }

In order to support xml entries with no names, this PR https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/28587 switched the format to be an array of maps:

{
  "event_data": {
    "data": [
      {
        "ProcessId": "7924"
      },
      {
        "Application": "\\device\\harddiskvolume2\\users\test\\desktop\\netcat\\nc.exe"
      },
      {
        "SourceAddress": "0.0.0.0"
      },
      {
        "SourcePort": "5555"
      },
      {
        "Protocol": "6"
      },
      {
        "FilterRTID": "84614"
      },
      {
        "LayerName": "%%14608"
      },
      {
        "LayerRTID": "36"
      }
    ]
  }

Which seems to require a mergemaps(?) to be consumable via OTTL (though I'm not sure how that would work with multiple name-less entries).

We should consider handling the happy case (where event_data has a name and value) in a way that makes the data easy to consume.

Perhaps we could generate names (value1, value2, value3) for the name-less values to offer a consistent access experience. This would still allow whatever patterns are used today to identify the right event_data value to read while making it very easy to access by the order of the event_data.

github-actions[bot] commented 6 months ago

Pinging code owners:

strawgate commented 6 months ago

I actually was unable to figure out how to work with the event_data in ottl as merge_maps doesn't work and flatten produces .0 .1 fields. I don't see any ottl function for merging multiple arbitrary maps so I filed https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/32954

djaglowski commented 6 months ago

Thanks for reporting this @strawgate. I agree it's difficult to work with but it looks like you have identified a potential change to ottl which would help. Let us know if you have other ideas or requests to help with this.

strawgate commented 6 months ago

Yeah though having everyone have to merge maps to look at event data doesn't seem ideal, especially when there are potential a dozen plus properties with empty keys that won't merge well.

I think we should go back to creating a single map and come up with an incrementing identifier like param1 for those unnamed parameters.

I'd be happy to try to put together a pr

djaglowski commented 6 months ago

Or maybe we just separate named and unnamed parameters - a map for named, an array for unnamed? I don't know if there's appetite to switch it again but maybe @swiatekm-sumo @pjanotti have opinions too.

swiatekm commented 6 months ago

I have very little appetite for a breaking change here, but I wouldn't mind a configuration option to switch between two well-defined formats.

strawgate commented 6 months ago

I'm not sure how you would actually even use the structure that exists today.

Can you provide an example of how you can consume the data today?

As far as I can tell, you can't "find " keys or values that are in a map that's in an arbitrary array of maps so you're relying on pulling maps by the index in the array? Which I don't think is stable for named parameters but is stable for unnamed parameters.

It seems to me that it would be very hard for someone to consume as it is today as they would be taking a dependency on the order of the maps in the array?

In event_data XML these are largely accessed by value or by key and only sometimes by index. Having them in the map would provide a unified access pattern.

Perhaps the event_data unnamed keys being converted to array vs map should be a configuration?

pjanotti commented 6 months ago

Thinking back to https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/28587#discussion_r1370863850 I opted to go with maps to keep it compact, ie.: avoiding naming the element fields, but, if we were to follow the actual XML event definition we would need to respect that. So if we were to follow the XML specification it would have to be something like:

{
  "event_data": {
    "data": [
      {
        "name": "ProcessId",
        "value": "7924"
      },
      {
        "name": "Application",
        "value": "\\device\\harddiskvolume2\\users\test\\desktop\\netcat\\nc.exe"
      },
    ]
  }

it means that for the Data elements without name it will be something like:

{
  "event_data": {
    "data": [
      {
        "name": "",
        "value": "7924"
      },
      {
        "name": "",
        "value": "\\device\\harddiskvolume2\\users\test\\desktop\\netcat\\nc.exe"
      },
    ]
  }

The map is really convenient, but, it is not the actual data format. The events without name are the oldest format for Windows events, but, as shown in the comment above are still used.

I want to improve the usage for @strawgate, but, let's decide this carefully to avoid having to revisit it later. 🤔

strawgate commented 6 months ago

I went looking at how this works across other tools, it looks like there are 3 main methods utilized:

  1. Leave it as XML (Loki, Splunk(?), NXlog)
  2. Turn it into a dictionary with something like param1 (telegraf, elastic)
  3. Make it an array of maps with published index -> param (fluentd and fluentbit)

I think this may be a good enough argument on its own that there should be 3 modes and that perhaps leaving it as xml might make sense as the default.

The complaints around 3 (what is currently the default) is typically that it relies on the order of xml elements to produce the array of maps. Lots of event types can contain the same xml elements but in different orders.

A user who wants to pull say "TargetUserSid" using method 3 and work with it using OTTL will have to either:

  1. Use a future find_map that searches an array of maps for a map with a key or a map based on the value of a key
  2. Track the index of "TargetUserSid" across every possible event and every windows version, create some sort of switch statement to pull the right index from the array to get this field
pjanotti commented 6 months ago

One idea regarding a configuration option: a versioned data struct option seems a good choice in this case. It allows users to stay on a desired format while allowing possible future breaking changes. Each version represents relatively a small piece of code and having a handful doesn't seem a problem and we can always drop some less popular options later.

strawgate commented 6 months ago

One idea regarding a configuration option: a versioned data struct option seems a good choice in this case. It allows users to stay on a desired format while allowing possible future breaking changes. Each version represents relatively a small piece of code and having a handful doesn't seem a problem and we can always drop some less popular options later.

Are you aware of any examples of this implemented across any component that I can stare at for a while?

I may start putting together a PR for how this might look, so any guidance would be lovely.

pjanotti commented 6 months ago

@strawgate sorry, I don't know one from the top of my mind. @djaglowski may know some component doing that.

strawgate commented 6 months ago

@swiatekm-sumo do you view the introduction of the current behavior as a setting the user would have to explicitly pick as acceptable re: breaking change? As long as the actual event_data can come in the previous format with that setting?

github-actions[bot] commented 4 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

strawgate commented 2 months ago

I'm still a bit stuck on this as using the current format requires taking a dependency on an unstable feature of the event_data format (the order of the values with no keys).

I would like to address this if possible via a configuration flag that determines how event_data is "map"ped. Are there any objections to me putting together a PR that offers the alternative format?

pjanotti commented 2 months ago

@strawgate making a PR will, at minimum, move the discussion ahead, however, there is the obvious risk of not being accepted. If you think the risk is too high for your time investment, let me (re)review this item before you embark on that.

djaglowski commented 1 month ago

I went looking at how this works across other tools, it looks like there are 3 main methods utilized:

  1. Leave it as XML (Loki, Splunk(?), NXlog)
  2. Turn it into a dictionary with something like param1 (telegraf, elastic)
  3. Make it an array of maps with published index -> param (fluentd and fluentbit)

I think this may be a good enough argument on its own that there should be 3 modes and that perhaps leaving it as xml might make sense as the default.

The complaints around 3 (what is currently the default) is typically that it relies on the order of xml elements to produce the array of maps. Lots of event types can contain the same xml elements but in different orders.

With https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/34720 merged, I think users will now have a simple way to opt into 1. (This was sort of true before, but the raw flag also controlled whether we fully rendered the event, so the event data may not have been present in the XML.)

I want to offer a 4th option, after having spent some time on https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/35281

  1. Convert EventData into the a map, where elements with a Name attributes are key/value pairs, and elements without a Name attributes are items in a list. The list can be a value in the map with a fixed key, e.g. Unnamed

Starting from the original example, say we have a few unnamed elements in the result:

    <EventData>
        <Data Name="ProcessId">7924</Data>
        <Data Name="Application">\device\harddiskvolume2\users\test\desktop\netcat\nc.exe</Data>
        <Data Name="SourceAddress">0.0.0.0</Data>
        <Data Name="SourcePort">5555</Data>
        <Data Name="Protocol">6</Data>
        <Data Name="FilterRTID">84614</Data>
        <Data Name="LayerName">%%14608</Data>
        <Data Name="LayerRTID">36</Data>
        <Data>val 1</Data>
        <Data>val 2</Data>
    </EventData>

The output would be:

  "event_data": {
    "ProcessId": "7924",
    "Application": "\\device\\harddiskvolume2\\users\test\\desktop\\netcat\\nc.exe",
    "SourceAddress": "0.0.0.0",
    "SourcePort": "5555",
    "Protocol": "6",
    "FilterRTID": "84614",
    "LayerName": "%%14608",
    "LayerRTID": "36"
    "Unnamed": [ "val 1", "val 2" ]
  }

We may need to support the old format (3) for backwards compatibility, but I'm wondering if 4 would be superior to 2.

pjanotti commented 1 month ago

One observation @djaglowski, there shouldn't be a mix of named and unnamed entries in the same event. That is related to the code used to generate the actual event. So when doing option 4, if possible, the map can be created with the generated names, perhaps matching the legacy event format, IIRC, %1, %2, ...

strawgate commented 1 month ago

This matches my understanding too that the unnamed ones should be from XP/2003 events.

@pjanotti by generated names do you mean something like param1, param2, param3? Isn't that option 2?

djaglowski commented 1 month ago

there shouldn't be a mix of named and unnamed entries in the same event

Given that, I would be in favor of adding a config flag, e.g. interpret_event_data, which would automatically detect which format is used, and convert into either a map (if names are present) or array otherwise. By making this a config flag, we are not breaking the current behavior again, but users can opt into the improved behavior if they are comfortable with the fact that events from multiple sources may have slightly different schema.

pjanotti commented 1 month ago

Sorry, for the delay...

@strawgate

@pjanotti by generated names do you mean something like param1, param2, param3? Isn't that option 2?

Yes, correct - just using the place holder of the message in the old format.

@djaglowski just to confirm if I'm understanding your proposal correct: if interpret_event_data is false we keep the current format. If true then the code will create a map. If the "name" attribute is empty the keys on the map will be %1, %2, and so on. If the "name" attribute is not empty we will use that as the key in the map.

djaglowski commented 1 month ago

if interpret_event_data is false we keep the current format.

If true then the code will create a map.

I was thinking it would be a map or a slice, depending on whether the name attribute is present.

If the "name" attribute is empty the keys on the map will be %1, %2, and so on.

This is where I would have said the result is a slice containing only the values.

If the "name" attribute is not empty we will use that as the key in the map.


That said, I'm not opposed to having a map w/ some kind of numbered keys. The benefit of this would be that the result is structurally for all events. I think you should make this decision, and the rest we seem to be on the same page about already.

pjanotti commented 1 month ago

@djaglowski I do prefer that we always have a map, this should simplify any ottl processing.

@strawgate any preferences?

strawgate commented 1 month ago

Consistent use of a map seems ideal to me

djaglowski commented 1 month ago

I'm good with using a map consistently.