log2timeline / plaso

Super timeline all the things
https://plaso.readthedocs.io
Apache License 2.0

add parser for Windows Event Log XML output #442

Open joachimmetz opened 8 years ago

joachimmetz commented 8 years ago

Note that Windows Event Log XML output (as exported by the Windows EventViewer) is not necessarily proper XML. Also see: https://github.com/dfirlabs/evtx-specimens and https://github.com/log2timeline/plaso/issues/3595

joachimmetz commented 5 years ago

A reason to prioritize this https://blog.fox-it.com/2019/06/04/export-corrupts-windows-event-log-files/ This was solved in another way.

joachimmetz commented 3 years ago

Since "Windows Event Log XML output" is not always proper XML (see https://github.com/dfirlabs/evtx-specimens), it might be better to have libevtx/pyevtx expose string/value names

joachimmetz commented 3 years ago

Cannot parse "named values" from the XML since value names are not unique; see the example below

<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-Ntfs" Guid="{3FF37A1C-A68D-4D6E-8C9B-F79E8B16C482}"/>
    <EventID>146</EventID>
    <Version>1</Version>
    <Level>4</Level>
    <Task>0</Task>
    <Opcode>0</Opcode>
    <Keywords>0x4000000000200000</Keywords>
    <TimeCreated SystemTime="2018-09-05T01:51:32.107675800Z"/>
    <EventRecordID>3291</EventRecordID>
    <Correlation/>
    <Execution ProcessID="4" ThreadID="15136"/>
    <Channel>Microsoft-Windows-Ntfs/Operational</Channel>
    <Computer>hostname</Computer>
    <Security UserID="S-1-5-18"/>
  </System>
  <EventData>
    <Data Name="VolumeCorrelationId">{4994EA25-0000-0000-0000-501F00000000}</Data>
    <Data Name="VolumeNameLength">2</Data>
    <Data Name="VolumeName">C:</Data>
    <Data Name="IsBootVolume">true</Data>
    <Data Name="HighIoLatencyCount">0</Data>
    <Data Name="IntervalDurationUs">3750962784</Data>
    <Data Name="NCReadIOCount">206</Data>
    <Data Name="NCReadTotalBytes">9112304</Data>
    <Data Name="NCReadAvgLatencyNs">11765267</Data>
    <Data Name="NCWriteIOCount">2858</Data>
    <Data Name="NCWriteTotalBytes">36175488</Data>
    <Data Name="NCWriteAvgLatencyNs">4904584</Data>
    <Data Name="FileFlushCount">560</Data>
    <Data Name="FileFlushAvgLatencyNs">14539870</Data>
    <Data Name="VolumeFlushCount">0</Data>
    <Data Name="VolumeFlushAvgLatencyNs">0</Data>
    <Data Name="FileLevelTrimCount">0</Data>
    <Data Name="FileLevelTrimTotalBytes">0</Data>
    <Data Name="FileLevelTrimExtentsCount">0</Data>
    <Data Name="FileLevelTrimAvgLatencyNs">0</Data>
    <Data Name="VolumeTrimCount">38</Data>
    <Data Name="VolumeTrimTotalBytes">8957952</Data>
    <Data Name="VolumeTrimExtentsCount">303</Data>
    <Data Name="VolumeTrimAvgLatencyNs">53391</Data>
    <Data Name="IoBucketsCount">44</Data>
    <Data Name="TotalBytesBucketsCount">40</Data>
    <Data Name="ExtentsBucketsCount">1</Data>
    <Data Name="IoCount">59</Data>
    <Data Name="IoCount">21</Data>
    <Data Name="IoCount">19</Data>
    <Data Name="IoCount">1</Data>
    <Data Name="IoCount">29</Data>
    <Data Name="IoCount">9</Data>
    <Data Name="IoCount">4</Data>
    <Data Name="IoCount">2</Data>
    <Data Name="IoCount">31</Data>
    <Data Name="IoCount">477</Data>
    <Data Name="IoCount">10</Data>
    <Data Name="IoCount">6</Data>
    <Data Name="IoCount">243</Data>
    <Data Name="IoCount">24</Data>
    <Data Name="IoCount">29</Data>
    <Data Name="IoCount">2169</Data>
    <Data Name="IoCount">167</Data>
    <Data Name="IoCount">60</Data>
    <Data Name="IoCount">10</Data>
    <Data Name="IoCount">255</Data>
    <Data Name="IoCount">24</Data>
    <Data Name="IoCount">29</Data>
    <Data Name="IoCount">4</Data>
    <Data Name="IoCount">1</Data>
    <Data Name="IoCount">10</Data>
    <Data Name="IoCount">143</Data>
    <Data Name="IoCount">16</Data>
    <Data Name="IoCount">761</Data>
    <Data Name="IoCount">8</Data>
    <Data Name="IoCount">4</Data>
    <Data Name="IoCount">3</Data>
    <Data Name="IoCount">63</Data>
    <Data Name="IoCount">43</Data>
    <Data Name="IoCount">12</Data>
    <Data Name="IoCount">1</Data>
    <Data Name="IoCount">51</Data>
    <Data Name="IoCount">12</Data>
    <Data Name="IoCount">6</Data>
    <Data Name="IoCount">3</Data>
    <Data Name="IoCount">38</Data>
    <Data Name="IoCount">364</Data>
    <Data Name="IoCount">161</Data>
    <Data Name="IoCount">22</Data>
    <Data Name="IoCount">13</Data>
    <Data Name="TotalLatencyUs">151222351</Data>
    <Data Name="TotalLatencyUs">252569191</Data>
    <Data Name="TotalLatencyUs">939055921</Data>
    <Data Name="TotalLatencyUs">158486373</Data>
    <Data Name="TotalLatencyUs">66212962</Data>
    <Data Name="TotalLatencyUs">100870294</Data>
    <Data Name="TotalLatencyUs">211584800</Data>
    <Data Name="TotalLatencyUs">412243752</Data>
    <Data Name="TotalLatencyUs">1024939</Data>
    <Data Name="TotalLatencyUs">122116410</Data>
    <Data Name="TotalLatencyUs">173990480</Data>
    <Data Name="TotalLatencyUs">376778004</Data>
    <Data Name="TotalLatencyUs">346149560</Data>
    <Data Name="TotalLatencyUs">293983207</Data>
    <Data Name="TotalLatencyUs">1744641916</Data>
    <Data Name="TotalLatencyUs">2430678982</Data>
    <Data Name="TotalLatencyUs">2125350566</Data>
    <Data Name="TotalLatencyUs">3323127762</Data>
    <Data Name="TotalLatencyUs">1492911444</Data>
    <Data Name="TotalLatencyUs">328400556</Data>
    <Data Name="TotalLatencyUs">282485016</Data>
    <Data Name="TotalLatencyUs">1742101317</Data>
    <Data Name="TotalLatencyUs">3490269</Data>
    <Data Name="TotalLatencyUs">6073390</Data>
    <Data Name="TotalLatencyUs">15018274</Data>
    <Data Name="TotalLatencyUs">13603300</Data>
    <Data Name="TotalLatencyUs">2411815106</Data>
    <Data Name="TotalLatencyUs">51939076</Data>
    <Data Name="TotalLatencyUs">93602851</Data>
    <Data Name="TotalLatencyUs">294483214</Data>
    <Data Name="TotalLatencyUs">336671922</Data>
    <Data Name="TotalLatencyUs">171943533</Data>
    <Data Name="TotalLatencyUs">496449612</Data>
    <Data Name="TotalLatencyUs">582219031</Data>
    <Data Name="TotalLatencyUs">112693514</Data>
    <Data Name="TotalLatencyUs">132080857</Data>
    <Data Name="TotalLatencyUs">133962601</Data>
    <Data Name="TotalLatencyUs">436899980</Data>
    <Data Name="TotalLatencyUs">332814102</Data>
    <Data Name="TotalLatencyUs">2028862</Data>
    <Data Name="TotalLatencyUs">610403162</Data>
    <Data Name="TotalLatencyUs">3985570197</Data>
    <Data Name="TotalLatencyUs">1473026538</Data>
    <Data Name="TotalLatencyUs">2073327422</Data>
    <Data Name="TotalBytes">2813952</Data>
    <Data Name="TotalBytes">884736</Data>
    <Data Name="TotalBytes">1449984</Data>
    <Data Name="TotalBytes">20480</Data>
    <Data Name="TotalBytes">3085312</Data>
    <Data Name="TotalBytes">770048</Data>
    <Data Name="TotalBytes">708608</Data>
    <Data Name="TotalBytes">569344</Data>
    <Data Name="TotalBytes">72056</Data>
    <Data Name="TotalBytes">1951263</Data>
    <Data Name="TotalBytes">34331</Data>
    <Data Name="TotalBytes">27388</Data>
    <Data Name="TotalBytes">1638024</Data>
    <Data Name="TotalBytes">221184</Data>
    <Data Name="TotalBytes">192512</Data>
    <Data Name="TotalBytes">19996362</Data>
    <Data Name="TotalBytes">2463478</Data>
    <Data Name="TotalBytes">1144376</Data>
    <Data Name="TotalBytes">167936</Data>
    <Data Name="TotalBytes">1687176</Data>
    <Data Name="TotalBytes">221184</Data>
    <Data Name="TotalBytes">192512</Data>
    <Data Name="TotalBytes">200704</Data>
    <Data Name="TotalBytes">65536</Data>
    <Data Name="TotalBytes">68608</Data>
    <Data Name="TotalBytes">6077765</Data>
    <Data Name="TotalBytes">262144</Data>
    <Data Name="TotalBytes">20814515</Data>
    <Data Name="TotalBytes">592896</Data>
    <Data Name="TotalBytes">379904</Data>
    <Data Name="TotalBytes">352256</Data>
    <Data Name="TotalBytes">820224</Data>
    <Data Name="TotalBytes">3174400</Data>
    <Data Name="TotalBytes">229376</Data>
    <Data Name="TotalBytes">122880</Data>
    <Data Name="TotalBytes">2166976</Data>
    <Data Name="TotalBytes">1638400</Data>
    <Data Name="TotalBytes">428592</Data>
    <Data Name="TotalBytes">196608</Data>
    <Data Name="TotalBytes">8957952</Data>
    <Data Name="TrimExtentsCount">303</Data>
    <Data Name="IoTypeIndex">0</Data>
    <Data Name="IoTypeIndex">1</Data>
    <Data Name="IoTypeIndex">2</Data>
    <Data Name="IoTypeIndex">3</Data>
    <Data Name="IoTypeIndex">4</Data>
    <Data Name="IoTypeIndex">5</Data>
    <Data Name="IoTypeIndex">6</Data>
    <Data Name="IoTypeIndex">7</Data>
    <Data Name="IoTypeIndex">16</Data>
    <Data Name="IoTypeIndex">20</Data>
    <Data Name="IoTypeIndex">21</Data>
    <Data Name="IoTypeIndex">22</Data>
    <Data Name="IoTypeIndex">28</Data>
    <Data Name="IoTypeIndex">29</Data>
    <Data Name="IoTypeIndex">30</Data>
    <Data Name="IoTypeIndex">36</Data>
    <Data Name="IoTypeIndex">37</Data>
    <Data Name="IoTypeIndex">38</Data>
    <Data Name="IoTypeIndex">39</Data>
    <Data Name="IoTypeIndex">44</Data>
    <Data Name="IoTypeIndex">45</Data>
    <Data Name="IoTypeIndex">46</Data>
    <Data Name="IoTypeIndex">48</Data>
    <Data Name="IoTypeIndex">49</Data>
    <Data Name="IoTypeIndex">52</Data>
    <Data Name="IoTypeIndex">56</Data>
    <Data Name="IoTypeIndex">59</Data>
    <Data Name="IoTypeIndex">60</Data>
    <Data Name="IoTypeIndex">61</Data>
    <Data Name="IoTypeIndex">62</Data>
    <Data Name="IoTypeIndex">63</Data>
    <Data Name="IoTypeIndex">64</Data>
    <Data Name="IoTypeIndex">65</Data>
    <Data Name="IoTypeIndex">66</Data>
    <Data Name="IoTypeIndex">67</Data>
    <Data Name="IoTypeIndex">68</Data>
    <Data Name="IoTypeIndex">69</Data>
    <Data Name="IoTypeIndex">70</Data>
    <Data Name="IoTypeIndex">71</Data>
    <Data Name="IoTypeIndex">0</Data>
    <Data Name="IoTypeIndex">4</Data>
    <Data Name="IoTypeIndex">5</Data>
    <Data Name="IoTypeIndex">6</Data>
    <Data Name="IoTypeIndex">7</Data>
  </EventData>
</Event>

Likely would need the WEVT templates here
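To make the non-uniqueness problem concrete, here is a minimal Python sketch. It uses a simplified, well-formed fragment purely for illustration (as noted above, real EventViewer exports may not parse cleanly as XML) and shows how duplicate Data names silently collapse when used as dictionary keys:

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Simplified, well-formed fragment for illustration only; real EventViewer
# exports may not parse cleanly as XML.
EVENT_DATA = """<EventData>
  <Data Name="IoCount">59</Data>
  <Data Name="IoCount">21</Data>
  <Data Name="VolumeName">C:</Data>
</EventData>"""

root = ET.fromstring(EVENT_DATA)
values = [(d.get("Name"), d.text) for d in root.findall("Data")]

# Three Data elements, two of which share the name "IoCount":
counts = Counter(name for name, _ in values)

# Keying a dict on Name silently drops the earlier duplicate; only the
# last "IoCount" value survives, losing data without any warning.
as_dict = dict(values)
```

This is why the name alone cannot serve as a key: without something like the WEVT templates there is no reliable way to tell which occurrence is which.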

rj-chap commented 2 years ago

When dealing with "named values" that are not unique, has the team thought about enumeration of field names (i.e. creating IoTypeIndex_1, IoTypeIndex_2, etc.) or putting multi-values into an object (a non-flat version of handling the data)? Another option would be to extract/create a single field with a constant that basically states, "many values for this field, see xml_string" or something similar. Or perhaps a combination of the above methods.

Am I correct in thinking that going the "WEVT templates" route would thus require a template for every possible event? If so would the team be amenable to the community providing (via survey or other) the top X events that they'd like to have parsed to begin the fun?
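The suffix-enumeration idea above could be sketched roughly as follows. This is purely illustrative (plaso does not do this), and note that suffixing renames the fields, i.e. it alters the data, which carries its own challenges:

```python
from collections import defaultdict

def enumerate_field_names(pairs):
    # Illustrative sketch only: append a running index to every occurrence
    # of a name so repeated names become unique dictionary keys.
    # This alters the original field names, which has forensic implications.
    seen = defaultdict(int)
    result = {}
    for name, value in pairs:
        seen[name] += 1
        result[f"{name}_{seen[name]}"] = value
    return result
```

For example, enumerate_field_names([("IoTypeIndex", "0"), ("IoTypeIndex", "1")]) would yield keys IoTypeIndex_1 and IoTypeIndex_2.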

joachimmetz commented 2 years ago

? If so would the team be amenable to the community providing (via survey or other) the top X events that they'd like to have parsed to begin the fun?

Who is "the team"? What "community"? What are "top X events", and for what purpose? Who is going to do the survey, you?

This is an open source project; feel free to contribute.

putting multi-values into an object (a non-flat version of handling the data)?

These are strings; this is the most reliable way.

This does not address that [From https://osdfir.blogspot.com/2021/10/common-misconceptions-about-windows.html]:

How the other event strings should be interpreted in such cases based on the EventLog record, without additional facts, is pure speculation from a digital forensics point-of-view.

More research is needed into what each of these "values" means in the context of a specific version of an event log provider.

Another option would be to extract/create a single field with a constant that basically states, "many values for this field, see xml_string" or something similar.

Unclear to me what you mean, please elaborate

Am I correct in thinking that going the "WEVT templates" route would thus require a template for every possible event?

No, not every event has a corresponding WEVT_TEMPLATE resource

rj-chap commented 2 years ago

Thanks for the feedback @joachimmetz. I'll tap out.

joachimmetz commented 2 years ago

I'll tap out.

@rj-chap can you be more clear and explain your expressions. This is an international audience, not native English speakers, and I don't assume you mean https://www.urbandictionary.com/define.php?term=tap%20out

rj-chap commented 2 years ago

@rj-chap can you be more clear and explain your expressions. This is an international audience, not native English speakers, and I don't assume you mean https://www.urbandictionary.com/define.php?term=tap%20out

This is now my favorite GitHub issue ever. 100%.

I just meant I would leave the conversation and watch the development unfold. But we've gone this far, might as well keep going! No way I can resist after a fantastic response such as that.

These are strings; this is the most reliable way. This does not address that [From https://osdfir.blogspot.com/2021/10/common-misconceptions-about-windows.html]:

How the other event strings should be interpreted in such cases based on the EventLog record, without additional facts, is pure speculation from a digital forensics point-of-view.

As you note in your linked article, correlation between the values (or params) and their location(s) within (or even association to) event message strings is often convoluted. Personally, my focus is on raw name:value extraction. For example, in the following:

<Data Name="param1">Volume Shadow Copy</Data>
<Data Name="param2">stopped</Data>

I'd love to see param1:Volume Shadow Copy and param2:stopped extracted.

In your linked reference, I would like to see the raw field data, such as:

param1:ScRegSetValueExW param2:FailureActions param3:%%5

Correlation of values such as %%5 can be done by the analyst, say via correlation to the data set at hand. At the very least, pulling these raw values seems like a solid way to provide additional parsed context for the time-being while a more thought-out process could be developed to ensure no values such as %%5 make their way into the logs.

Your thoughts?

joachimmetz commented 2 years ago

As you note in your linked article, correlation between the values (or params) and their location(s) within (or even association to) event message strings is often convoluted.

This comment relates to the event strings (information extracted from e.g. EventData) and their application in the event message. This is not only convoluted, it might be non-existent; additional information is necessary to indicate this. The event strings used in the message string at least have some context to indicate how they should be used.

<EventData>
  <Data Name="param1">Volume Shadow Copy</Data>
  <Data Name="param2">stopped</Data>
</EventData>

That is only one form in which event strings can be specified; also see: https://github.com/libyal/libevtx/blob/main/documentation/Windows%20XML%20Event%20Log%20(EVTX).asciidoc#event-data

In your linked reference, I would like to see the raw field data, such as:

There is nothing in the EventXML that makes param1 unique. The EventXML could very well be:

<Data Name="param1">stopped</Data>
<Data Name="param1">Volume Shadow Copy</Data>

Since they are not unique they cannot be used as keys; see one of the previous examples. Suffixing can be a work-around, but that would mean we're altering data, which has its own set of challenges.

Correlation of values such as %%5 can be done by the analyst, say via correlation to the data set at hand.

The %%5 referenced values are completely different/separate from the EventXML EventData attributes.

Also, what do you mean by "correlation" in this context? This is just a form of "parameter substitution".

At the very least, pulling these raw values seems like a solid way to provide additional parsed context for the time-being while a more thought-out process could be developed to ensure no values such as %%5 make their way into the logs.

It is unclear to me what problem you are trying to solve or what point you are trying to make. What is the underlying analysis method you are trying to improve? Where is that documented?

rj-chap commented 2 years ago

It is unclear to me what problem you are trying to solve or what point you are trying to make. What is the underlying analysis method you are trying to improve? Where is that documented?

I'd like to begin by stating that I truly appreciate plaso along with all of the work that has gone into the project. I have nothing but respect for everyone who has devoted their time and passion to the project. As such I am not trying to prove any point or indicate that any of my suggestions are "right" or "correct." For that matter I do not intend to cause any consternation on your behalf.

What I am trying to do is provide potential ways to assist the project in the ways that I can. I am not a developer. I am a DFIR analyst. The analysis method that I am attempting to improve is the ability for an analyst to query the data set generated by the tool.

I'm looking at the ingestion of plaso-generated data into a log aggregation and/or SIEM tool for bulk analysis. A simple example being plaso data pushed into Elastic to be analyzed via Kibana or TimeSketch. Whether an analyst uses this bulk analysis approach or simply intends to analyze plaso-generated data within a CSV locally, the goal is to provide the ability to identify data quickly and efficiently.

As-is, analysts can identify Event IDs (event_identifier) easily, which is fantastic. Host information (computer_name), timestamps (datetime), general extracted strings (strings), and more are available. Super useful. However, the current parser does not yet extract certain named values that are critical for analysis.

This is where my ideas in this and my initial Issue (#3988) come into play. For a simple event such as a Microsoft-Windows-Security-Auditing 4625 event, it would be phenomenal to have fields such as TargetUserName, TargetDomainName, etc. parsed out for analysis. Many of the event IDs under the Security provider (e.g. 4720 account creation events) have name fields in the format of:

<Data Name="TargetUserName">TotallyNotADAAccount</Data>
<Data Name="TargetDomainName">SAMARAN</Data>

For situations such as these, it would be absolutely amazing to have these values parsed by plaso's EVTX parser. Right now, the xml_string field must be queried manually to identify this information. In tools like ELK and Splunk, this is done via regex searching through the xml_string field. The requirement of regex'ing through a large field adds a ton of overhead and complexity to the queries required to do something as simple as identify a given account with failed logon attempts.

I think the easiest way to re-frame what I've been suggesting up to now would be to ask whether the EVTX parser could be updated to identify and extract these well-named potential fields for now, while additional research is conducted to find a solid way to parse all event log data in a more robust fashion.

joachimmetz commented 2 years ago

The requirement of regex'ing through a large field adds a ton of overhead and complexity to the queries required to do something as simple as identify a given account with failed logon attempts.

Parsing EventXML as XML is not an option, since it is not proper XML. So basically you're asking to move a complex and maintenance-heavy solution into Plaso.

A more robust way is to just map the strings to predefined fields in ELK based on the message identifier / event provider version.
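The mapping approach could look roughly like this; the (provider, event identifier, version) keys and field lists below are illustrative placeholders, not an authoritative schema, and a real mapping would have to be researched per event provider version:

```python
# Hypothetical mapping from (provider, event identifier, version) to the
# field names of the positional event strings. Entries are placeholders.
FIELD_MAP = {
    ("Example-Provider", 1234, 1): ["TargetUserName", "TargetDomainName"],
}

def map_event_strings(provider, event_id, version, strings):
    names = FIELD_MAP.get((provider, event_id, version))
    if names is None or len(names) != len(strings):
        # No reliable mapping known: fall back to the positional strings.
        return {"strings": list(strings)}
    return dict(zip(names, strings))
```

The same lookup could be implemented as an ELK ingest pipeline instead of in the extraction tool, which is the point being made here.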

rj-chap commented 2 years ago

Parsing EventXML as XML is not an option, since it is not proper XML. So basically you're asking to move a complex and maintenance-heavy solution into Plaso.

Ah, understood. My initial feedback was aimed at discussing ways to handle the more difficult situations, but I'm obviously out of my element. Moving forward in our conversation, I ditched the idea of dealing with the more difficult scenarios and figured a quick fix to provide some additional context would work. I was thinking of looping through each line in the xml_string and regex'ing out the field/values toward the end of _GetEventDataFromRecord(). If a repeat name is found, remove previous and ignore. Would avoid parsing non-unique names.

Abstracting the idea in bash:

<EventData>
 <Data Name="TargetUserName">legitaccount</Data>
 <Data Name="TargetDomainName">redacted</Data>
 <Data Name="TargetSid">S-1-5-21-redacted</Data>
 <Data Name="SubjectUserSid">S-1-5-18</Data>
 <Data Name="SubjectUserName">redacted$</Data>
 <Data Name="SubjectDomainName">redacted</Data>
 <Data Name="SubjectLogonId">0x00000000000003e7</Data>
 <Data Name="PrivilegeList">-</Data>
 <Data Name="SamAccountName">legitaccount</Data>
 <Data Name="DisplayName">%%1793</Data>
...snip...
 <Data Name="UserAccountControl"> %%2080 %%2082 %%2084</Data>
 <Data Name="UserParameters">%%1793</Data>
 <Data Name="SidHistory">-</Data>
 <Data Name="LogonHours">%%1797</Data>
</EventData>
cat event.txt | grep '<Data Name=' | perl -pe 's/<Data Name="(.+?)">(.+)<\/Data>/$1:$2/g'
 TargetUserName:legitaccount
 TargetDomainName:redacted
 TargetSid:S-1-5-21-redacted
 SubjectUserSid:S-1-5-18
 SubjectUserName:redacted$
 SubjectDomainName:redacted
 SubjectLogonId:0x00000000000003e7
 PrivilegeList:-
 SamAccountName:legitaccount
 DisplayName:%%1793
...snip...
 UserAccountControl: %%2080 %%2082 %%2084
 UserParameters:%%1793
 SidHistory:-
 LogonHours:%%1797

On paper, the above seems like a quick fix to provide some additional context while a more solid solution is researched & built. It's easy for us non-developers to think things are "easy" when they are not. I think overall I'm looking for shortcuts that, to a legitimate developer such as yourself, only add maintenance and disorganization. It's not just a loop to identify possible Data Names and values... it's creating a new list to hold the dictionaries, appending to the list when a line matches the expression, looping through the array to unfold the items before returning, updating everything else to deal with additional fields that might be returned, and all the other things I can't think of because I'm not a dev.
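The "remove previous and ignore" loop described above could be sketched in Python roughly as follows. This is hypothetical, not plaso code, and the regex carries the same fragility discussed throughout this thread:

```python
import re

# Hypothetical sketch (not plaso code): regex name/value pairs out of an
# xml_string and drop any name that occurs more than once, so non-unique
# names are never mis-keyed. Data for repeated names is deliberately lost.
_DATA_RE = re.compile(r'<Data Name="([^"]+)">([^<]*)</Data>')

def extract_unique_fields(xml_string):
    fields = {}
    repeated = set()
    for name, value in _DATA_RE.findall(xml_string):
        if name in repeated:
            continue
        if name in fields:
            # Repeat found: remove the previous value and ignore the name.
            del fields[name]
            repeated.add(name)
            continue
        fields[name] = value
    return fields
```

For an event with many repeated names (such as the IoCount example earlier in the thread), this would return only the fields whose names appear exactly once.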

Thanks for all the feedback Joachim. The last thing I'd want to do is introduce more headache to the project.

joachimmetz commented 2 years ago

My initial feedback was aimed at discussing ways to handle the more difficult situations, but I'm obviously out of my element.

What do you mean with "more difficult situations"? Data format edge cases?

A good general first step (as a developer, analyst, or whatever hat you wear) is to understand the data and its edge cases. See https://osdfir.blogspot.com/2020/09/testing-digital-forensic-data.html for a more detailed write-up.

If a repeat name is found, remove previous and ignore.

If you do that in your analysis scripts and you're fully aware of this limitation, that could be acceptable for the case at hand. IMHO it is not a method that is applicable for a tool like Plaso, which is used for many different cases.

It's easy for us non-developers to think things are "easy" when they are not.

This has nothing to do with "developers versus non-developer". The fact that someone can code does not (necessarily) make them a developer. The fact that someone can analyze does not (necessarily) make them an analyst.

This is confirmation bias (or tunnel vision) and a very concerning development in the DFIR field.

On paper, the above seems like a quick fix to provide some additional context while a more solid solution is researched & built.

IMHO having a "researched solution" is a prerequisite for a method to be considered forensics in the first place. If one cannot reason about the method and the findings it produces in a discrete and transparent way, it is not a method that should be used in a forensics context.

rj-chap commented 2 years ago

If you do that in your analysis scripts and you're fully aware of this limitation, that could be acceptable for the case at hand. IMHO it is not a method that is applicable for a tool like Plaso, which is used for many different cases.

The case at hand is working an incident. As for this or similar methods' place in plaso, totally understood. This is why entities who perform IR at scale have parsers built on top of plaso or build their own tool outright for parsing. I was attempting to bridge the gap in these cases, as plaso makes things so much simpler for many folks around the world.

IMHO having a "researched solution" is a prerequisite for a method to be considered forensics in the first place. If one cannot reason about the method and the findings it produces in a discrete and transparent way, it is not a method that should be used in a forensics context.

I think overall we differ in opinion on methods used to perform incident response. My general notes are not "forensic" in nature, as you note. But attempting to review every single xml_string field while analyzing hundreds or thousands of hosts is ridiculous. Folks using Splunk can at least eval their own fields out. For the rest of us, especially those relying on ELK, we're stuck pretty hard here. I have a ton of respect for plaso. Was just trying to feed some ideas to facilitate a usable solution. I can see I'm in the wrong place. Thanks for your time.

joachimmetz commented 2 years ago

But attempting to review every single xml_string field while analyzing hundreds or thousands of hosts is ridiculous.

As I said "a more robust way is to just map the strings to predefined fields in ELK based on the message identifier / event provider version."

Again, why are you focused (tunneled/scoped) on the xml_string? Insisting that the xml_string is the only way to scale this problem is a "broken record". It is not.

Think outside the box: you don't need to look at the xml_string to do things at scale. Have a read of https://osdfir.blogspot.com/2021/10/pearls-and-pitfalls-of-timeline-analysis.html.

This is why entities who perform IR at scale have parsers built on top of plaso or build their own tool outright for parsing.

There will always be custom needs, but AFAIK "these entities" have not contributed to this project in the form of a PR. There are a lot of tools built on top of other DFIR FOSS tools as well, with very little contribution back to those projects. This is mostly a different problem.

I was attempting to bridge the gap in these cases, as plaso makes things so much simpler for many folks around the world.

Then the best thing you can do is to research this gap. Do not just dump an idea and expect it to be implemented. Do the leg work, help out, contribute, build this "community" you are dreaming of. But do it the right way, not as yet another broken analysis methodology based on assumptions.

I think overall we differ in opinion on methods used to perform incident response.

Even in IR these principles apply: if you cannot sufficiently rely on your methodologies, you draw the wrong conclusions, e.g. missing a secondary backdoor, missing a compromised host, exposing system privileges while collecting data, not knowing whether your mitigation is actually sufficient, etc., especially at scale, independent of technology. Or you create a huge number of (false) positives, wearing down your analysts. However, this is more a separate, holistic conversation than one about a single feature.