Closed forensicmatt closed 4 years ago
@forensicmatt I've actually noticed this behavior, but didn't have time to investigate it. I'm afraid this fix will actually damage another functionallity I've added in a previous version - https://github.com/omerbenamram/evtx/blob/master/CHANGELOG.md#060---2019-11-26
Where the XML can contain nodes with the same name - but this doesn't map to JSON.
The code I've used for the fix works correctly, but will produce these empty values only when using the seperate-json-attributes
flag - since I'm kind of relying on the empty value to know that I need to insert a new value.
If you could first add a (failing?) test that demonstrates this use case (based on this xml - https://github.com/omerbenamram/evtx/blob/master/samples/event_with_entity_ref.xml), we could try taking it from there.
Instead of
{
"HTTPResponseHeadersInfo": {
"Header": "x-ms-blob-type: BlockBlob",
"Header_1": "HTTP/1.1 200 OK",
"Header_10": "x-ms-version: 2009-09-19",
"Header_11": "x-ms-lease-status: unlocked",
"Header_2": "Connection: keep-alive",
"Header_3": "Date: Thu, 18 May 2017 11:37:58 GMT",
"Header_4": "Content-Length: 813",
"Header_5": "Content-Type: application/pkix-crl",
"Header_6": "Last-Modified: Tue, 02 May 2017 22:24:24 GMT",
"Header_7": "ETag: 0x8D491A9FD112A27",
"Header_8": "Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0",
"Header_9": "x-ms-request-id: 477c132d-0001-0045-443b-c49ae1000000"
}
}
Should this be an array of element objects based off of element name existing more than once?
{
"HTTPResponseHeadersInfo": [
{
"Header": "x-ms-blob-type: BlockBlob"
}, {
"Header": "HTTP/1.1 200 OK"
}, {
"Header": "x-ms-version: 2009-09-19"
}, {
"Header": "x-ms-lease-status: unlocked"
}, {
"Header": "Connection: keep-alive"
}, {
"Header": "Date: Thu, 18 May 2017 11:37:58 GMT"
}, {
"Header": "Content-Length: 813"
}, {
"Header": "Content-Type: application/pkix-crl"
}, {
"Header": "Last-Modified: Tue, 02 May 2017 22:24:24 GMT"
}, {
"Header": "ETag: 0x8D491A9FD112A27"
}, {
"Header": "Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0"
}, {
"Header": "x-ms-request-id: 477c132d-0001-0045-443b-c49ae1000000"
}
]
}
This represents the actual XML in a more 1 to 1 manor I feel. It would also eliminate relying on the key name of a valueless element.
Added passing test based on this xml - https://github.com/omerbenamram/evtx/blob/master/samples/event_with_entity_ref.xml. In these examples it works as expected because each element (with name Header) has a value. I think we would see some skewed results if all elements had the same name, but some had attributes/values but others didn't. I am not sure if that is a valid scenario, I have yet to see the case. I have ran the changes against almost 5000 event logs which produced 29+million json line event records. I ran this before and after these changes, and made a custom tool to validate the attributes counts still matched up:
With Empties
Total Lines: 29778028, Populated Attributes: 1015209628, Empty Attributes: 102872847
Without Empties
Total Lines: 29778028, Populated Attributes: 1015209628, Empty Attributes: 0
This actually saved a couple gigabytes in the output by removing the empty values.
@forensicmatt hopefully the parser didn't error on any of those right? I think this fix is correct :) since this issue only effects the sepratate attributes flow I'm comfortable with it.
As for the headers thing - the reason I've opted for the map with the keys is to keep consistency with XML structures having only one header for example (which basically the happy flow), so users will always access the Map via:
This also is less of a breaking change coming from indices who've already ingested some data where this previously was a map rather than a list.
There isn't a definite answer on XML->JSON conversion, but this felt OK. I'm open to carefully adding flags the XML to JSON conversion scheme, but we'll need to see how much it will complicate the implementation.
So if you've checked this works on your files I think this can be merged.
If we had some type of XML string to JSON function we could test a few possible scenarios. I would prefer to test additional scenarios, but those can always be added as "features" later 😆 . The change did allow me to go from only indexing 24 million to 29 million.
No errors between the previous version and the new implementation as indicated by the total attribute counts.
My concern is this type of scenario:
<HTTPResponseHeadersInfo>
<Header attribute1="NoProxy"></Header>
<Header>HTTP/1.1 200 OK</Header>
</HTTPResponseHeadersInfo>
Producing:
"HTTPResponseHeadersInfo": {
"Header_attributes": {"attribute1": "NoProxy"},
"Header": "HTTP/1.1 200 OK"
}
and not:
"HTTPResponseHeadersInfo": {
"Header_1_attributes": {"attribute1": "NoProxy"},
"Header_2": "HTTP/1.1 200 OK"
}
But I think most xml -> json libs produce multiple elements with the same name as an array.
For example:
import xmltodict
import json
xml_dict = xmltodict.parse("""
<HTTPResponseHeadersInfo>
<Header attribute1="NoProxy"></Header>
<Header>HTTP/1.1 200 OK</Header>
</HTTPResponseHeadersInfo>
""")
print("{}".format(json.dumps(xml_dict, indent=2)))
Produces:
{
"HTTPResponseHeadersInfo": {
"Header": [
{
"@attribute1": "NoProxy"
},
"HTTP/1.1 200 OK"
]
}
}
This also is less of a breaking change coming from indices who've already ingested some data where this previously was a map rather than a list.
If we are in the context of elastic here, elastic actually treats arrays the same a individual values (objects or concrete (but not both)).
If we have record 1:
<HTTPResponseHeadersInfo>
<Header>HTTP/1.1 200 OK</Header>
</HTTPResponseHeadersInfo>
Which produces:
"HTTPResponseHeadersInfo": {
"Header": "HTTP/9999 300 NOT OK"
}
And record 2
<HTTPResponseHeadersInfo>
<Header attribute1="NoProxy"></Header>
<Header>HTTP/1.1 200 OK</Header>
<Header attribute1="WithProxy">HTTP/9999 300 NOT OK</Header>
</HTTPResponseHeadersInfo>
Which produces:
"HTTPResponseHeadersInfo": [{
"Header_attributes": {
"attribute1": "NoProxy"
}
}, {
"Header": "HTTP/1.1 200 OK"
} {
"Header": "HTTP/9999 300 NOT OK",
"Header_attributes": {
"attribute1": "WithProxy"
}
}
]
These are both valid, and both records would be searched the same for a query like "HTTPResponseHeadersInfo.Header: OK". That being said, this does not work for NoSQL solution that do not have this type of functionality like Mongo or Arango. But this is indeed a down fall of XML format (or at least the conversion of XML to another format).
I would like to make a change that would ensure this:
<HTTPResponseHeadersInfo>
<Header attribute1="NoProxy"></Header>
<Header>HTTP/1.1 200 OK</Header>
</HTTPResponseHeadersInfo>
"HTTPResponseHeadersInfo": {
"Header_1_attributes": {"attribute1": "NoProxy"},
"Header_2": "HTTP/1.1 200 OK"
}
Unfortunately, because I cannot find any events that match these scenarios, I cant test and I dont want to make any changes I can't test. If you add a XML string to JSON function, we can generate these scenarios.
I've added the utility in a branch called feature/json-unit-tests
.
It does seem the last case you've mentioned is misbehaving, you're welcome to merge it into this branch and check it against some samples.
We'll need to put some extra effort into json conversion to tackle these edge cases.
Well, this is indeed a challenge 😄
I'll merge this for now - and release a patch, we could discuss this further in followup PR. Also closing the Issues #73 #71.
When using separate json attributes, if the element's value is empty, remove the empty mapping.
Resolves: https://github.com/omerbenamram/evtx/issues/70