Open jberkers42 opened 7 years ago
"Since the parsing rules for several SIEM solutions are order-dependent" Can you elaborate on this please?
Hi @keeely,
Thanks for your response.
The specific case I am working on is integrating the event feed from Sophos Central into LogRhythm. The regex parsing rules require the fields to be presented in a consistent order. The log samples I provided show that sometimes the fields are presented in one order, while at another time, the fields are presented in a different order.
If I recall correctly, the McAfee ESM platform and IBM QRadar also require (or at least prefer) the fields to be presented in a consistent order.
Having done a little further research, it is also possible that the issue is caused by Python's dict data structures, which, like JSON, don't enforce order. I am not that familiar with Python at this time, but this behaviour is similar to Perl's hash data structures, which also don't preserve any order.
I am working with a number of other users via the LogRhythm Forums on a community collaboration to implement integration of Sophos Central into LogRhythm as we have an increasing number of customers seeking this.
Please let me know if you have any further questions.
Regards,
John Berkers :: Senior Security Engineer IPSec Pty Ltd
I think you are right about the python dict structure as the enumeration order is undefined. We have keyvalue, cef and json as outputs. Would you be wanting one, or all of these to be output in sorted order?
If you check the Python documentation the json serialiser appears to have a sort_keys option, which might help: https://docs.python.org/2/library/json.html
e.g. in siem.py line 258 siem_logger.info(json.dumps(i, ensure_ascii=False, sort_keys=True) + u'\n') (untested!)
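As a quick standalone check of what sort_keys does, here is a minimal sketch. The event dict is made up for illustration and is not taken from siem.py.

```python
import json

# Illustrative event only; the real events come from the Sophos Central API.
event = {"severity": "low", "group": "UPDATING", "dhost": "host-a"}

# sort_keys=True serialises keys in alphabetical order, so the output
# no longer depends on the dict's enumeration order.
line = json.dumps(event, ensure_ascii=False, sort_keys=True)
print(line)  # {"dhost": "host-a", "group": "UPDATING", "severity": "low"}
```

Note that sort_keys has to be an argument to json.dumps, not to the logger call.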
I'm relatively new to SIEM and it's somewhat baffling that anyone would consider using keyvalue or cef as a serialization format because they appear to be ambiguously defined. I was looking for a standard library to parse these formats outside of a SIEM solution, and found only pycef. I'm not sure if pycef has the same capabilities as the parsers on the other systems you mentioned.
Thanks for that. I will give that a try, and advise the outcome.
If I had a choice, I would be using something like JSON. However, because most SIEM solutions come from similar origins, they generally use a regex statement to parse logs into a standard set of fields. They then apply normalisation to the data (e.g. associate IP address and name to defined hosts, consistent formatting, etc.), as well as classifying the log (log on event, account lockout, malware detected), so that the data is treated equally for correlation.
Generally, when the vendors integrate with an API, they will take in the JSON format, and then output it to something they control for consistency.
Regards,
John Berkers :: Senior Security Engineer IPSec Pty Ltd
I have forked the repo, and applied both your suggested fix, as well as adding a "sorted()" function to the keyvalue and cef output functions.
Initial tests for both formats are promising in that the data fields are ordered as expected.
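For anyone wanting to reproduce this, a minimal sketch of the sorted() approach for key-value output follows. format_keyvalue is a hypothetical helper, not the actual function in siem.py, and the key="value"; layout is assumed from the samples in this issue.

```python
def format_keyvalue(event):
    """Hypothetical helper: render a flat dict as key="value"; pairs.

    Iterating over sorted(event) yields the keys alphabetically, so the
    field order is the same for every event with the same set of keys.
    """
    return " ".join('%s="%s";' % (key, event[key]) for key in sorted(event))


# Illustrative event only.
sample = {"severity": "low", "group": "UPDATING", "dhost": "host-a"}
print(format_keyvalue(sample))
# dhost="host-a"; group="UPDATING"; severity="low";
```

Alphabetical order is only one possible choice; a fixed, hand-written list of field names would work too if a particular order is preferred.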
@jberkers42, it would be great if you could contribute the changes from your forked repo back to this repo.
I would have thought that after 4 years the divergence of code would present an issue.
I can try, but it was essentially a one or two line change. I have not used this for going on 2 years.
@jberkers42, appreciate it. We are coming up with a roadmap to make a series of incremental improvements to this tool and would like to close out old issues.
I am attempting to write parsing rules for a LogRhythm SIEM, but am faced with the challenge that the data is not in a consistent order when using either the CEF or KeyValue output formats.
Since the parsing rules for several SIEM solutions are order-dependent, is it possible to force either a manual order, or sort the fields in a particular way, prior to outputting them to CEF or KeyValue format?
I understand that the underlying issue stems from the fact that JSON is non-pedantic about field ordering, and flattening the JSON just outputs it in whatever order the JSON data has it. Since this order changes from time to time, this is resulting in what I am seeing.
I have provided some sample logs below in KeyValue format.
UPDATING logs at different times
`2017-07-28T00:05:59.114Z rt="2017-07-28T00:05:59.114Z"; end="2017-07-28T00:05:59.101Z"; severity="low"; duid="duid"; whitelist_properties="{}"; dhost="host-a"; endpoint_type="computer"; endpoint_id="endpoint_id"; suser="user 1"; group="UPDATING"; customer_id="customer_id"; type="Event::Endpoint::UpdateSuccess"; id="id"; name="Update succeeded";
2017-08-06T10:27:42.481Z rt="2017-08-06T10:27:42.481Z"; group="UPDATING"; name="Update succeeded"; whitelist_properties="{}"; dhost="host-b"; endpoint_type="server"; endpoint_id="endpoint_id"; suser="n/a"; end="2017-08-06T10:27:42.474Z"; customer_id="customer_id"; type="Event::Endpoint::UpdateSuccess"; id="id"; severity="low"; `
PERIPHERAL logs at different times
`2017-08-06T23:15:26.039Z rt="2017-08-06T23:15:26.039Z"; end="2017-08-06T23:15:26.039Z"; name="Peripheral allowed: SAMSUNG Mobile USB Modem #2"; duid="duid"; whitelist_properties="{}"; dhost="host-c"; endpoint_type="computer"; endpoint_id="endpoint_id"; suser="user a"; group="PERIPHERALS"; customer_id="customer_id"; type="Event::Endpoint::Device::AlertedOnly"; id="id"; severity="low";
2017-07-27T22:56:41.855Z rt="2017-07-27T22:56:41.855Z"; end="2017-07-27T22:56:41.855Z"; severity="low"; duid="duid"; whitelist_properties="{}"; dhost="host-d"; endpoint_type="computer"; endpoint_id="endpoint_id"; suser="user b"; group="PERIPHERALS"; customer_id="customer_id"; type="Event::Endpoint::Device::AlertedOnly"; id="id"; name="Peripheral allowed: WD My Passport 0730 USB Device"; `
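To illustrate why the ordering matters, here is a rough sketch using Python's re module. The pattern is a simplified, hypothetical stand-in for an order-dependent parsing rule (not an actual LogRhythm rule), and the two lines are abbreviated versions of the UPDATING samples above.

```python
import re

# Abbreviated from the two UPDATING samples: same fields, different order.
line_a = ('rt="2017-07-28T00:05:59.114Z"; severity="low"; dhost="host-a"; '
          'group="UPDATING"; name="Update succeeded";')
line_b = ('rt="2017-08-06T10:27:42.481Z"; group="UPDATING"; '
          'name="Update succeeded"; dhost="host-b"; severity="low";')

# Hypothetical rule that expects severity to appear before group.
pattern = re.compile(r'severity="(?P<severity>[^"]*)";.*group="(?P<group>[^"]*)";')

match_a = pattern.search(line_a)
match_b = pattern.search(line_b)

print(match_a.groupdict())  # {'severity': 'low', 'group': 'UPDATING'}
print(match_b)              # None, because group precedes severity here
```

The same rule parses the first line and silently misses the second, which is the behaviour I am seeing with the real parsing rules.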