log2timeline / plaso

Super timeline all the things
https://plaso.readthedocs.io
Apache License 2.0

FR: Have ElasticSearch output module export filename for all so it can be searched #3172

Closed davidrudduck closed 3 years ago

davidrudduck commented 4 years ago

Description of request: Prior to 20200717, if we were investigating an incident and wanted to deep dive on a particular user's artifacts/events, we would use an Elasticsearch query like "filename: username" or "filename: username*" to filter down to a particular folder path like "C:\Users\username" or "C:\Users\username.domain".

As this field is no longer being populated, this is not possible - and the path_spec field is populated with a JSON value instead of being broken out into sub-fields, which is not as easily searchable.

If the data in path_spec were broken out into sub-fields (path_spec.type, path_spec.location, etc.) instead of being lumped into one field as JSON, it would re-enable the ability to filter down based on the path of the artefact(s) we want to show events for.
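A minimal sketch of the kind of query this would re-enable, assuming a hypothetical path_spec.location sub-field and the elasticsearch Python client (8.x); the index name and field names are illustrative only, not the current plaso schema:

```python
# Hypothetical example: querying a broken-out path_spec.location sub-field.
# The index name "plaso-case" and the sub-field name are assumptions for
# illustration, not what plaso currently writes to Elasticsearch.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Rough equivalent of the old 'filename: john.domain*' style filter, but
# against a structured sub-field instead of a JSON blob.
response = es.search(
    index="plaso-case",
    query={"query_string": {"query": "path_spec.location: *john.domain*"}},
)
for hit in response["hits"]["hits"]:
    print(hit["_source"].get("datetime"), hit["_source"].get("message"))
```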

Plaso version:

For example 20200717

Operating system Plaso is running on:

Ubuntu 18.04 on WSLv1

Installation method: via PPA

Thoughts / suggestions

The alternative workaround for this would be to require Elasticsearch ingest pipelines to be built to break up the JSON values in the path_spec field; or alternatively, just break path_spec out into path_spec.type, path_spec.location and path_spec.type_indicator rather than dropping the entire JSON into a single value, which then needs to be processed to be usable.
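A sketch of that ingest-pipeline workaround, assuming the elasticsearch Python client (8.x) and a source field literally named path_spec that holds the JSON string; the pipeline id, target field and index name are placeholders:

```python
# Sketch of the ingest-pipeline workaround: parse the JSON string stored in
# "path_spec" into a structured object so its members become searchable
# sub-fields. Pipeline id, field names and index name are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.ingest.put_pipeline(
    id="plaso-split-path-spec",
    description="Parse the path_spec JSON string into an object",
    processors=[
        {
            "json": {
                "field": "path_spec",
                "target_field": "path_spec_parsed",
                "ignore_failure": True,
            }
        }
    ],
)

# The pipeline would then be applied at index time, for example:
# es.index(index="plaso-case", pipeline="plaso-split-path-spec", document={...})
```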

davidrudduck commented 4 years ago

To explain the workflow: when we are doing Windows forensics, we will often start by trying to find the login events (typically by filtering down to Windows event logs and the event IDs relevant to logon events).

Once we've identified the accounts that we believe have been compromised, and the logon/logoff times of those sessions, we will then filter down on the artefacts that come from that user's folder.

So if "john.domain" is the user that we believe was compromised, we will run a query in Elasticsearch to show all artifacts below C:\Users\john.domain for the period of the session.

Previously we used the filename field to do this by just using the query "filename: john.domain" (or sometimes "filename: Users AND filename: john*").

We are unable to use this type of query on path_spec due to the JSON formatting of the values stored within it. Both Lucene and KQL queries return zero results.

davidrudduck commented 4 years ago

Currently "path_spec" gets populated with JSON in Elasticsearch like so:

{"__type__": "PathSpec", "location": "/mnt/e/Cases/CaseName/Plaso/backupserver.body", "type_indicator": "OS"}

If I were building an ingestion pipeline for Elasticsearch, I would look at this JSON and break it down into its individual elements.

Therefore the "__type__" field (dropping all non-alphanumeric characters) would become "path_spec.type", with the value "PathSpec" (which seems redundant, mind you, since that seems to be the value for most entries); "location" would push into "path_spec.location" in Elasticsearch, and "type_indicator" would become "path_spec.type_indicator".

As a general rule, any field that has nested JSON in it should be broken out in a similar way, where the field becomes "field.subfield_name".
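A rough sketch of that flattening rule in Python (not plaso code); it only illustrates turning a nested JSON value into dotted "field.subfield_name" keys as described above:

```python
import json


def flatten(field_name, value):
    """Flatten a decoded JSON value into dotted field names.

    A nested value under "path_spec" such as {"location": "..."} becomes
    {"path_spec.location": "..."}; non-alphanumeric characters like the
    double underscores in "__type__" are stripped from the key names.
    """
    flattened = {}
    if isinstance(value, dict):
        for key, sub_value in value.items():
            clean_key = "".join(
                c for c in key if c.isalnum() or c == "_").strip("_")
            flattened.update(flatten(f"{field_name}.{clean_key}", sub_value))
    else:
        flattened[field_name] = value
    return flattened


path_spec_json = (
    '{"__type__": "PathSpec", "location": '
    '"/mnt/e/Cases/CaseName/Plaso/backupserver.body", "type_indicator": "OS"}')
print(flatten("path_spec", json.loads(path_spec_json)))
# {'path_spec.type': 'PathSpec',
#  'path_spec.location': '/mnt/e/Cases/CaseName/Plaso/backupserver.body',
#  'path_spec.type_indicator': 'OS'}
```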

joachimmetz commented 4 years ago

Related https://github.com/log2timeline/plaso/issues/2940

joachimmetz commented 4 years ago

@davidrudduck thanks for the write-up. Per chat, I'll have a look at this as part of #2940, since ES users seem to need the ability to control the output fields.

joachimmetz commented 4 years ago

One easy-to-implement option could be to add the display_name field to the Elasticsearch output.

A next step would be to change the Elasticsearch output module to support setting field names, like the dynamic output module does. We would need to define the "magical" (pre-defined) field names somewhere, and an option like '*' for all container attribute names.

davidrudduck commented 4 years ago

Would it suffice to say that the field names just become a dotted version of the parent field that otherwise stored them?

So even if you are breaking out an XML-based field, if the original field is "xml_string", the dynamic fields resulting from the XML are "xml_string.userid" and "xml_string.username" etc.?

In the case of path_spec the same would apply, and we would just get a dotted version of the original field: the JSON remains in path_spec, but its JSON content is broken out into sub-fields path_spec.type, etc.?

joachimmetz commented 4 years ago

> So even if you are breaking out an XML-based field, if the original field is "xml_string", the dynamic fields resulting from the XML are "xml_string.userid" and "xml_string.username" etc.?

Yes, additionally generated fields would need to have a namespace to prevent collisions with existing event data field names. I was thinking evtx or equivalent.

> In the case of path_spec the same would apply, and we would just get a dotted version of the original field: the JSON remains in path_spec, but its JSON content is broken out into sub-fields path_spec.type, etc.?

Not sure yet, I think for your use case display_name could suffice. display_name is a field that exists in other output formats. It's the "path" you see in the output of log2timeline.py/psteal.py. It's a combination of the file system path and the parent path specification, e.g. VSS1:C\:Windows
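A rough illustration of how such a display_name style value could be composed from a (simplified) path specification; this is an assumption-laden sketch for clarity, not plaso's actual implementation:

```python
# Illustrative sketch only, not plaso's implementation: compose a
# display_name-style value (e.g. "VSS1:/Windows/...") from a simplified
# path spec dict with an optional parent (such as a VSS store).
def make_display_name(path_spec):
    location = path_spec.get("location", "")
    parent = path_spec.get("parent")
    if parent and parent.get("type_indicator") == "VSHADOW":
        # Volume shadow copies are conventionally shown as VSS1, VSS2, ...
        prefix = f"VSS{parent.get('store_index', 0) + 1}"
    else:
        prefix = path_spec.get("type_indicator", "")
    return f"{prefix}:{location}"


example = {
    "type_indicator": "TSK",
    "location": "/Windows/System32/config/SYSTEM",
    "parent": {"type_indicator": "VSHADOW", "store_index": 0},
}
print(make_display_name(example))  # VSS1:/Windows/System32/config/SYSTEM
```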

Otherwise an fs namespace could help, e.g. fs:path or event_source:path (context: https://plaso.readthedocs.io/en/latest/sources/user/Scribbles-about-events.html)

joachimmetz commented 3 years ago

Changes to add an option allowing users to select additional fields in the ES output: https://github.com/log2timeline/plaso/pull/3463

joachimmetz commented 3 years ago

Closing this issue; the recent changes should cover the basic needs, and #2940 will track further extending field formatting.