Add --metadata-output flag to include metadata for generated jsonl/csv records

volatilityfoundation / volatility3

Volatility 3.0 development

http://volatilityfoundation.org/

Other

2.72k stars 463 forks source link

Add --metadata-output flag to include metadata for generated jsonl/csv records #1357

Open JSCU-CNI opened 1 week ago

JSCU-CNI commented 1 week ago

In this pull request, we added the flag --metadata-output which adds metadata to each record that is generated by a plugin. Currently just two pieces of metadata are added: (1) the location of the primary layer/image ('this is the artefact from which I generated this record') and (2) the plugin name that generated the record.

The metadata is prefixed with two underscores. Example output would be:

vol3 -f image.raw --metadata-output -r jsonl linux.psaux
{"ARGS": "/sbin/init", [...], "__image_path": "<path to image>", "__plugin": "linux.psaux"}
[...]

In the future, one could imagine more metadata being added to each record. For example: the datetime of when the record was generated, the SHA256/MD5/SHA1 hashes of the primary layer/image, and the version of Volatility3 that generated the record. We did not add these fields in this pull request, to first gauge your reaction to adding metadata. If you like the idea, we can add the remaining fields as well (in a subsequent PR).

One could use this flag when storing large(r) amount of Volatility3 output for postprocessing. I.e. if one would fire all possible plugins on a single file and appends all jsonl/csv output to a single file, the __plugin field can then be used to postprocess the generated records.

ikelos commented 1 week ago

Thanks very much for your contribution. This pull request would differentiate the output from JSONL to all the other forms of output (notably the standard JSON or CSV) and it would only do it for the CLI text renderer, and not any other output tool that could make use of the framework. It also feels as though it could be achieved through a script that the output is piped through, rather than needing to change the program itself? Deciding what additional data to include, and then allowing people not to include it (because someone would complain that it's unnecessary or too noisy) and it starts to become a lot of trouble for a very specific use-case.

From you recent bug reports, it sounds as though you might be best off writing a custom UI with the specific requirements you want for the tasks you're trying to do? We'd be happy to help advise on those if that would help?

atcuno commented 1 week ago

@ikelos the significant issue trying to be addressed here is that our jsonl records are unusable in the common situation where people will just want to send Volatility output into systems like Elastic, Splunk, or other common ingestion APIs.

The non-usability comes from the fact that the jsonl records do not have standard columns (obviously since they are plugin specific), but then on top of that, analysts can't even customize their own system to ingest the output since the columns all change and nothing specifies which plugin produced the record.

If having this as an option available to the standard jsonl isn't possible, can the PR be changed to add a new renderer type (-r jsonl-metadata or similar) that fits these requirements?

Forcing people to write their own external scripts to be piped out doesn't make sense to me when many of our users across many organizations will be using Volatility only in this manner / for this purpose. On the personal opinion side, I also wouldn't want to have to explain the need for a custom script during trainings as it would make me/other people teaching sound like amateurs since the ingestion use case is an every day workflow in enterprise settings.

ikelos commented 1 week ago

the significant issue trying to be addressed here is that our jsonl records are unusable in the common situation where people will just want to send Volatility output into systems like Elastic, Splunk, or other common ingestion APIs.

I feel there should be a different tool for that specific purpose, particularly if it's as ubiquitous as you put forward here. This is like saying the cheesegrater you bought is unsuitable for mowing the lawn. The CLI is a general tool for accessing the volatility framework, if people want to write something that sends the data to splunk/elastic/whatever, then they should write a separate tool to do that. They can then configure it to contain whatever data it is they want. They can control the inputs, the outputs and even build proper pipe-lining for it. The framework was designed as a library specifically to allow people to do exactly this. The CLI was designed to be a simple frontend to give people access to the framework, which is what it does.

This plugin also makes certain simplifying assumptions that do not fit with the framework in general (such as that there's only one memory image input, etc). It doesn't include any mention of the swap space? Surely for completeness it should contain a copy of the config, which would allow recreating the image for future testing and ensure that multiple images/swap space are properly supported? My point is, adding "just this one thing" is a slippery slope, it's not that easy to do without some careful consideration and at some point people need to start writing their own tools.

I would be more open to a specific other renderer (particularly if it were a drop-in/plugin so that it wasn't necessarily part of the core) that outputs metadata, but I'm not sure the renderers are currently handed all the data people might want them to be given and I'm not intending to pass through all possible information just because someone might want the renderer to support a single specific use-case. The only significant change I'm intending to make is to extend the renderer design with the context (so that renderers can gather their own data from the images to support the Data output format), but that's about it and that's still in the works/needs some thinking about. That would at least allow you access to the config if you knew what values you were looking for...

atcuno commented 6 days ago

First, I am pretty certain you are mixing up features between PRs in your reply. All this PR here wants to do is get the image path and the plugin name into rows that are output. This isn't the other ticket about plugins accessing config data for truncating strings etc.

For example, this part of your reply when this PR isn't a new plugin: "This plugin also makes certain simplifying assumptions that do"

For this one, the workflow fits 100% into what custom renderers classes are for. We already support outputs for ingestion into other tools, which is the whole purpose of the existing json, jsonl, and CSV renderers. So adding support for another common DFIR format - jsonl, but with the context added to have custom post processing is what the custom render classes are for.

This part of your reply is not a concern for what would be this custom renderer:

"but I'm not sure the renderers are currently handed all the data people might want them to be given"

The current renderers produce the rows that are needed fine for this type of work, all this is needed is making the extra columns available to it - the file path and plugin. The use case here is not testing to ever go back and run plugns again, so the whole configuration would not be helpful or ever used. The purpose is just to get Volatility to create the data that can then be ingested into Splunk/ES/etc. This is exactly what the renderers are for - to produce a certain format - like the existing ones.

So with this clarification, is there still a reason we can't have a custom JSON renderer class that adds these extra fields because to me this is 100% what the custom renderers are for and this fits the most common use case of Volatility 3 for enterprise scale investigations, so it makes 100% sense to have it as a feature of the framework.

ikelos commented 6 days ago

I think you misunderstood me, I didn't mean to say plugin, but no other issue entered my mind when responding to this one. You didn't appear to address any of the concerns I raised:

This doesn't consider the best way to pass the information through to the renderers, it just bulldozers them into the code
This seems incomplete, because it only deals specifically with two additional items of information, meaning there may be more requests in the future for something that would be useful, and this would then be a template for bull-dozering them in the same way if it were allowed
There's no real thought about all the possibilities within the framework, just a single use-case, so specifically this fails to meaningfully deal with multiple input layers or swap space
Renderers are designed to output the information provided by a plugin in different common formats, it was not the original design to make them provide all data in whatever format was needed. Characterizing them that way is wish fulfilment but not reality. Volatility will not ever provide everything everyone wants from its output. People should use the best tool for the job, and it may not always by the CLI (even if it would be convenient) or even exist yet.

One of the reasons I'm asked to review these requests is to gate-keep the code, with a view of the whole framework and everything it can do, and make sure the code doesn't turn into spaghetti. That means sometimes making decisions like this. Your last paragraph was a personal plea, but essentially said you don't understand my position. Hopefully I've clarified it here, but I'm happy to drill further into any specific points you may still not understand (although I'd prefer not to do it with points you simply disagree with).

The best option for this is to wait for the config to be passed through to renderers (which it will do to allow the Data type to be rendered with UI options for how much context to include, etc), and then write a separate self-contained renderer that reads the wanted elements from the config and adds it to the output. That way, even if it doesn't make it into core, it'll exist and people can use it. It should also make the authors consider the difficulties of where they acquire the information they want to output and how accurate/helpful it will actually be given all possible plugins. If it turns out to be massively popular we can see about ensuring it copes with those situation and possibly adding it to core.

atcuno commented 6 days ago

Ok, so if renderers will soon have access to this information through the Data updates, then what is the big concern over it being in the core repo? Just because people might want more fields in the future? In that case, we can just update the renderer if something beyond sample name and plugin name become critical.

ikelos commented 6 days ago

I've given you my list of concerns, which you haven't addressed. Instead you've repeated that you can't understand why I'm not blindly agreeing to your original point. Please, please save us both some time, and take onboard my points, and my experience as a software engineer, that is trying to engineer this software to be maintainable and extensible long-term, and accept that in my view quick fixes like the one suggested here will lead to code spaghetti. I really don't appreciate such resistance every time I won't provide instant gratification for a feature. We have different skillsets, I try not to question your forensic knowledge, and acquiesce when you tell me something should be a particular way. Could you please consider extending me the same courtesy around software design?

atcuno commented 6 days ago

I was addressing this from your previous reply:

The best option for this is to wait for the config to be passed through to renderers (which it will do to allow the Data type to be rendered with UI options for how much context to include, etc), and then write a separate self-contained renderer that reads the wanted elements from the config and adds it to the output. That way, even if it doesn't make it into core, ...

So I was asking, why, if the renderers will have access to the needed data (sample path + plugin name) after the Data update, is their resistance to the new renderer being contained in the core framework?

he original feature request and related discussion on adding direct access to the plugin name and sample name aren't needed if it is going to come through the Data updates, so I didn't see the point in addressing that. The only remaining issue is why there would be resistance to it going into the core.

ikelos commented 6 days ago

That suggests you've misunderstood the core of my issue with the feature. A single filename returned in the output works for most current plugins, but will not work when people write plugins that work on two memory images or take their data from other layers that don't have a filename (such as a future 1394 layer or similar). It will not show useful data if the same image is provided with different swap files and gives different results. It has not been properly thought out and that is why I don't want to add it to core in its current state.

As mentioned right at the start, there are some simplifying assumptions (all plugins will always be based on a single file image) that would then limit the framework, or not function properly in the future. I'm sorry you don't see the point in addressing these issues and decided to focus on the issue you felt important, but they all need considering to prevent future bug reports and long discussions trying to "solve" the issues we could have foreseen at the time, if only we'd considered them.

JSCU-CNI commented 6 days ago

Maybe a good way forward would be for us to update the PR (or create a new one) with a custom renderer as @atcuno suggested: -r jsonl-metadata. @ikelos would you be open to that?

ikelos commented 6 days ago

I'm not sure how you'd do that when renderers aren't given information such as which plugin has been run or which file was used as the single-location, but by all means give it a try and hopefully you'll run into some of the issues that concern me...

ikelos commented 6 days ago

Also, this can be achieved with the following simple script that uses jq (https://stedolan.github.io/jq/). It does not require extensive reworking of the internals of volatility:

python vol.py -r jsonl -f <filename> <plugin> | jq --arg filename <filename> --arg plugin <plugin> '. + {__image_path: $filename, __plugin: $plugin}'