How do I exclude files with specific file extension?

welkomier commented 3 years ago

Description of problem:

I tried to exclude files with certain extensions, such as .class, from being collected. To do so, I created a YAML file and passed it as a filter file. Log2timeline does load the filter file; however, when it runs, the files are not excluded.

Why are the filters not working? I would expect all files ending with .class to be excluded.

My guess is that something goes wrong here: https://github.com/log2timeline/plaso/blob/5972f9b38bb364879328130e4b95c21d36b4ffb0/plaso/engine/worker.py#L869-L874

The worker calls dfVFS's CompareLocation method, which for some reason compares each segment of the file's path with the regex pattern.

https://github.com/log2timeline/dfvfs/blob/5e3c089de915d6db981771232c10a53af12b6877/dfvfs/helpers/file_system_searcher.py#L393-L403

It constructs the location segments starting from my root directory, which results in: ['home', 'user', 'projects', 'plaso_fork', 'wrapper', 'test_data', 'bad_file.class']. The loop breaks as soon as a mismatch occurs in these lines, which is right away!

Could this be because I use a wrapper?

Command line and arguments:

I run the wrapper script for Plaso shown below with the following command line arguments: --storage_file ./dd.plaso --temporary_directory . --logfile ./logfile_dd.log.gz --partitions all --filter-file exclude_files_test.yaml --no_dependencies_check --process_archives --skip_compressed_streams --single_process --debug test_exlusion

Source data:

I created a test directory with the following structure:

test_data
├── bad_file.class
├── bin
│   ├── bad_file_2.class
│   └── good_file_2.txt
├── good_file.txt
└── lib
    ├── bad_file_3.class
    └── good_file_3.txt

And I used a filter file that looked like this:

description: Exclude all Bad File Extensions.
type: exclude
paths:
- '/.+[.]class'

Plaso version:

20210606

Operating system Plaso is running on:

Ubuntu 20.04.2 LTS

Installation method:

Pulled log2timeline/plaso from GitHub and created a Python wrapper of the following form:

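import sys

from plaso.cli import log2timeline_tool
from plaso.cli import tools as cli_tools

input_reader = cli_tools.StdinInputReader()
tool = log2timeline_tool.Log2TimelineTool(input_reader=input_reader)

def main(args):
    # Stop early if the command line arguments cannot be parsed.
    if not tool.ParseArguments(args):
        return False
    try:
        tool.ExtractEventsFromSources()
    except Exception as exception:
        # Report the failure instead of silently swallowing it.
        print('Extraction failed: {0!s}'.format(exception), file=sys.stderr)
        return False
    return True

if __name__ == "__main__":
    # Exit with a non-zero status when parsing or extraction fails.
    sys.exit(0 if main(sys.argv[1:]) else 1)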

Debug output/tracebacks:

The filtered debug log shows that all files are processed:
2021-07-07 15:54:30,993 [DEBUG] (MainProcess) PID:313944 <worker> [ProcessFileEntry] processing file entry: OS:/home/treebeard/Projects/plaso_fork/bore_wrapper/test_exlusion
...2021-07-07 15:54:30,993 [DEBUG] (MainProcess) PID:313944 <worker> [ProcessFileEntryDataStream] proce...
2021-07-07 15:54:30,995 [DEBUG] (MainProcess) PID:313944 <worker> [ProcessFileEntry] done processing file entry: OS:/home/treebeard/Projects/plaso_fork/bore_wrapper/test_exlusion
2021-07-07 15:54:30,997 [DEBUG] (MainProcess) PID:313944 <worker> [ProcessFileEntry] processing file entry: OS:/home/treebeard/Projects/plaso_fork/bore_wrapper/test_exlusion/bin
...2021-07-07 15:54:30,997 [DEBUG] (MainProcess) PID:313944 <worker> [ProcessFileEntryDataStream] proce...
2021-07-07 15:54:30,998 [DEBUG] (MainProcess) PID:313944 <worker> [ProcessFileEntry] done processing file entry: OS:/home/treebeard/Projects/plaso_fork/bore_wrapper/test_exlusion/bin
2021-07-07 15:54:31,000 [DEBUG] (MainProcess) PID:313944 <worker> [ProcessFileEntry] processing file entry: OS:/home/treebeard/Projects/plaso_fork/bore_wrapper/test_exlusion/lib
...2021-07-07 15:54:31,000 [DEBUG] (MainProcess) PID:313944 <worker> [ProcessFileEntryDataStream] proce...
2021-07-07 15:54:31,002 [DEBUG] (MainProcess) PID:313944 <worker> [ProcessFileEntry] done processing file entry: OS:/home/treebeard/Projects/plaso_fork/bore_wrapper/test_exlusion/lib
2021-07-07 15:54:31,003 [DEBUG] (MainProcess) PID:313944 <worker> [ProcessFileEntry] processing file entry: OS:/home/treebeard/Projects/plaso_fork/bore_wrapper/test_exlusion/bad_file.class
...2021-07-07 15:54:31,003 [DEBUG] (MainProcess) PID:313944 <worker> [ProcessFileEntryDataStream] proce...
2021-07-07 15:54:31,216 [DEBUG] (MainProcess) PID:313944 <worker> [ProcessFileEntry] done processing file entry: OS:/home/treebeard/Projects/plaso_fork/bore_wrapper/test_exlusion/bad_file.class
2021-07-07 15:54:31,217 [DEBUG] (MainProcess) PID:313944 <worker> [ProcessFileEntry] processing file entry: OS:/home/treebeard/Projects/plaso_fork/bore_wrapper/test_exlusion/good_file.txt
...2021-07-07 15:54:31,217 [DEBUG] (MainProcess) PID:313944 <worker> [ProcessFileEntryDataStream] proce...
2021-07-07 15:54:31,398 [DEBUG] (MainProcess) PID:313944 <worker> [ProcessFileEntry] done processing file entry: OS:/home/treebeard/Projects/plaso_fork/bore_wrapper/test_exlusion/good_file.txt
2021-07-07 15:54:31,399 [DEBUG] (MainProcess) PID:313944 <worker> [ProcessFileEntry] processing file entry: OS:/home/treebeard/Projects/plaso_fork/bore_wrapper/test_exlusion/bin/bad_file_2.class
...2021-07-07 15:54:31,400 [DEBUG] (MainProcess) PID:313944 <worker> [ProcessFileEntryDataStream] proce...
2021-07-07 15:54:31,583 [DEBUG] (MainProcess) PID:313944 <worker> [ProcessFileEntry] done processing file entry: OS:/home/treebeard/Projects/plaso_fork/bore_wrapper/test_exlusion/bin/bad_file_2.class
2021-07-07 15:54:31,585 [DEBUG] (MainProcess) PID:313944 <worker> [ProcessFileEntry] processing file entry: OS:/home/treebeard/Projects/plaso_fork/bore_wrapper/test_exlusion/bin/good_file_2.txt
...2021-07-07 15:54:31,585 [DEBUG] (MainProcess) PID:313944 <worker> [ProcessFileEntryDataStream] proce...
2021-07-07 15:54:31,767 [DEBUG] (MainProcess) PID:313944 <worker> [ProcessFileEntry] done processing file entry: OS:/home/treebeard/Projects/plaso_fork/bore_wrapper/test_exlusion/bin/good_file_2.txt
2021-07-07 15:54:31,768 [DEBUG] (MainProcess) PID:313944 <worker> [ProcessFileEntry] processing file entry: OS:/home/treebeard/Projects/plaso_fork/bore_wrapper/test_exlusion/lib/bad_file_3.class
...2021-07-07 15:54:31,769 [DEBUG] (MainProcess) PID:313944 <worker> [ProcessFileEntryDataStream] proce...
2021-07-07 15:54:31,950 [DEBUG] (MainProcess) PID:313944 <worker> [ProcessFileEntry] done processing file entry: OS:/home/treebeard/Projects/plaso_fork/bore_wrapper/test_exlusion/lib/bad_file_3.class
2021-07-07 15:54:31,951 [DEBUG] (MainProcess) PID:313944 <worker> [ProcessFileEntry] processing file entry: OS:/home/treebeard/Projects/plaso_fork/bore_wrapper/test_exlusion/lib/good_file_3.txt
...2021-07-07 15:54:31,952 [DEBUG] (MainProcess) PID:313944 <worker> [ProcessFileEntryDataStream] proce...
2021-07-07 15:54:32,135 [DEBUG] (MainProcess) PID:313944 <worker> [ProcessFileEntry] done processing file entry: OS:/home/treebeard/Projects/plaso_fork/bore_wrapper/test_exlusion/lib/good_file_3.txt

joachimmetz commented 3 years ago

Why are the filters not working? I would expect all files ending with .class to be excluded.

The regexes are applied on a per-path-segment basis. For your example, try:

description: Exclude all Bad File Extensions.
type: exclude
paths:
- '/.+[.]class'
- '/.+/.+[.]class'

The reason for this is that treating paths as strings can lead to strange edge cases, especially with data stream names.
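To illustrate the per-segment matching described above, here is a minimal Python sketch. It is not the dfVFS implementation: `matches_per_segment` is a hypothetical helper, and it assumes that both the filter pattern and the location are split on `/` and that a match requires the same number of segments on both sides.

```python
import re


def split_segments(path):
    """Splits a path or path pattern on '/' and drops empty segments."""
    return [segment for segment in path.split('/') if segment]


def matches_per_segment(pattern, location):
    """Compares each pattern segment with the location segment at the same depth.

    Hypothetical helper for illustration only, not part of dfVFS or Plaso.
    """
    pattern_segments = split_segments(pattern)
    location_segments = split_segments(location)
    # Assumption of this sketch: the depths must be equal for a match.
    if len(pattern_segments) != len(location_segments):
        return False
    return all(
        re.fullmatch(pattern_segment, location_segment) is not None
        for pattern_segment, location_segment in zip(
            pattern_segments, location_segments))


print(matches_per_segment('/.+[.]class', '/bad_file.class'))  # True
print(matches_per_segment('/.+[.]class', '/bin/bad_file_2.class'))  # False
print(matches_per_segment('/.+/.+[.]class', '/bin/bad_file_2.class'))  # True
```

Under these assumptions a single-segment pattern never matches a file one directory deeper, which is why the second `paths` entry is needed.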

joachimmetz commented 3 years ago

Per https://github.com/log2timeline/plaso/issues/1537

add path suffixes support
what is the expected behavior here? to only match on leaf file entries (files and empty directories)?

welkomier commented 3 years ago

Thanks very much for the quick reply! I will try this solution asap.

description: Exclude all Bad File Extensions.
type: exclude
paths:
- '/.+[.]class'
- '/.+/.+[.]class'

However, wouldn't that mean that, for each file extension, every possible location depth would need to go into the filter file? That in turn would lead to a bloated filter file, right?

joachimmetz commented 3 years ago

That in turn would lead to a bloated filter file, right?

Yes, at the moment that would lead to repeated, similar-looking filter expressions.

Having a more granular find spec could be an option here, but that has not been high on the priority list.
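Until such a find spec exists, the repeated expressions could be generated rather than written by hand. The following is only a sketch: `build_exclude_paths` and `render_filter_file` are hypothetical helpers, the maximum depth is an arbitrary choice, and the output simply mirrors the filter file layout shown earlier in this thread.

```python
def build_exclude_paths(extensions, max_depth):
    """Builds one per-segment pattern per extension and directory depth.

    Hypothetical helper for illustration only, not part of Plaso.
    """
    paths = []
    for extension in extensions:
        for depth in range(1, max_depth + 1):
            # depth - 1 directory segments followed by the file name segment.
            paths.append('/.+' * (depth - 1) + '/.+[.]' + extension)
    return paths


def render_filter_file(description, paths):
    """Renders the patterns in the layout used by the filter file above."""
    lines = ['description: ' + description, 'type: exclude', 'paths:']
    lines.extend("- '" + path + "'" for path in paths)
    return '\n'.join(lines) + '\n'


print(render_filter_file(
    'Exclude all Bad File Extensions.',
    build_exclude_paths(['class'], 2)))
```

With `['class']` and a maximum depth of 2 this reproduces the two `paths` entries suggested above; the file still grows with every extension and depth, it just no longer has to be maintained manually.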

welkomier commented 3 years ago

Another question related to this: why does it resolve the path for a normal folder all the way back to /, while it does not do so for an image file, for example?

It's quite hard to create a filter file that works for both cases...

joachimmetz commented 3 years ago

Why does it resolve the path for a normal folder all the way back to /, while it does not for example for an image file?

It is not clear to me what you mean.

If you have a folder, you can determine which files are in it, so why would you need to filter them in addition? Could you describe your workflow here?

welkomier commented 3 years ago

Good question! I am trying to make Plaso work in a forensic context, where the contents of the files to analyze are mostly unknown. My workflow is therefore the following:

1. I get arbitrary data. It could be a folder, a ZIP file or an image file of some sort.
2. I pipe it through Plaso.
3. I get a readout of all logs, a timeline and even a dataset containing possible IOCs etc.

I therefore cannot know what kind of data I pipe through Plaso. I do, however, know what kind of files I want to ignore (e.g. pyc etc.). That's why it would be so nice to have a simple way of ignoring these.

joachimmetz commented 3 years ago

Why does it resolve the path for a normal folder all the way back to /, while it does not for example for an image file?

Can you explain what you mean by this?

I try to make plaso work in a forensic context,

Can you elaborate on what you mean by forensic (or legal) context? Also see: https://en.wikipedia.org/wiki/Forensic_science#Etymology

First of all, a file extension is no guarantee of file content (even MIME or magic types have shortcomings). Plaso parsers try to detect the file format they support and ignore the file if it does not meet the requirements. To my knowledge, there are no parsers specific to .pyc, .pyo or .class files in Plaso. So why do you want to filter them at the file system level?

Where do you get the arbitrary data from? Endpoints, some collection script or system? Where do you store it for Plaso to parse it? How do you plan to use the filter files?
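The point about extensions versus content can be made concrete with a small Python sketch. Java class files start with the four bytes 0xCAFEBABE, so the extension and the signature can be checked independently. `looks_like_java_class` is a hypothetical helper for illustration and has nothing to do with how Plaso detects formats. The signature itself also shows the shortcoming of magic types: Mach-O universal binaries begin with the same four bytes.

```python
JAVA_CLASS_MAGIC = b'\xca\xfe\xba\xbe'


def looks_like_java_class(name, data):
    """Returns (has_extension, has_magic) for a file name and its first bytes.

    Hypothetical helper for illustration only, not part of Plaso.
    """
    has_extension = name.lower().endswith('.class')
    has_magic = data[:4] == JAVA_CLASS_MAGIC
    return has_extension, has_magic


# A text file that was merely named .class.
print(looks_like_java_class('bad_file.class', b'plain text'))
# (True, False)

# A class file that was renamed to .txt.
print(looks_like_java_class('renamed.txt', b'\xca\xfe\xba\xbe\x00\x00\x00\x34'))
# (False, True)
```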

welkomier commented 3 years ago

Can you explain what you mean by this?

If I process a folder with the absolute path /home/user/projects/actual_project/folder that contains the file file.test, and I run my log2timeline wrapper from the directory actual_project, it matches the filters against the absolute path. It therefore needs a filter path of the form /.+/.+/.+/.+/.+/.+[.]test to match the contained file. If, on the other hand, I process an image file located at /home/user/projects/actual_project/test.img that contains the file structure folder/file.test, the filter /.+/.+[.]test suffices. I was just wondering why that is the case.
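The two filter paths in this comment differ only in the number of segments. A minimal Python sketch makes the count explicit; `pattern_for_location` is a hypothetical helper that emits one `/.+` per directory segment plus the file name segment, and it is not part of Plaso or dfVFS.

```python
def pattern_for_location(location):
    """Builds a per-segment filter pattern with the same depth as location.

    Hypothetical helper for illustration only.
    """
    segments = [segment for segment in location.split('/') if segment]
    # Text after the last dot of the file name segment.
    extension = segments[-1].rsplit('.', 1)[-1]
    return '/.+' * (len(segments) - 1) + '/.+[.]' + extension


# Folder case: the location is matched from / (six segments).
print(pattern_for_location(
    '/home/user/projects/actual_project/folder/file.test'))
# /.+/.+/.+/.+/.+/.+[.]test

# Image case: the location is relative to the image's file system (two segments).
print(pattern_for_location('/folder/file.test'))
# /.+/.+[.]test
```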

Can you elaborate on what you mean by forensic (or legal) context? Where do you get the arbitrary data from? Endpoints, some collection script or system? Where do you store it for Plaso to parse it? How do you plan to use the filter files?

I mean this in a narrow sense of the word, close to its actual meaning. To be precise, digital forensics is the proper name for the use case. The data is brought to me for analysis after incidents such as ransomware attacks. My primary goal is therefore to get a good overview of the available IOCs, such as IP addresses or files with known hashes. The data is not standardized and does not always come in the same form. It can be any type of file or container.

First of all, a file extension is no guarantee of file content (even MIME or magic types have shortcomings).

That is true. However, it suffices for a first impression in most cases.

Plaso parsers try to detect the file format they support and ignore the file if it does not meet the requirements. To my knowledge, there are no parsers specific to .pyc, .pyo or .class files in Plaso. So why do you want to filter them at the file system level?

Very good point. I was hoping to speed up the process by ignoring the files I'm most likely not interested in. I am still exploring how log2timeline handles files it doesn't have a parser for, though. Does it just ignore them, or does it actually try to open them and then fail? Looking at the log files, it seems that the latter is the case. Or am I wrong?

joachimmetz commented 3 years ago

I was just wondering why that is the case.

I would need to double-check, but if I recall correctly the base path /home/user/projects/actual_project/folder is passed as the "mount point" to the dfVFS file system searcher and is therefore excluded from the find spec search.

I am still exploring how log2timeline handles files it doesn't have a parser for, though.

The file is opened to determine whether its content can be processed by a parser and, if filestat is enabled, file system metadata is extracted as well.

One approach could be to do some preprocessing, e.g. running find path -name \*.class, to generate an exclusion filter file.
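The preprocessing idea could look roughly like the Python sketch below, which walks a source directory much like `find path -name \*.class` and records one per-segment pattern for each depth at which a matching file actually occurs. `collect_depth_patterns` is a hypothetical helper, not a Plaso API. Which directory the depth has to be counted from is an open question in this thread, so the `base` parameter is an assumption that the caller must adjust.

```python
import os


def collect_depth_patterns(source_dir, extension, base=None):
    """Returns per-segment patterns for every depth where *.extension occurs.

    Hypothetical helper for illustration only. Depth is counted relative to
    base, which defaults to source_dir (an assumption of this sketch).
    """
    base = base or source_dir
    patterns = set()
    for root, _directories, filenames in os.walk(source_dir):
        for filename in filenames:
            if not filename.endswith('.' + extension):
                continue
            relative_path = os.path.relpath(os.path.join(root, filename), base)
            depth = len(relative_path.split(os.sep))
            patterns.add('/.+' * (depth - 1) + '/.+[.]' + extension)
    # Shallowest patterns first.
    return sorted(patterns, key=len)
```

For the test_data directory from the original report, calling `collect_depth_patterns('test_data', 'class')` would yield the two patterns suggested earlier, which can then be written into the `paths` list of the filter file.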

welkomier commented 3 years ago

Very good idea! Thanks for your inputs.

How do I exclude files with specific file extension? #3796