Closed welkomier closed 3 years ago
Why are the filters not working? I would expect all files ending with .class to be excluded.
The regexs are on a per-path segment basis, for your example try:
```yaml
description: Exclude all Bad File Extensions.
type: exclude
paths:
- '/.+[.]class'
- '/.+/.+[.]class'
```
The reason for this is that treating paths as strings can lead to strange edge cases, especially with data stream names.
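To illustrate what per-path-segment matching means, here is a simplified sketch in plain Python (the helper is made up for this example and is not the actual dfVFS implementation):

```python
import re

def matches_per_segment(location_regex, path):
    """Illustrative sketch: compare each path segment against the
    corresponding segment of the filter expression."""
    pattern_segments = location_regex.strip('/').split('/')
    path_segments = path.strip('/').split('/')
    # A per-segment filter only matches paths of the same depth.
    if len(pattern_segments) != len(path_segments):
        return False
    return all(
        re.fullmatch(pattern, segment)
        for pattern, segment in zip(pattern_segments, path_segments))

# '/.+[.]class' only matches .class files directly under the root:
print(matches_per_segment('/.+[.]class', '/bad_file.class'))         # True
print(matches_per_segment('/.+[.]class', '/sub/bad_file.class'))     # False
# A second expression is needed for files one directory deeper:
print(matches_per_segment('/.+/.+[.]class', '/sub/bad_file.class'))  # True
```

This is why the example filter file above needs one `paths` entry per directory depth.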
Per https://github.com/log2timeline/plaso/issues/1537
add path suffixes support
what is the expected behavior here? to only match on leaf file entries (files and empty directories)?
Thanks very much for the quick reply! I will try this solution asap.
```yaml
description: Exclude all Bad File Extensions.
type: exclude
paths:
- '/.+[.]class'
- '/.+/.+[.]class'
```
However, wouldn't that mean that for each file extension all possible location depths would need to go into the filter file? Which again would lead to a bloated filter file, right?
Which again would lead to a bloated filter file, right?
Yes, at the moment that would lead to repeated, similar-looking filter expressions.
Having a more granular find spec could be an option here, but that has not been high on the priority list.
Another question related to this:
Why does it resolve the path for a normal folder all the way back to /, while it does not for example for an image file? It's quite hard to create a filter file that works for both cases...
Why does it resolve the path for a normal folder all the way back to /, while it does not for example for an image file?
It is not clear to me what you mean.
If you have a folder, you can determine what files are in the folder; why would you need to filter these in addition? Could you describe your workflow here?
Good question! I try to make plaso work in a forensic context, where the contents of the files to analyze are mostly unknown. My workflow is therefore the following:
I therefore cannot know what kind of data I pipe through plaso. I do, however, know what kind of files I want to ignore (e.g. .pyc etc.). That's why it would be so nice to have a simple way of ignoring these.
Why does it resolve the path for a normal folder all the way back to /, while it does not for example for an image file?
Can you explain what you mean by this?
I try to make plaso work in a forensic context,
Can you elaborate on what you mean by forensic (or legal) context? Also see: https://en.wikipedia.org/wiki/Forensic_science#Etymology
First of all, a file extension is no guarantee for file content (even mime or magic types have shortcomings). Plaso parsers try to detect the file format they support and ignore the file if it does not meet the requirements. To my knowledge there are no typical .pyc, .pyo or .class specific parsers in plaso. So why do you want to filter them on a file system level?
Where do you get the arbitrary data from? End-points, some collection script/system? Where do you store it for Plaso to parse it? How do you plan to use the filter files?
Can you explain what you mean by this?
If I process a folder with the absolute path /home/user/projects/actual_project/folder that contains the file file.test, while I run my log2timeline wrapper from the directory actual_project, it matches the filters against the absolute path. It therefore needs a filter_path of the form /.+/.+/.+/.+/.+/.+[.]test to match the contained file. Whereas if I process an image file located at /home/user/projects/actual_project/test.img that contains the file structure folder/file.test, the filter /.+/.+[.]test suffices. I was just wondering why that is the case.
Can you elaborate on what you mean by forensic (or legal) context? Where do you get the arbitrary data from? End-points, some collection script/system? Where do you store it for Plaso to parse it? How do you plan to use the filter files?
I mean this in a narrow sense of the word and close to its actual meaning. To be precise, digital forensics is the proper name for the use case. The data is brought to me for analysis after incidents such as e.g. ransomware attacks. My primary goal therefore is to get a good overview of the available IOCs such as IP addresses or files with known hashes. The data is therefore not standardized or always in the same form. It can be any type of file or container.
First of all, a file extension is no guarantee for file content (even mime or magic types have shortcomings).
That is true. However, it suffices for a first impression in most cases.
Plaso parsers try to detect the file format they support and ignore the file if it does not meet the requirements. To my knowledge there are no typical .pyc, .pyo or .class specific parsers in plaso. So why do you want to filter them on a file system level?
Very good point. I was hoping to speed up the process by ignoring the files I'm most likely not interested in. I'll explore how log2timeline handles files it doesn't have a parser for, though. Does it just ignore them, or does it actually try to open them and then fail? Looking at the log files it seems that the latter is the case. Or am I wrong?
I was just wondering why that is the case.
I would need to double-check, but if I recall correctly the base path /home/user/projects/actual_project/folder is passed as "mount point" to the dfVFS file system searcher and is therefore excluded from the find spec search.
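If that recollection is correct, the difference could be sketched like this (purely illustrative Python; relative_location is a made-up helper, not part of dfVFS):

```python
def relative_location(full_path, mount_point):
    # Hypothetical sketch: strip the mount point prefix so that find
    # specs are matched against the path relative to the search root.
    if mount_point and full_path.startswith(mount_point):
        return full_path[len(mount_point):] or '/'
    return full_path

# Inside an image, paths are already relative to the image root:
print(relative_location('/folder/file.test', ''))   # /folder/file.test
# With the directory passed as mount point, the same file would be
# matched against a much shorter location:
print(relative_location(
    '/home/user/projects/actual_project/folder/file.test',
    '/home/user/projects/actual_project/folder'))   # /file.test
```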
I'll explore how log2timeline handles files it doesn't have a parser for, though.
The file is opened to determine if the content can be processed by a parser and, if filestat is enabled, file system metadata is also extracted.
An approach could be to do some preprocessing, e.g. a find path -name \*.class, to generate an exclusion filter file.
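That preprocessing idea could also be sketched in Python (illustrative only; build_exclusion_filter and the sample directory are made up for this example):

```python
import os
import tempfile

def build_exclusion_filter(root, extension='.class'):
    # Sketch: walk the directory tree and emit one filter entry per
    # matching file, escaping the literal dot as [.] so it is not
    # interpreted as a regex wildcard.
    lines = ['description: Exclude all Bad File Extensions.',
             'type: exclude', 'paths:']
    for dirpath, _dirnames, filenames in sorted(os.walk(root)):
        for name in sorted(filenames):
            if name.endswith(extension):
                relative = os.path.join(dirpath, name)[len(root):]
                lines.append("- '{0:s}'".format(relative.replace('.', '[.]')))
    return '\n'.join(lines)

# Sample data for illustration only.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'sub'))
for path in ('a.class', os.path.join('sub', 'b.class'), 'keep.txt'):
    open(os.path.join(root, path), 'w').close()

print(build_exclusion_filter(root))
```

The generated entries are exact per-depth locations, so they sidestep the need to guess how many `/.+/` segments a generic pattern would need.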
Very good idea! Thanks for your inputs.
Description of problem:
I tried to exclude files with certain extensions, e.g. .class, from being collected. I therefore created a YAML file and added it as a filter. Log2timeline does properly include the filter file. However, when run, the files are not excluded. Why are the filters not working? I would expect all files ending with .class to be excluded. My suggestion is that something goes wrong here: https://github.com/log2timeline/plaso/blob/5972f9b38bb364879328130e4b95c21d36b4ffb0/plaso/engine/worker.py#L869-L874
The worker calls the dfVFS CompareLocation method, which for some reason starts to compare each segment of the path to the file with the regex pattern: https://github.com/log2timeline/dfvfs/blob/5e3c089de915d6db981771232c10a53af12b6877/dfvfs/helpers/file_system_searcher.py#L393-L403
It constructs the location segments from my root directory, which results in: ['home', 'user', 'projects', 'plaso_fork', 'wrapper', 'test_data', 'bad_file.class']. The loop breaks as soon as a mismatch occurs in these lines, which is right away! Could this be due to the fact that I use a wrapper?
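The described first-segment mismatch can be reproduced in isolation (a simplified sketch, not the actual dfVFS code):

```python
import re

# Segments of the absolute path, as in the report above.
path_segments = ['home', 'user', 'projects', 'plaso_fork', 'wrapper',
                 'test_data', 'bad_file.class']
# The filter '/.+[.]class' has a single location segment.
pattern = re.compile('.+[.]class')

# Comparing segment by segment, the very first comparison already fails:
print(bool(pattern.fullmatch(path_segments[0])))   # False: 'home'
print(bool(pattern.fullmatch(path_segments[-1])))  # True: 'bad_file.class'
```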
Command line and arguments:
I run the below wrapper script for plaso using the following command line arguments:
```
--storage_file ./dd.plaso --temporary_directory . --logfile ./logfile_dd.log.gz --partitions all --filter-file exclude_files_test.yaml --no_dependencies_check --process_archives --skip_compressed_streams --single_process --debug test_exlusion
```
Source data:
I created a test directory with the following structure:
And used a filter_file that looked like this:
Plaso version:
20210606
Operating system Plaso is running on:
Ubuntu 20.04.2 LTS
Installation method:
Pulled log2timeline/plaso from GitHub and created a wrapper in Python in the following form:

Debug output/tracebacks: