hashlookup / hashlookup-forensic-analyser

Analyse a forensic target (such as a directory) to find and report files found and not found from CIRCL hashlookup public service - https://circl.lu/services/hashlookup/
https://hashlookup.github.io/hashlookup-forensic-analyser/

Stream processing and cached/async lookups? #3

Closed sthagen closed 2 years ago

sthagen commented 2 years ago

I wonder if this script might be enhanced for use cases handling some or all of:

@adulau: In case there is interest, I am happy to provide a minimally invasive pull request (have to implement before ... of course).

Questions:

  1. Which Python version is the minimum target? Is it 3.6+ or 3.8 or ...? I do find indicators in the source but no declaration ... I assume it is a version every analyst has on their OS, but I do not know which one that is :wink:
  2. Could one use a bulk query endpoint as per "Bulk search of SHA-1 hashes"?
  3. Is the Linux binary compiled from the Python source via Nuitka, Pythran et al., or is it built from some other source language hosted elsewhere?

If it is compiled from Python source, I would need to know how, so that I can provide a version that compiles directly and no reviewer has to point out everything that breaks in that compilation.

adulau commented 2 years ago

Yep, it's on my to-do list, especially the caching to restart a stopped analysis. I'm also thinking of a two-pass mode: a first pass that hashes the files, and then a separate option that does the lookup (and can also use the bulk lookup) from the generated file. I'll add it in 0.3.
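For illustration, the two-pass idea could look roughly like the sketch below. It is not the project's implementation: the function names, the tab-separated manifest format, and the `{"hashes": [...]}` payload shape for the bulk SHA-1 endpoint are all assumptions made for this example. No network request is sent; the second pass only prepares the batches.

```python
import hashlib
import os
from typing import Dict, Iterable, Iterator, List, Tuple


def sha1_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are never loaded whole."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hashing_pass(root: str, manifest_path: str) -> int:
    """Pass 1: walk `root` and write one 'sha1<TAB>path' line per file.

    The manifest should live outside `root`, otherwise it is hashed too.
    Paths containing tabs or newlines are not handled by this simple format.
    """
    count = 0
    with open(manifest_path, "w", encoding="utf-8") as manifest:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.islink(path) or not os.path.isfile(path):
                    continue  # skip symlinks and special files
                try:
                    manifest.write(f"{sha1_of_file(path)}\t{path}\n")
                except OSError:
                    continue  # unreadable file: skip it in this sketch
                count += 1
    return count


def read_manifest(manifest_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (sha1, path) pairs from a manifest written by hashing_pass."""
    with open(manifest_path, "r", encoding="utf-8") as manifest:
        for line in manifest:
            sha1, _, path = line.rstrip("\n").partition("\t")
            yield sha1, path


def batches(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Group items into lists of at most `size` elements."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def bulk_payloads(manifest_path: str, batch_size: int = 100) -> Iterator[Dict[str, List[str]]]:
    """Pass 2: build one request body per batch of unique hashes.

    The {"hashes": [...]} shape is an assumption about the bulk endpoint.
    """
    unique = dict.fromkeys(sha1 for sha1, _ in read_manifest(manifest_path))
    for batch in batches(unique, batch_size):
        yield {"hashes": batch}
```

Separating the passes means the expensive hashing can run on an offline machine, and the manifest can be looked up later, or more than once, without touching the target again.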

Concerning the version, I would say Python 3.8 (as 3.6 reaches EOL in 2 months), but it should work as is on 3.6.

Regarding the binary, it's currently built with PyInstaller, as the goal is to produce binaries for Linux, macOS and Windows.

sthagen commented 2 years ago

Unfortunately, the Python version is still often a marketing-relevant question instead of simply staying one or two versions behind the latest 3.x.0 release … 3.8 would be a very good target in my book. The earliest Python version I can run on my Apple M1 hardware is 3.7.12 …

Much of the work of walking the file tree is done by the operating system, and I know of companies that boost their databases by spidering over the columnar file storage.

Thank you for sharing these roadmap aspects and the generation scheme for the binary, much appreciated.

adulau commented 2 years ago

Caching is now implemented in commit 362c4c0ae9097d833f75be69a4008a2eb412c397 and enabled with the --cache option. It is disabled by default.
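To illustrate how a lookup cache lets a stopped analysis resume, here is a minimal sketch. It does not show how --cache is implemented in the commit above: the `LookupCache` class, the JSON-lines file format, and the injected `lookup` callable are hypothetical names chosen for this example.

```python
import json
import os
from typing import Callable, Dict, Optional


class LookupCache:
    """Append-only JSON-lines cache keyed by lower-case SHA-1 (hypothetical format)."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.entries: Dict[str, Optional[dict]] = {}
        last_line = ""
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as handle:
                for line in handle:
                    last_line = line
                    try:
                        record = json.loads(line)
                        self.entries[record["sha1"]] = record["result"]
                    except (json.JSONDecodeError, KeyError, TypeError):
                        continue  # tolerate a line truncated by an interrupted run
        if last_line and not last_line.endswith("\n"):
            # Terminate a truncated final line so later appends start cleanly.
            with open(path, "a", encoding="utf-8") as handle:
                handle.write("\n")

    def get_or_lookup(
        self, sha1: str, lookup: Callable[[str], Optional[dict]]
    ) -> Optional[dict]:
        """Return the cached result, calling `lookup` only on a cache miss."""
        key = sha1.lower()
        if key in self.entries:
            return self.entries[key]
        result = lookup(key)  # e.g. a function that queries the lookup service
        self.entries[key] = result
        # Persist immediately so an interruption loses at most one record.
        with open(self.path, "a", encoding="utf-8") as handle:
            handle.write(json.dumps({"sha1": key, "result": result}) + "\n")
        return result
```

Because every result is written as soon as it arrives, rerunning the analysis with the same cache file only queries hashes that were not resolved before the interruption. A result of `None` (hash not known) is cached too, so misses are not repeated either.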

adulau commented 2 years ago

A Bloom filter is now implemented to allow fast lookups without Internet access.
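As a rough illustration of why a Bloom filter enables offline lookups, the sketch below builds a small filter in pure Python. It is not the filter format or library used by the project: the `BloomFilter` class, its sizing parameters, and the SHA-256-based double hashing are assumptions for this example only.

```python
import hashlib
import math
from typing import Iterator


class BloomFilter:
    """Minimal Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, capacity: int, error_rate: float = 0.001) -> None:
        if capacity < 1 or not 0.0 < error_rate < 1.0:
            raise ValueError("capacity must be >= 1 and 0 < error_rate < 1")
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.size = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.hash_count = max(1, round(self.size / capacity * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Double hashing: derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.lower().encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step, never zero
        for i in range(self.hash_count):
            yield (h1 + i * h2) % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


# Usage: load known hashes once, then test membership with no network access.
known = BloomFilter(capacity=1000)
known.add("a9993e364706816aba3e25717850c26c9cd0d89d")
print("A9993E364706816ABA3E25717850C26C9CD0D89D" in known)
```

A hit means "probably known" (subject to the configured false-positive rate), while a miss is definitive, which is what makes the structure suitable for a quick offline first check.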