Open williballenthin opened 1 year ago
There may be value in including metadata (like offsets and the associated section) for data that we generate from scratch.
So instead of knowing "stringX is common", we would know "stringX is common at offsetY/in sectionZ".
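As a rough sketch of what that could look like (not an existing format; `StringPrevalence`, `build_prevalence`, and the section names in the sample data are all made up for illustration), each string could carry a per-section counter alongside its global count:

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field


@dataclass
class StringPrevalence:
    """Hypothetical record: how often a string was seen, and in which sections."""
    total: int = 0
    by_section: Counter = field(default_factory=Counter)


def build_prevalence(observations):
    """Aggregate (string, section_name) sightings into per-string records."""
    db = defaultdict(StringPrevalence)
    for s, section in observations:
        entry = db[s]
        entry.total += 1
        entry.by_section[section] += 1
    return db


# Illustrative sightings only; real input would come from parsed PE files.
observations = [
    ("GetProcAddress", ".idata"),
    ("GetProcAddress", ".idata"),
    ("GetProcAddress", ".text"),
    ("This program cannot be run in DOS mode.", "header"),
]
db = build_prevalence(observations)
```

With this shape, a lookup can still answer "is stringX common?" via `total`, while `by_section` answers "is it common in sectionZ?". Offsets could be bucketed and counted the same way.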
the database should be built from a "representative" set of PE files, whatever that means. we should try to include a bunch of legit and malware files. we don't want to over-emphasize malware such that malicious strings somehow look benign, though we hope for enough diversity among malware to avoid this naturally.
some data sources to consider:
targeting 1m input files seems nice and round and big enough.
for NSRL files, we can use the published diskprints https://www.nist.gov/itl/ssd/software-quality-group/national-software-reference-library-nsrl/nsrl-subprojects/diskprints to extract the hashes of .exe and .dll file paths, then fetch those files from VT/etc.
however, it seems only windows xp and windows 7 are included here. so we should also pull files from a win11 VM, too.
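a minimal sketch of the hash-selection step for the NSRL idea, assuming the diskprint data has already been parsed into `(path, digest)` pairs (the actual diskprint file format isn't covered here, and `select_pe_hashes` plus the sample entries are hypothetical):

```python
from pathlib import PureWindowsPath

# Extensions we want to fetch; compared case-insensitively.
WANTED_SUFFIXES = {".exe", ".dll"}


def select_pe_hashes(entries):
    """Return the sorted, de-duplicated digests of entries whose path ends in .exe/.dll."""
    hashes = set()
    for path, digest in entries:
        if PureWindowsPath(path).suffix.lower() in WANTED_SUFFIXES:
            hashes.add(digest.lower())
    return sorted(hashes)


# Placeholder entries; the digests are dummy values, not real file hashes.
entries = [
    (r"C:\Windows\System32\kernel32.DLL", "AA11"),
    (r"C:\Windows\notepad.exe", "bb22"),
    (r"C:\Windows\win.ini", "cc33"),
    (r"C:\Temp\copy_of_notepad.exe", "BB22"),
]
to_fetch = select_pe_hashes(entries)
```

the resulting list is what we'd hand to whatever download step we use (VT/etc.); de-duplicating first avoids fetching the same file twice when it appears under several paths.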
we should pull from files that we expect to use non-msvc compilers, too, such as cygwin. and maybe pull from non-windows systems, like linux and macos, to get gcc and llvm toolchain coverage.
for windows files, we can enumerate and acquire them via Winbindex: https://m417z.com/Introducing-Winbindex-the-Windows-Binaries-Index/
If we want to add ELF and Mach-O layout support in QS, we will also need to create a database from ELF and Mach-O binaries and malware, correct? Or do we already have tag rules for tagging ELF and Mach-O strings?
Strictly speaking, no. Layout and prevalence are disjoint topics/tags.
But, I do think the databases would be much more useful with this information. While there may be some overlap in strings found across formats, like OpenSSL strings, there should also be some differences, like MSVC vs GNU link etc.