gp: collect string global prevalence data

mandiant / flare-floss

FLARE Obfuscated String Solver - Automatically extract obfuscated strings from malware.

Apache License 2.0

gp: collect string global prevalence data #713

Open williballenthin opened 1 year ago

williballenthin commented 1 year ago

### Tasks
- [ ] gp: collect files from windows xp image
- [ ] gp: collect files from windows 7 image
- [ ] gp: collect files from windows 11 image
- [ ] gp: collect files from ubuntu image
- [ ] gp: collect files from macOS image
- [ ] gp: collect files from cygwin
- [ ] https://github.com/mandiant/flare-floss/issues/722

mr-tz commented 1 year ago

There may be value in including metadata (like offsets and the associated section) for data that we generate from scratch.

So instead of knowing "stringX is common", we would know "stringX is common at offsetY/in sectionZ".
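As a rough illustration of the idea above, prevalence could be keyed on the string together with its section rather than on the string alone. This is only a sketch: `StringLocation`, `count_by_section`, and the `(file_id, string, section)` input shape are hypothetical names chosen for the example, not part of FLOSS.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class StringLocation:
    # Hypothetical record: a string plus the section it was found in.
    string: str
    section: str  # e.g. ".rdata"


def count_by_section(observations):
    """Count how many distinct files each (string, section) pair appears in.

    observations: iterable of (file_id, string, section) tuples (assumed shape).
    """
    seen = set()
    counts = Counter()
    for file_id, string, section in observations:
        key = (file_id, string, section)
        if key in seen:
            continue  # count each file at most once per (string, section)
        seen.add(key)
        counts[StringLocation(string, section)] += 1
    return counts
```

With counts like these, a lookup can distinguish "stringX is common in .rdata" from "stringX is common anywhere". Offsets could be added as a further field in the same way.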

williballenthin commented 1 year ago

the database should be built from a "representative" set of PE files, whatever that means. we should try to include a bunch of legit and malware files. we don't want to over-emphasize malware such that malicious strings somehow look benign, though we hope for enough diversity among malware to avoid this naturally.

some data sources to consider:

- one day/week's worth of VT uploads
- Windows OS images
- files referenced by the NSRL (or subset)

targeting 1m input files seems nice and round and big enough.
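For a sense of what "global prevalence" means over such a corpus, here is a minimal sketch that counts, for each string, the number of input files containing it. The regex-based ASCII extraction and the function names are assumptions made for the example; the real pipeline may extract strings differently (for instance, UTF-16 strings are ignored here).

```python
import re
from collections import Counter

# Printable ASCII runs of at least 4 characters (an assumed minimum length).
ASCII_RE = re.compile(rb"[\x20-\x7e]{4,}")


def extract_ascii_strings(buf: bytes) -> set:
    """Return the set of distinct printable ASCII strings in a buffer."""
    return {m.group().decode("ascii") for m in ASCII_RE.finditer(buf)}


def global_prevalence(buffers):
    """Map each string to the number of buffers (files) it occurs in."""
    counts = Counter()
    for buf in buffers:
        # A set per file, so repeats inside one file count only once.
        counts.update(extract_ascii_strings(buf))
    return counts
```

Counting files rather than raw occurrences keeps a single file with many repeats of a string from inflating that string's prevalence.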

williballenthin commented 1 year ago

for NSRL files, we can use the published diskprints https://www.nist.gov/itl/ssd/software-quality-group/national-software-reference-library-nsrl/nsrl-subprojects/diskprints to extract the hashes of .exe and .dll file paths, then fetch those files from VT/etc.
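The filtering step might look roughly like the following. The diskprint file format is not described in this thread, so the sketch assumes the data has already been parsed into `(path, hash)` pairs; `select_pe_hashes` is a hypothetical helper name.

```python
from pathlib import PureWindowsPath

WANTED_SUFFIXES = {".exe", ".dll"}


def select_pe_hashes(records):
    """Collect hashes of records whose path ends in .exe or .dll.

    records: iterable of (windows_path, hex_digest) pairs (assumed shape).
    """
    hashes = set()
    for path, digest in records:
        # Compare suffixes case-insensitively, since Windows paths are.
        if PureWindowsPath(path).suffix.lower() in WANTED_SUFFIXES:
            hashes.add(digest.lower())
    return hashes
```

The resulting hash set is what would then be used to request the files from VT or another source.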

however, it seems only windows xp and windows 7 are included here, so we should also pull files from a win11 VM.

williballenthin commented 1 year ago

we should pull from files that we expect to use non-msvc compilers, too, such as cygwin. and maybe pull from non-windows systems, like linux and macos, to get gcc and llvm toolchain coverage.

williballenthin commented 1 year ago

for windows files we can enumerate and acquire them via here: https://m417z.com/Introducing-Winbindex-the-Windows-Binaries-Index/

c-urly commented 7 months ago

If we want to add ELF and Mach-O layout support in QS, we will also need to create a database from ELF and Mach-O binaries and malware, correct? Or do we already have tag rules for tagging ELF and Mach-O strings?

williballenthin commented 7 months ago

Strictly speaking, no. Layout and prevalence are disjoint topics/tags.

But, I do think the databases would be much more useful with this information. While there may be some overlap in strings found across formats, like OpenSSL strings, there should also be some differences, like MSVC vs GNU link etc.
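One way to check how much the formats actually overlap is to compare the string sets gathered per format. A minimal sketch, assuming the strings for each format are already available as iterables (`compare_string_sets` is a hypothetical helper):

```python
def compare_string_sets(pe_strings, elf_strings):
    """Split two string collections into shared and format-specific sets."""
    pe, elf = set(pe_strings), set(elf_strings)
    return {
        "shared": pe & elf,     # seen in both formats
        "pe_only": pe - elf,    # seen only in PE files
        "elf_only": elf - pe,   # seen only in ELF files
    }
```

A large `shared` set would suggest one database could serve both formats, while large format-specific sets would argue for per-format databases.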