build a database of known junk code strings

williballenthin commented 1 year ago

if our code recovery solution (lancelot or vivisect) fails to identify some code, then we may still display some junk strings that are actually instructions, like:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ .text ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫┃
It<Iu4P                                                             000099c3                  ┃┃
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ .rdata ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫┃

it's likely that this particular instruction sequence is not difficult to recover generally; rather, in this one sample the code analyzer lost track of a function. therefore, if we recovered code ranges on a large number of programs and matched them up with the strings extracted from the same files, we could build a database of strings that are likely instruction sequences. we could use this database as a fallback to further filter junk strings after the code recovery pass.

williballenthin commented 1 year ago

potential strategy: use lancelot (or similar) to recover code ranges. mask out all the non-code bytes in the input file. then run strings. any string that is emitted is a junk code string. aggregate, count, and index like normal.
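a minimal sketch of that strategy, assuming the code ranges have already been recovered and are available as half-open `(start, end)` file offsets. the function names and the input shape are hypothetical, and the string extractor is a plain ASCII regex rather than FLOSS's own extractor:

```python
import re
from collections import Counter


def mask_non_code(data: bytes, code_ranges: list[tuple[int, int]]) -> bytes:
    """Return a copy of `data` with every byte outside `code_ranges` zeroed.

    `code_ranges` holds half-open (start, end) file offsets; in practice these
    would come from a code recovery tool (assumption: not shown here).
    """
    masked = bytearray(len(data))  # starts as all zero bytes
    for start, end in code_ranges:
        masked[start:end] = data[start:end]
    return bytes(masked)


def ascii_strings(buf: bytes, min_len: int = 4) -> list[str]:
    """Extract runs of printable ASCII at least `min_len` bytes long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(buf)]


def count_junk_strings(samples, min_len: int = 4) -> Counter:
    """Count, per string, how many samples emit it from code bytes only.

    `samples` is an iterable of (file_bytes, code_ranges) pairs.
    """
    counts: Counter = Counter()
    for data, code_ranges in samples:
        masked = mask_non_code(data, code_ranges)
        # use a set so each sample contributes at most once per string
        counts.update(set(ascii_strings(masked, min_len)))
    return counts
```

note that zeroing the non-code bytes also truncates any string that straddles a code/data boundary, so only the part inside the code range is counted.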

williballenthin commented 1 year ago

then, evaluate this database against what the code recovery solution actually produces for input files: does it even do a better job than doing a disassembly analysis of the input file?
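one way to frame that evaluation, as a rough sketch: for a single input file, compare the strings filtered by the recovered code ranges with the strings filtered by the database. the function name and input shapes below are assumptions for illustration only:

```python
def compare_filters(strings, code_ranges, junk_db):
    """Compare two junk filters on one file.

    strings:     list of (file_offset, string) pairs extracted from the file
    code_ranges: half-open (start, end) offsets from the code recovery pass
    junk_db:     set of strings from the junk code database
    """
    by_analysis = {
        s
        for offset, s in strings
        if any(start <= offset < end for start, end in code_ranges)
    }
    by_database = {s for _, s in strings if s in junk_db}
    return {
        "both": by_analysis & by_database,
        # caught by code recovery but missing from the database
        "analysis_only": by_analysis - by_database,
        # caught only by the database: the cases where the fallback helps,
        # or false positives that need manual review
        "database_only": by_database - by_analysis,
    }
```

a large `database_only` set made up of real junk would suggest the database adds value beyond the disassembly analysis; a set made up of legitimate strings would suggest the opposite.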

Vulcanraven91 commented 1 year ago

What about something like this https://github.com/ergrelet/windiff but only with generic strings?

williballenthin commented 1 year ago

interesting. do you mean browsing strings from windows binaries across versions? or something else?

Vulcanraven91 commented 1 year ago

Yes, you could then also look up when the string occurred for the first time.

mr-tz commented 1 year ago

Attached are two JSONL files containing strings from the .text sections of thousands of C:\Windows native binaries, each occurring 100 times or more.

While not perfect, it's an easy approximation (with some obvious FPs).

The files are split by string length: 4-5 characters, and 6 or more characters.

This is only due to an extraction approach I took earlier.

text_section_strings.zip

mr-tz commented 1 year ago

163ca35 adds a junk code strings database and applies the tag #code-junk for now to compare. Not perfect, but can still help:

Note that the minimum string length here is 4 instead of 6.
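As a rough illustration of how such a database could drive the tagging, here is a sketch that is not the code from the commit. It assumes each JSONL line is an object with hypothetical `string` and `count` fields; the actual schema of the attached files may differ.

```python
import json


def load_junk_db(lines, min_count=100):
    """Build a set of junk strings from JSONL lines.

    Assumes (hypothetically) one {"string": ..., "count": ...} object per line.
    """
    db = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record["count"] >= min_count:
            db.add(record["string"])
    return db


def tag_strings(strings, junk_db, min_len=4):
    """Drop strings shorter than `min_len` and tag database hits with #code-junk."""
    tagged = []
    for s in strings:
        if len(s) < min_len:
            continue
        tags = {"#code-junk"} if s in junk_db else set()
        tagged.append((s, tags))
    return tagged
```

Tagging rather than dropping the matches keeps the comparison easy: the same output can be reviewed with and without the `#code-junk` strings hidden.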

mr-tz commented 1 year ago

Above and below are using 480ca51ba24be6f3ad72ce5282b28783. Below is with min_len = 6. Using a wider set of samples (VT etc.) will hopefully provide better results.

Before

After

mandiant / flare-floss

build a database of known junk code strings #773