mandiant / flare-floss

FLARE Obfuscated String Solver - Automatically extract obfuscated strings from malware.
Apache License 2.0

build a database of known junk code strings #773

Open williballenthin opened 1 year ago

williballenthin commented 1 year ago

if our code recovery solution (lancelot or vivisect) fails to identify some code, then we may still display some junk strings that are actually instructions, like

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ .text ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫┃
It<Iu4P                                                             000099c3                  ┃┃
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ .rdata ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫┃

it's likely that this particular instruction sequence is not difficult to recover generally; rather, in this one sample the code analyzer lost track of a function. therefore, if we recovered code ranges on a large number of programs and matched them up with the strings extracted from the same files, we could build a database of strings that are likely instruction sequences. we could use this database as a fallback to further filter junk strings after the code recovery pass.

williballenthin commented 1 year ago

potential strategy: use lancelot (or similar) to recover code ranges. mask out all the non-code bytes in the input file. then run strings. any string that is emitted is a junk code string. aggregate, count, and index like normal.
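A minimal sketch of the masking strategy described above, in Python. It assumes the code ranges have already been recovered by some analyzer and are passed in as `(start, end)` file offsets; the function names, the ASCII-only regex, and the minimum length of 4 are illustrative assumptions, not part of FLOSS or lancelot.

```python
import re
from collections import Counter

MIN_LEN = 4  # assumed minimum string length
# Runs of printable ASCII bytes, the same idea as the classic `strings` tool.
ASCII_RE = re.compile(rb"[\x20-\x7e]{%d,}" % MIN_LEN)


def mask_non_code(buf, code_ranges):
    """Return a copy of buf where every byte outside code_ranges is zeroed.

    code_ranges is a list of (start, end) file offsets, end exclusive.
    How they are recovered (lancelot, vivisect, ...) is out of scope here.
    """
    masked = bytearray(len(buf))
    for start, end in code_ranges:
        masked[start:end] = buf[start:end]
    return bytes(masked)


def junk_code_strings(buf, code_ranges):
    """Strings found in the masked buffer lie entirely inside code."""
    masked = mask_non_code(buf, code_ranges)
    return [m.group().decode("ascii") for m in ASCII_RE.finditer(masked)]


def build_database(samples):
    """Count, per string, how many samples contain it as a junk code string.

    samples is an iterable of (buf, code_ranges) pairs.
    """
    counts = Counter()
    for buf, code_ranges in samples:
        counts.update(set(junk_code_strings(buf, code_ranges)))
    return counts
```

Zeroing the non-code bytes also splits any string that straddles a code/data boundary, so only the portion inside the code range is emitted. Counting once per sample (via `set`) keeps a single noisy file from dominating the database.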

williballenthin commented 1 year ago

then, evaluate this database against what the code recovery solution actually produces for input files: does it even do a better job than doing a disassembly analysis of the input file?
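One way to frame that evaluation, as a rough sketch: for a given file, compare the strings that fall inside recovered code ranges with the strings the database would flag. The `(offset, text)` input shape and the helper names are assumptions made for illustration, and the length check assumes one byte per character (ASCII strings).

```python
def in_code(offset, length, code_ranges):
    """True if [offset, offset + length) lies fully inside one code range."""
    return any(start <= offset and offset + length <= end
               for start, end in code_ranges)


def compare_filters(strings, code_ranges, database):
    """Compare the disassembly-based filter with the database-based filter.

    strings: list of (file_offset, text) pairs extracted from one file.
    code_ranges: list of (start, end) offsets from the code recovery pass.
    database: set (or dict) of known junk code strings.
    """
    by_analysis = {s for off, s in strings if in_code(off, len(s), code_ranges)}
    by_database = {s for _, s in strings if s in database}
    return {
        "both": by_analysis & by_database,
        "analysis_only": by_analysis - by_database,
        "database_only": by_database - by_analysis,
    }
```

The `database_only` bucket is the interesting one: it holds either junk the analyzer missed (the case the database is meant to catch) or false positives, so it needs manual review. A large `analysis_only` bucket would suggest the database adds little over the disassembly pass.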

Vulcanraven91 commented 1 year ago

What about something like this https://github.com/ergrelet/windiff but only with generic strings?

williballenthin commented 1 year ago

interesting. do you mean browsing strings from windows binaries across versions? or something else?

Vulcanraven91 commented 1 year ago

Yes, you could then also look up when the string occurred for the first time.

mr-tz commented 1 year ago

Attached are two JSONL files containing strings

While not perfect, it's an easy approximation (with some obvious FPs).

The files are split by string length:

  1. 4-5 characters and
  2. 6 or more characters

This is only due to an extraction approach I took earlier.

text_section_strings.zip

mr-tz commented 1 year ago

163ca35 adds a junk code strings database and applies the tag #code-junk for now to compare. Not perfect, but can still help:

[screenshot: 2023-06-07_15-11-12_WindowsTerminal]

Note that the minimum string length here is 4 instead of 6.
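For illustration, a database lookup of this kind could look roughly like the following. This is a hypothetical sketch and not the code from the commit mentioned above; only the `#code-junk` tag name comes from the thread, while `tag_strings`, its parameters, and the plain `set` used as the database are assumptions.

```python
def tag_strings(strings, junk_db, min_len=4):
    """Attach the #code-junk tag to strings found in the junk database.

    strings: iterable of extracted strings.
    junk_db: set of known junk code strings (hypothetical in-memory form).
    min_len: strings shorter than this are dropped before tagging.
    """
    tagged = []
    for s in strings:
        if len(s) < min_len:
            continue
        tags = {"#code-junk"} if s in junk_db else set()
        tagged.append((s, tags))
    return tagged
```

Tagging rather than dropping matches the "for now to compare" intent: the flagged strings stay visible so the database's hits can be checked by eye. Raising `min_len` to 6 would hide the short strings that are most likely to collide with real data.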

mr-tz commented 1 year ago

The results above and below use sample 480ca51ba24be6f3ad72ce5282b28783; the ones below use min_len = 6. Using a wider set of samples (VT etc.) will hopefully provide better results.

Before

[screenshot: 2023-06-07_15-34-23_WindowsTerminal]

After

[screenshot: 2023-06-07_15-32-01_WindowsTerminal]