Introduce tags for strings

danielplohmann commented 10 months ago

Add tags to strings to give them semantic context. A taxonomy could include, for example:

winapi: strings that are associated with Windows DLL files or WinAPI names
benign / library: strings that are found in benign software and/or libraries, like deflate etc.
compiler: strings that are introduced as metadata by compilers

williballenthin commented 10 months ago

@danielplohmann I'd invite you to take a peek at our research called QUANTUMSTRAND:

QUANTUMSTRAND is an experiment that augments traditional strings.exe output with context to aid in malware analysis and reverse engineering. For example, we show the structure of a file alongside its strings and mute/highlight entries based on their global prevalence, library association, expert rules, and more. If the experiment proves successful, we'll merge the best features directly into FLOSS and deprecate the QUANTUMSTRAND codename.

Notably, we've been collecting databases of "tags" for strings that we find in executable files, including:

globally prevalent ("common") strings
junk strings (likely from code)
Windows API names
strings from popular OSS
"expert strings" and/or "expert string patterns" (not very complete yet)

Perhaps we could collaborate on these ideas and/or database contents?

danielplohmann commented 10 months ago

Hey @williballenthin, thanks so much for this pointer to your research! That's basically exactly what I was aiming for and have already started working on. It definitely makes sense to join forces here, so sure, happy to collaborate on this!

I'm currently reprocessing Malpedia with floss-3.0.1 and have also included a selection of benign code for which strings are extracted on the fly. I will check the results from the benign binaries against your databases of known strings and provide you with a diff of any meaningful additions I can identify. I've also improved my error handling, so I can provide you with a list of common errors encountered while flossing across Malpedia.

Today, I've used my ApiScout DbBuilder to parse all of the DLLs found on Win10 and diffed that against your API collection with the following results (qs_api_additions.zip):

 120681 qs_api_additions_full.txt
   8365 qs_api_additions.txt

Depending on if you want the "full" or just "common" set (based on prior research around WinAPI usage frequency) that's a couple thousand WinAPI functions in additions already.
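
The diff described above comes down to a set difference between two lists of API names. A minimal sketch, assuming one name per line; the sample names and the helper functions are hypothetical and not taken from ApiScout or the QUANTUMSTRAND databases:

```python
def load_names(lines):
    """Normalize an iterable of lines into a set of non-empty API names."""
    return {line.strip() for line in lines if line.strip()}


def api_additions(candidate, known):
    """Return names present in `candidate` but missing from `known`, sorted."""
    return sorted(load_names(candidate) - load_names(known))


# Hypothetical sample data standing in for the real export lists.
win10_exports = ["CreateFileW", "VirtualAlloc", "NtQueryObject", ""]
existing_db = ["CreateFileW", "VirtualAlloc"]
additions = api_additions(win10_exports, existing_db)
```

With real data, the two inputs would be open file handles for the exported name lists rather than inline lists.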

Further ideas for tagging that I had (a bunch possibly identifiable by regex as well) are

file extensions (as regularly found in inclusion/exclusion lists of ransomware)
user agents
registry paths
email addresses, URLs, URIs, IP addresses
system paths, with special treatment for PDB
common .NET function names (as complement to classic WinAPI)
UUIDs, SIDs, common cryptographic hashes, Base64
potential debug messages, prefixed with sequences like [+], [-] etc.
LOLBAS names
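
Several of the entities listed above lend themselves to regex-based tagging. The following is a rough sketch only: the patterns are deliberately simple illustrations, not the heuristics used in this project, and the tag names and the `tag_string` helper are assumptions.

```python
import re

# Illustrative patterns only; production taggers would need stricter validation.
OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
TAG_PATTERNS = {
    "ipv4": re.compile(r"^(?:" + OCTET + r"\.){3}" + OCTET + r"$"),
    "url": re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://\S+$"),
    "email": re.compile(r"^[\w.+-]+@[\w-]+(?:\.[\w-]+)+$"),
    "uuid": re.compile(r"^\{?[0-9a-fA-F]{8}-(?:[0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}\}?$"),
    "md5": re.compile(r"^[0-9a-fA-F]{32}$"),
    "sha1": re.compile(r"^[0-9a-fA-F]{40}$"),
    "sha256": re.compile(r"^[0-9a-fA-F]{64}$"),
    "registry": re.compile(r"^(?:HKEY_[A-Z_]+|HKLM|HKCU)\\", re.IGNORECASE),
    "pdb": re.compile(r"\.pdb$", re.IGNORECASE),
    "dbg_msg": re.compile(r"^\[[+\-!*]\]"),
    "sid": re.compile(r"^S-1-\d+(?:-\d+)+$"),
}


def tag_string(s):
    """Return the sorted list of tags whose pattern matches the string."""
    return sorted(tag for tag, pattern in TAG_PATTERNS.items() if pattern.search(s))
```

A string can receive several tags at once, so the function returns a list rather than a single label. A length-based hash pattern such as `md5` only says "32 hex characters", so it marks a candidate rather than a confirmed hash.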

Do you have a list of ideas for entities somewhere as well that could be joined?

williballenthin commented 10 months ago

> Depending on if you want the "full" or just "common" set (based on prior research around WinAPI usage frequency) that's a couple thousand WinAPI functions in additions already.

Thank you! We'll add these to the databases for even greater coverage.

> Do you have a list of ideas for entities somewhere as well that could be joined?

Nothing more thorough than you listed here.

We imagined that analysts might be able to contribute regular expressions alongside some metadata (comments, tags, etc.) to be rendered nicely. Like "this is the OpenSSL version string" or "if you see this, panic!". I guess it's yet another rule format, but meant for extensibility of tools, not detection.
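
One way to picture such a contributed rule is a regular expression bundled with a note and tags. This is only a sketch of the idea: the `ExpertRule` class, its fields, and the sample rule are hypothetical and do not reflect an existing QUANTUMSTRAND or FLOSS format.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ExpertRule:
    """Hypothetical rule: a regex plus metadata for display, not detection."""
    pattern: str
    note: str
    tags: list = field(default_factory=list)

    def __post_init__(self):
        # Compile once so repeated matching stays cheap.
        self._regex = re.compile(self.pattern)

    def matches(self, s):
        return self._regex.search(s) is not None


# Illustrative rule; the pattern is an assumption about the string's shape.
RULES = [
    ExpertRule(
        pattern=r"^OpenSSL \d+\.\d+\.\d+[a-z]*",
        note="this is the OpenSSL version string",
        tags=["lib", "openssl"],
    ),
]


def annotate(s, rules=RULES):
    """Return (note, tags) for every rule that matches the string."""
    return [(rule.note, rule.tags) for rule in rules if rule.matches(s)]
```

A tool could render the note next to the string and use the tags for muting or highlighting.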

Personally, these various databases seem pretty useful, but I'm not quite sure when and how they'll see action. Maybe QS takes off. Or maybe we'll do an IDA plugin. Thoughts?

danielplohmann commented 10 months ago

Okay, sounds good!

With respect to the tags I listed above, I went ahead and created some simple heuristics to apply them to the strings. Some are based on your string databases; others originate from regexes that I found, from my own ideas, or from string DBs that I created. The statistics for the second iteration of my processing look like this:

            go:  536389
        dotnet:  534333
           lib:  529743
         mingw:  122327
          rust:   80218
           hex:   30617
        winapi:   22159
          msvc:    8468
          path:    8430
file-extension:    6771
          junk:    5834
        sha256:    3070
          sha1:    2949
           url:    2586
        common:    2294
           sql:    1789
         email:    1430
      registry:    1270
          ipv4:    1181
       dbg_msg:    1058
           md5:     896
    user-agent:     856
        lolbas:     752
           pdb:     642
          ipv6:     367
          uuid:     232
   language-id:     229
        sha224:      91
        sha512:      90
        sha384:      30
    pe-section:      19
    powershell:       9
           sid:       4

go/dotnet/rust tags are applied when the string is found in a binary for which the corresponding FLOSS language processor was used.
lib/mingw/msvc tags are applied when the string is found in any of the compiler ground truth that I added.
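
The two rules above can be sketched as a lookup that combines the file's detected language with per-tag sets of known strings. The function name, its parameters, and the sample ground truth are hypothetical; the actual processing may work differently.

```python
LANGUAGE_TAGS = {"go", "dotnet", "rust"}


def context_tags(s, file_language, ground_truth):
    """Tag a string by its file's language and by membership in ground-truth sets.

    `file_language` is the language processor reported for the binary (or None);
    `ground_truth` maps a tag such as "lib", "mingw", or "msvc" to a set of strings.
    """
    tags = set()
    if file_language in LANGUAGE_TAGS:
        tags.add(file_language)
    for tag, known_strings in ground_truth.items():
        if s in known_strings:
            tags.add(tag)
    return sorted(tags)


# Hypothetical ground truth for illustration.
sample_truth = {
    "mingw": {"Mingw-w64 runtime failure:"},
    "msvc": {"Unknown exception"},
}
```

Note that under this scheme a language tag applies to every string in the binary, which would explain why those tags dominate the statistics.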

I created string DBs for file-extensions, LOLBAS, language-ids (like en-US and all of these), and default pe-sections that you can adopt if you are interested:

string_dbs.zip

For operationalization, I definitely had similar thoughts, especially about an IDA plugin as a demo use case. So I went ahead and wrote a simple one that I pushed with the second iteration of the data set. Filtering/highlighting by occurrence frequency seems to work nicely and is a definite improvement over the stock IDA strings view.
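
Filtering by occurrence frequency could look roughly like the sketch below. It assumes a precomputed mapping from each string to the number of families it occurs in; the function name, the data layout, and the 5% threshold are illustrative assumptions, not the plugin's actual logic.

```python
def classify_by_prevalence(strings, family_counts, total_families, mute_above=0.05):
    """Split strings into (highlighted, muted) by the share of families containing them.

    Strings seen in more than `mute_above` of all families are muted as common;
    rare or unseen strings are highlighted.
    """
    highlighted, muted = [], []
    for s in strings:
        share = family_counts.get(s, 0) / total_families
        if share > mute_above:
            muted.append(s)
        else:
            highlighted.append(s)
    return highlighted, muted


# Hypothetical counts for illustration.
counts = {"kernel32.dll": 900, "Global\\MsWinZonesCacheCounterMutexA": 2}
rare, common = classify_by_prevalence(
    ["kernel32.dll", "Global\\MsWinZonesCacheCounterMutexA", "never-seen"],
    counts,
    total_families=1000,
)
```

Strings that are absent from the mapping count as rare, so previously unseen strings stay visible.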

The key challenge seems to be filtering out trash strings, on which I will spend a bit more time, I guess...

williballenthin commented 8 months ago

@danielplohmann @mr-tz

What if we took a large number of capa runs and joined "string references by function" with "capa matches by function" so that we could say "this string is often associated with... [DNS resolution or whatever]"?

(and we could do the reverse: "this capa rule is often associated with the strings...")
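
The proposed join can be sketched as a co-occurrence count in both directions. This assumes the capa and string data have already been reduced to two mappings keyed by (sample, function address); that layout, the function name, and the sample values are hypothetical and are not capa's output format.

```python
from collections import Counter, defaultdict


def associate(strings_by_function, rules_by_function):
    """Count how often each string and each rule occur in the same function.

    Returns two mappings: string -> Counter of rules, and rule -> Counter of strings.
    """
    by_string = defaultdict(Counter)
    by_rule = defaultdict(Counter)
    for func, strings in strings_by_function.items():
        rules = set(rules_by_function.get(func, ()))
        for s in set(strings):
            for rule in rules:
                by_string[s][rule] += 1
                by_rule[rule][s] += 1
    return by_string, by_rule


# Hypothetical inputs for illustration.
strings_by_function = {
    ("a.exe", 0x401000): ["example.com", "GET"],
    ("b.exe", 0x402000): ["example.com"],
}
rules_by_function = {
    ("a.exe", 0x401000): ["resolve DNS", "send HTTP request"],
    ("b.exe", 0x402000): ["resolve DNS"],
}
by_string, by_rule = associate(strings_by_function, rules_by_function)
```

Calling `by_string[s].most_common()` then gives the rules most often seen alongside a string, and `by_rule[rule].most_common()` gives the reverse view. Raw counts favor globally prevalent strings, so some normalization would likely be needed in practice.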

malpedia / malpedia-flossed
