Open danielplohmann opened 10 months ago
@danielplohmann I'd invite you to take a peek at our research called QUANTUMSTRAND:
QUANTUMSTRAND is an experiment that augments traditional strings.exe output with context to aid in malware analysis and reverse engineering. For example, we show the structure of a file alongside its strings and mute/highlight entries based on their global prevalence, library association, expert rules, and more. If the experiment proves successful, we'll merge the best features directly into FLOSS and deprecate the QUANTUMSTRAND codename.
Notably, we've been collecting databases of "tags" for strings that we find in executable files, including:
Perhaps we could collaborate on these ideas and/or database contents?
Hey @williballenthin, thanks so much for this pointer to your research! That's basically exactly what I was aiming for and have already started working on. It definitely makes sense to join forces here, so sure, happy to collaborate on this!
I'm currently reprocessing Malpedia with floss-3.0.1 and have also included a selection of benign code for which strings are extracted on the fly. I will check the results from the benign binaries against your databases of known strings and provide you with a diff of any meaningful additions I can identify. I've also improved my error handling, so I can provide you with a list of common errors encountered while flossing across Malpedia.
Today, I've used my ApiScout DbBuilder to parse all of the DLLs found on Win10 and diffed that against your API collection with the following results (qs_api_additions.zip):
120681 qs_api_additions_full.txt
8365 qs_api_additions.txt
Depending on whether you want the "full" or just the "common" set (based on prior research around WinAPI usage frequency), that's already a couple thousand additional WinAPI functions.
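For reference, the diff described above boils down to a set difference over API names. This is only a minimal sketch: the function name `api_additions` and the sample inputs are made up for illustration and are not part of ApiScout or the QUANTUMSTRAND databases.

```python
def api_additions(candidate_names, known_names):
    """Return names present in candidate_names but missing from known_names."""
    # Exact, case-sensitive comparison; any normalization (e.g. stripping
    # A/W suffixes) would be an additional design decision not made here.
    known = set(known_names)
    return sorted(set(candidate_names) - known)


# Hypothetical inputs; in practice these would be read from the two collections.
parsed_from_dlls = ["CreateFileW", "VirtualAlloc", "NtQueryTimerResolution"]
existing_db = ["CreateFileW", "VirtualAlloc"]
print(api_additions(parsed_from_dlls, existing_db))
```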
Further ideas for tagging that I had (a bunch possibly identifiable by regex as well) are
[+], [-], etc.
Do you have a list of ideas for entities somewhere as well that could be joined?
> Depending on if you want the "full" or just "common" set (based on prior research around WinAPI usage frequency) that's a couple thousand WinAPI functions in additions already.
Thank you! We'll add these to the databases for even greater coverage.
> Do you have a list of ideas for entities somewhere as well that could be joined?
Nothing more thorough than you listed here.
We imagined that analysts might be able to contribute regular expressions alongside some metadata (comments, tags, etc.) to be rendered nicely. Like "this is the OpenSSL version string" or "if you see this, panic!". I guess it's yet another rule format, but meant for extensibility of tools, not detection.
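To make the idea of "regular expressions alongside some metadata" a bit more concrete, here is a rough sketch of what such a rule could look like. The `StringRule` layout, the field names, and the OpenSSL pattern are all assumptions for illustration; this is not an existing FLOSS or QUANTUMSTRAND format.

```python
import re
from dataclasses import dataclass, field


@dataclass
class StringRule:
    # Hypothetical rule layout: a regex plus metadata to render next to a match.
    name: str
    pattern: str
    tags: list = field(default_factory=list)
    comment: str = ""

    def matches(self, s: str) -> bool:
        # re.search: the pattern may match anywhere inside the string.
        return re.search(self.pattern, s) is not None


RULES = [
    StringRule(
        name="openssl-version",
        pattern=r"OpenSSL \d+\.\d+\.\d+[a-z]*",  # illustrative pattern only
        tags=["lib", "openssl"],
        comment="this is the OpenSSL version string",
    ),
]


def annotate(s, rules=RULES):
    """Return (name, tags, comment) for every rule that matches s."""
    return [(r.name, r.tags, r.comment) for r in rules if r.matches(s)]


print(annotate("OpenSSL 1.1.1k  25 Mar 2021"))
```

Keeping the comment and tags as plain data means a tool only has to render them, which fits the "extensibility, not detection" goal.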
Personally, these various databases seem pretty useful, but I'm not quite sure when and how they'll see action. Maybe QS takes off. Or maybe we'll do an IDA plugin. Thoughts?
Okay, sounds good!
With respect to the tags I listed above, I went ahead and created some simple heuristics to apply them to the strings. Some are based on your string databases, others originate from regexes that I found, or from my own ideas and string DBs that I created. The statistics for the second iteration of my processing look like this:
go: 536389
dotnet: 534333
lib: 529743
mingw: 122327
rust: 80218
hex: 30617
winapi: 22159
msvc: 8468
path: 8430
file-extension: 6771
junk: 5834
sha256: 3070
sha1: 2949
url: 2586
common: 2294
sql: 1789
email: 1430
registry: 1270
ipv4: 1181
dbg_msg: 1058
md5: 896
user-agent: 856
lolbas: 752
pdb: 642
ipv6: 367
uuid: 232
language-id: 229
sha224: 91
sha512: 90
sha384: 30
pe-section: 19
powershell: 9
sid: 4
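As an illustration of how regex-based tags like the ones counted above could be assigned, here is a small sketch. The patterns are my own simplified assumptions (whole-string matches only) and are not the heuristics actually used to produce these statistics.

```python
import re
from collections import Counter

# Illustrative patterns only; each must match the whole string.
_OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
TAG_PATTERNS = {
    "md5": re.compile(r"^[0-9a-fA-F]{32}$"),
    "sha1": re.compile(r"^[0-9a-fA-F]{40}$"),
    "sha256": re.compile(r"^[0-9a-fA-F]{64}$"),
    "ipv4": re.compile(r"^(?:" + _OCTET + r"\.){3}" + _OCTET + r"$"),
    "uuid": re.compile(r"^[0-9a-fA-F]{8}-(?:[0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}$"),
    "url": re.compile(r"^https?://\S+$"),
}


def tag_string(s):
    """Return the sorted list of tags whose pattern matches s."""
    return sorted(tag for tag, rx in TAG_PATTERNS.items() if rx.match(s))


def tag_counts(strings):
    """Count how often each tag is assigned across a collection of strings."""
    counts = Counter()
    for s in strings:
        counts.update(tag_string(s))
    return counts


# Synthetic inputs, not real extracted strings.
print(tag_counts(["a" * 32, "10.0.0.1", "hello", "b" * 64]))
```

Note that hash tags assigned purely by length and alphabet will also fire on arbitrary hex blobs of the same length, so they are best read as hints.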
I created string DBs for file-extensions, LOLBAS, language-ids (like en-US and all of these), and default pe-sections that you can adopt if you are interested:
For operationalization, I definitely had similar thoughts, especially for an IDA plugin as a demo use case. So I went ahead and wrote a simple one that I pushed with the second iteration of the data set. Filtering/highlighting by occurrence frequency seems to work nicely and is a definite improvement over the stock IDA strings.
The key challenge seems to be filtering out trash strings on which I will spend a bit more time, I guess...
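Filtering by occurrence frequency can be sketched roughly as follows. This is not the code from the plugin mentioned above; the function names, the per-sample counting, and the 0.5 threshold are assumptions chosen for illustration.

```python
from collections import Counter


def build_prevalence(samples):
    """Count, for each string, the number of samples that contain it.

    samples: iterable of per-sample string collections.
    """
    counts = Counter()
    for strings in samples:
        counts.update(set(strings))  # count each string at most once per sample
    return counts


def classify(s, prevalence, total_samples, mute_above=0.5):
    """Return "mute" for strings seen in at least mute_above of all samples."""
    if total_samples <= 0:
        raise ValueError("total_samples must be positive")
    ratio = prevalence.get(s, 0) / total_samples
    return "mute" if ratio >= mute_above else "highlight"


# Hypothetical corpus of three samples.
corpus = [
    ["GetProcAddress", "evil.example"],
    ["GetProcAddress"],
    ["GetProcAddress", "kernel32.dll"],
]
prevalence = build_prevalence(corpus)
for s in ("GetProcAddress", "evil.example"):
    print(s, classify(s, prevalence, len(corpus)))
```

Counting per sample rather than per occurrence keeps a single file that repeats one string many times from skewing the prevalence.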
@danielplohmann @mr-tz
What if we took a large number of capa runs and joined "string references by function" with "capa matches by function" so that we could say "this string is often associated with... [DNS resolution or whatever]"?
(and we could do the reverse: "this capa rule is often associated with the strings...")
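A rough sketch of that join, in both directions, could look like this. It assumes the per-function string references and per-function capa matches have already been extracted into plain dictionaries keyed by (sample, function address); the extraction itself, the key format, and the example data are hypothetical and not taken from capa's result format.

```python
from collections import Counter, defaultdict


def cooccurrence(strings_by_function, rules_by_function):
    """Join strings and capa rule matches on the function they share.

    Both arguments map a function key, e.g. (sample_id, address), to a
    collection of strings or rule names. Returns two mappings:
    string -> Counter of rules, and rule -> Counter of strings.
    """
    rules_for_string = defaultdict(Counter)
    strings_for_rule = defaultdict(Counter)
    for func, strings in strings_by_function.items():
        rules = rules_by_function.get(func, ())
        for s in set(strings):
            for r in set(rules):
                rules_for_string[s][r] += 1
                strings_for_rule[r][s] += 1
    return rules_for_string, strings_for_rule


# Hypothetical, hand-written data for two samples.
strings_by_function = {
    ("sample_a", 0x401000): {"DnsQuery_A", "example.com"},
    ("sample_b", 0x402000): {"DnsQuery_A"},
    ("sample_b", 0x403000): {"cmd.exe"},
}
rules_by_function = {
    ("sample_a", 0x401000): {"resolve DNS"},
    ("sample_b", 0x402000): {"resolve DNS"},
    ("sample_b", 0x403000): {"create process"},
}

by_string, by_rule = cooccurrence(strings_by_function, rules_by_function)
print(by_string["DnsQuery_A"].most_common(1))
print(by_rule["resolve DNS"].most_common(2))
```

Raw counts like these would probably need normalizing by how common each string and rule is overall before they say much about association.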
Add tags to strings to give them semantic context. A taxonomy could encompass for example:
winapi: strings that are associated with Windows DLL files or WinAPI names
benign/library: strings that are found in benign software and/or libraries, like deflate, etc.
compiler: strings that are introduced as metadata by compilers