mandiant / flare-floss

FLARE Obfuscated String Solver - Automatically extract obfuscated strings from malware.
Apache License 2.0

qs: automate the construction of databases #796

Open williballenthin opened 1 year ago

williballenthin commented 1 year ago

via #761 and @r0ny123

For example, to build the expert db, we can use GitHub CI to automatically add the strings from capa rules whenever a rule with a string is added or updated in the capa-rules repo.
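A minimal sketch of the extraction step such a CI job might run, assuming capa rules are `.yml` files that list string features on lines of the form `- string: "..."`. The function names `extract_rule_strings` and `collect_strings` are hypothetical and not part of FLOSS or capa. The sketch scans lines with a regex rather than parsing YAML, so it does not handle inline comments or YAML escape sequences:

```python
import re
from pathlib import Path

# Matches a list item such as:   - string: "cmd.exe /c"
STRING_FEATURE = re.compile(r"^\s*-\s*string:\s*(.+?)\s*$")


def extract_rule_strings(rule_text):
    """Return literal string features found in the text of one rule."""
    found = []
    for line in rule_text.splitlines():
        match = STRING_FEATURE.match(line)
        if not match:
            continue
        value = match.group(1)
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        # Assumption: values written as /.../ or /.../i are regex features,
        # which are not useful as literal database entries.
        if value.startswith("/") and (value.endswith("/") or value.endswith("/i")):
            continue
        found.append(value)
    return found


def collect_strings(rules_dir):
    """Collect the unique string features from every .yml file under rules_dir."""
    strings = set()
    for path in Path(rules_dir).rglob("*.yml"):
        strings.update(extract_rule_strings(path.read_text(encoding="utf-8")))
    return sorted(strings)
```

A job triggered on changes to the rules repo could call `collect_strings` on a checkout and diff the result against the current database contents.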

williballenthin commented 1 year ago

for the #common database, this took many hours to build: a dozen hours to fetch the samples from VT, a few hours to extract strings, and a few hours to index the results. I'm not sure this would fit within our GH Actions limits. I'm also not sure how frequently this data is likely to change, though it's certainly worth investigating.

williballenthin commented 1 year ago

the #expert database is pre-populated with strings from capa rules; however, this was honestly just a shortcut to get something in there. we would like the #expert database to be something that is super easy for users to update and contribute back, such as with a small TUI program or github PR.

I think there are actually many bad entries in the database today from capa, things like "kernel32.dll" etc. So, I'm hesitant to keep pulling these strings from capa automatically. Maybe we can tag updates to capa-rules with follow-up actions to manually update the #expert database when a good string is found?
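One way to support that manual step is a triage pass that separates likely low-signal strings from candidates worth a closer look. The sketch below is only an illustration: the heuristics (a minimum length and a bare module-name check) and the names `needs_review` and `triage` are assumptions, not anything FLOSS implements.

```python
# Hypothetical heuristics for flagging low-signal candidate strings.
LOW_SIGNAL_SUFFIXES = (".dll", ".exe", ".sys")


def needs_review(candidate, min_length=6):
    """Return True if the string looks too generic to add without review."""
    lowered = candidate.strip().lower()
    if len(lowered) < min_length:
        return True
    # A bare module name such as "kernel32.dll": a known suffix and no
    # spaces or path separators that would make it more specific.
    is_bare_name = not any(ch in lowered for ch in " \\/")
    return lowered.endswith(LOW_SIGNAL_SUFFIXES) and is_bare_name


def triage(candidates):
    """Split candidates into (accepted, flagged) lists, preserving order."""
    accepted, flagged = [], []
    for candidate in candidates:
        (flagged if needs_review(candidate) else accepted).append(candidate)
    return accepted, flagged
```

For instance, `triage(["kernel32.dll", "SeDebugPrivilege"])` puts `"kernel32.dll"` in the flagged list. Both lists would still go to a human reviewer. The flags only order the queue.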

r0ny123 commented 1 year ago

> for the #common database, this took many hours to build: a dozen hours to fetch the samples from VT, a few hours to extract strings, and a few hours to index the results. I'm not sure this would fit within our GH Actions limits. I'm also not sure how frequently this data is likely to change, though it's certainly worth investigating.

We can fetch that info from VT on a weekly or monthly basis. Regarding the GitHub Actions limit, we can leverage a cloud platform such as AWS. Actually, I like how OALabs/hashdb handles that.


> Maybe we can tag updates to capa-rules with follow-up actions to manually update the #expert database when a good string is found?

This is a good idea!