honeynet / cuckooml

CuckooML: Machine Learning for Cuckoo Sandbox
https://honeynet.github.io/cuckooml/

Resolving abbreviated malware names #9

Open So-Cool opened 8 years ago

So-Cool commented 8 years ago

Right now the first mapping found, i.e. the longest matched string, is used. To improve labelling, all possible matches need to be considered, and the most probable combination of abbreviations, i.e. the one that accounts for all of the sub-strings, should be chosen. For example, "adload" is currently split into "a" and "dload", with only the latter mapped (to downloader) and the "a" left unresolved. A better split would be "ad" (adware) and "load" (downloader).
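To illustrate the idea, here is a minimal sketch of a split that only accepts combinations covering the whole name. It is not the project's code: the `ABBREVIATIONS` table and the `split_name` and `expand` helpers are hypothetical names made up for this example, and choosing the split with the fewest pieces is just one possible stand-in for "most probable".

```python
from functools import lru_cache

# Hypothetical abbreviation table, for illustration only;
# the mapping actually used by the project may differ.
ABBREVIATIONS = {
    "ad": "adware",
    "load": "downloader",
    "dload": "downloader",
}


def split_name(name, table=ABBREVIATIONS):
    """Split `name` into known abbreviations that cover every character.

    Returns a list of abbreviations, or None if no full cover exists.
    When several full covers exist, the one with the fewest pieces wins.
    """

    @lru_cache(maxsize=None)
    def best(start):
        # Reaching the end of the name means everything before it is covered.
        if start == len(name):
            return ()
        candidates = []
        for end in range(start + 1, len(name) + 1):
            piece = name[start:end]
            if piece in table:
                rest = best(end)
                if rest is not None:
                    candidates.append((piece,) + rest)
        return min(candidates, key=len) if candidates else None

    result = best(0)
    return None if result is None else list(result)


def expand(name, table=ABBREVIATIONS):
    """Map each abbreviation in the chosen split to its full label."""
    parts = split_name(name, table)
    return None if parts is None else [table[p] for p in parts]


print(split_name("adload"))  # ['ad', 'load']
print(expand("adload"))      # ['adware', 'downloader']
```

With this table, the greedy longest match would pick "dload" and strand the leading "a", whereas the full-cover search rejects that split because "a" is not a known abbreviation. A name with no full cover returns `None`, so a caller could still fall back to the current longest-match behaviour.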

hgascon commented 8 years ago

How often does this occur? If there are not too many cases, the mappings could simply be added manually.

So-Cool commented 8 years ago

Not very often in the samples that I have, to be honest. Nevertheless, since there are quite a number of possible combinations, this could be useful in general. Let's see what happens with the labels once we reach the clustering stage.