williballenthin opened 3 years ago
https://github.com/mandiant/capa/pull/1502#issuecomment-1584266328
**Entropy-Based Approach**

One approach is to calculate the entropy of rule matches. Entropy, in this context, is a measure of the distribution or variability of rule matches across a dataset. By calculating the entropy of each rule, we can determine how commonly or rarely a rule matches within the dataset.
Incorporating entropy as metadata for each rule in the capa report allows users to quickly assess the distribution and variability of rule matches. This information can aid in distinguishing between frequently occurring rule matches that may be less interesting and those that are relatively rare and more noteworthy.
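As a minimal sketch of the idea, the binary Shannon entropy of a rule's match rate can be computed from just two numbers: how many files the rule matched and the dataset size. (This is an illustration of the metric being discussed, not code from capa itself.)

```python
import math

def rule_entropy(match_count: int, total_files: int) -> float:
    """Binary Shannon entropy of a rule's match rate across a dataset.

    A rule that matches almost all files or almost none has entropy
    near 0; a rule matching about half the files has entropy near 1.
    """
    if total_files == 0:
        return 0.0
    p = match_count / total_files
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
```

A very rare rule (say, 5 matches in 100 files) thus gets a much lower entropy than one matching a third of the dataset, which is what lets us sort by variability.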
We can employ a rule ranking system: while displaying capa results, we can sort the matched rules based on the entropy levels in their metadata and display confidence levels.
While entropy provides a quantitative measure of rule variability, it's important to note that it may not capture all aspects of rule significance. Therefore, additional factors or heuristics might be necessary to provide a more comprehensive assessment. Continuous refinement and improvement of the analysis techniques can enhance the precision and conciseness of the entropy-based information in the capa report.
We can divide the datasets into groups such as "File Operations," "Network Communication," "Process Injection," etc. For each rule, we can assign weights to each group. We can store rule match statistics for each group within the executable itself. This ensures that the information is readily available and accessible when generating reports or displaying rule match details. While testing capa on a file, if we can categorize the file into a group, then we can sort the rule matches based on their entropy for that group and display their interestingness.
that's an interesting idea @Aayush-Goel-04. I can see how entropy has been used in ML systems like this before, so it might apply to capa rules and features, too.
that being said, i'd suggest that we also consider the simplest, easy to explain approach: running capa against a bunch of files and recording the number of hits per rule. we can collect this data and distribute the results with subsequent releases of capa. once this works (or doesnt), then we can explore ways to enhance the results if necessary, such as with the entropy idea. thoughts @Aayush-Goel-04 ?
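The simple approach above can be sketched in a few lines: run `capa --json` over a corpus, then tally how many reports each rule appears in. (The exact report layout is an assumption here — this sketch assumes a top-level `"rules"` mapping keyed by rule name, one JSON file per sample.)

```python
import json
from collections import Counter
from pathlib import Path

def count_rule_hits(report_dir: str) -> Counter:
    """Tally how many capa JSON reports each rule matched in.

    Assumes one `capa --json` output file per sample in report_dir,
    with matched rules under a top-level "rules" mapping.
    """
    hits = Counter()
    for path in Path(report_dir).glob("*.json"):
        report = json.loads(path.read_text())
        for rule_name in report.get("rules", {}):
            hits[rule_name] += 1
    return hits
```

Dividing each count by the number of samples yields the per-rule match probability that later comments discuss distributing with releases.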
That can be a good start. Also, if capa allows third-party rules, we need to define a standard format or mechanism for third-party rules to include their own statistics or contribute to the overall statistics collection.
We can add a field in rule meta for the probability of occurrence of each rule. Also, testing all rules on a large dataset would require a lot of time and compute; to start, we can try this with a small set of rules and a sample dataset, and then run on a simple exe file. Any thoughts @williballenthin?
@williballenthin I ran some tests for this. Below are the results.
Order of Capabilities currently shown
Order of capabilities after probability is integrated.
The rules are ordered with the least probability at the top.
The file below contains the number of occurrences of each rule across all capa-testfiles: entropy.xlsx
@williballenthin There are two options. We can either add an entropy field in the meta of each rule, which will be used while rendering:
```yaml
rule:
  meta:
    name:
    namespace:
    authors:
    scope: file
    mbc:
    references:
    examples:
    entropy: 10
```
Or we store the results within the executable itself.
For third party rules
@williballenthin what are your thoughts on the above comments?
> Or we store the results within the executable itself.
I think I prefer this strategy, since I think it would be a burden to expect rule authors to collect the prevalence of their rule as soon as they author it. Instead, we can try to periodically collect prevalence information and package it alongside capa for the common usecase.
I expect that we'll be able to provide a prevalence table derived from VT; however, this data isn't approved for public release yet. Let's assume it will be available for when we merge the final representation of this data and use your example data in the meantime.
> For third party rules ... or set default value as 0 or 1 for all third party rules.
I think this makes sense. And, it may encourage people to contribute their rules to the common set so they can see prevalence information.
Thank you @Aayush-Goel-04 for taking the time to update the rendering based on the prevalence. I like how it puts the "more interesting" rules towards the top.
I think if we'd want to use this format, we should display the prevalence in a column so that users can see why the ordering is the way it is.
Alternatively, I would like to explore finding a cutoff between "common" and "uncommon" and highlighting the rules that are uncommon (via a different output color and/or perhaps a `*` next to their name). This way, users don't have to guess about how to interpret the prevalence numbers and can rely on capa's recommendations. It also lets us use the existing output format (which is ordered by namespace, which has nice properties, like grouping of similar things).
> Instead, we can try to periodically collect prevalence information and package it alongside capa for the common usecase.
I am aware of one approach, which is to embed the results directly into the executable's resources using either JSON or pickle file formats. However, I'm interested to know if there are any alternative approaches available.
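One way the JSON option could look: load a prevalence table (here a hypothetical `prevalence.json` written by a release script) and fall back to an empty table when it isn't shipped, so rendering degrades gracefully. The filename and layout are assumptions for illustration, not capa's actual packaging.

```python
import json
from pathlib import Path

def load_prevalence(path):
    """Load a rule-name -> match-probability table shipped with the tool.

    `prevalence.json` here is a hypothetical data file that a release
    script would write; if it isn't present (e.g. in a dev checkout),
    return an empty table so every rule renders as "unknown".
    """
    try:
        return json.loads(Path(path).read_text())
    except FileNotFoundError:
        return {}
```

JSON seems preferable to pickle here, since pickle ties the data to Python internals and can execute arbitrary code on load.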
for highlighting rules I have the following ideas:

- rare: (0, 0.1)
- uncommon: (0.1, 0.3)
- common: (0.3, 1)

The ranges can be decided later on, based on how the final data is calculated.
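These buckets translate directly into a small mapping function; the thresholds below are the proposed ones and would change once the final dataset is computed.

```python
def prevalence_bucket(probability):
    """Map a rule's match probability to a display bucket.

    Thresholds are the provisional ones proposed in this thread;
    None means no prevalence data has been collected yet.
    """
    if probability is None:
        return "unknown"
    if probability < 0.1:
        return "rare"
    if probability < 0.3:
        return "uncommon"
    return "common"
```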
The rare ones can have a `*` next to their name, as you said. What are your thoughts @williballenthin?
@williballenthin, below are sample screenshots of the rendering.
I think the rendering below looks better. `rare` refers to prob < 0.05, or fewer than 30 matches for a rule across all capa-test-files; `common` refers to prob > 0.05. After filtering based on probability, they are ordered by namespace and name.
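The filter-then-sort step might look like this sketch: split matches into `rare` and `common` sections at the proposed 0.05 threshold, then sort each by (namespace, name). Treating rules missing from the table as probability 0.0 is one possible choice, not capa's actual behavior.

```python
def order_matches(matches, prevalence, threshold=0.05):
    """Order matched rules: `rare` section first, then `common`,
    each sorted by (namespace, name).

    matches: list of (namespace, name) pairs.
    prevalence: rule name -> match probability; missing rules
    default to 0.0 (one possible handling of unknown rules).
    """
    rare = sorted(m for m in matches if prevalence.get(m[1], 0.0) < threshold)
    common = sorted(m for m in matches if prevalence.get(m[1], 0.0) >= threshold)
    return rare + common
```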
I think this is pretty neat! How would you propose to handle new rules with no prevalence data (yet)? Show them as `unknown`?
Their entropy value will be taken as zero, they will be ordered by namespace, and their prevalence will be shown as `unknown`.
Ok, I wonder about these alternatives:
> show two tables
Instead of this, I think it would be better to separate them with a line in the table.
good idea, that could work well
The `common` (known entropy) and `unknown` (no prevalence data) ones can also be separated, but then there would be no sense in sorting by name and namespace. I propose only two sections; coloring can be discussed.
What are your thoughts @williballenthin @mr-tz?
I like it! Minor adjustments could be:
@mr-tz
`rare`: blue, `common`: cyan (the default color for capabilities), `unknown`: no color. We can decide on a color for rare.
In my opinion, the format in the 2nd image looks better.
Agreed, one and two look good. Green may suggest "good" (vs. red is "bad") in some context so we may want to stick to other colors.
@mr-tz, then I think it would be better to stick with the current coloring (cyan). Since the `rare` ones appear as a separate section, they can have the same color as capabilities; in the common section we can leave `unknown` uncolored and color `common` (cyan) in case no `rare` ones are present.
i wonder if we should color the rule name the same as the prevalence column. as is, we use color to convey information in one column (prevalence) but in another column (name) it's just for highlighting. i think this is confusing.
alternatively, maybe we could use different colors for names/prevalence, but then we run the risk of introducing too many colors.
good points, Willi, I like different colors if we can find a good selection
we can improve the report by showing how commonly a rule matches globally/against benign samples/against malware. this context can help a user decide if a match is interesting or not. for example "open a file" matches everywhere, so its not usually "interesting" while "encrypt with FakeM" is quite uncommon and therefore "interesting".
in order to do this, we need to collect wide scale statistics on where each capa rule matches. we also need a way to store/provide this information - embed in the rules? distribute within the standalone exe? and how does this interact with third-party rules?