williballenthin opened 3 years ago
https://github.com/mandiant/capa/pull/1502#issuecomment-1584266328
**Entropy-Based Approach**

One approach is to calculate the entropy of rule matches. Entropy, in this context, is a measure of the distribution or variability of rule matches across a dataset. By calculating the entropy of each rule, we can determine how commonly or rarely a rule matches within the dataset.
Incorporating entropy as metadata for each rule in the capa report allows users to quickly assess the distribution and variability of rule matches. This information can aid in distinguishing between frequently occurring rule matches that may be less interesting and those that are relatively rare and more noteworthy.
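As a minimal sketch of the idea, the binary Shannon entropy of a rule's match rate can be computed from just two numbers: how many files the rule matched and the dataset size. (This is an illustration of the metric being discussed, not code from capa itself.)

```python
import math

def rule_entropy(match_count: int, total_files: int) -> float:
    """Binary Shannon entropy of a rule's match rate across a dataset.

    A rule that matches almost all files or almost none has entropy
    near 0; a rule matching about half the files has entropy near 1.
    """
    if total_files == 0:
        return 0.0
    p = match_count / total_files
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
```

A very rare rule (say, 5 matches in 100 files) thus gets a much lower entropy than one matching a third of the dataset, which is what lets us sort by variability.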
We can employ a rule ranking system: while displaying capa results, we can sort the matched rules based on the entropy levels in their metadata and display confidence levels.
While entropy provides a quantitative measure of rule variability, it's important to note that it may not capture all aspects of rule significance. Therefore, additional factors or heuristics might be necessary to provide a more comprehensive assessment. Continuous refinement and improvement of the analysis techniques can enhance the precision and conciseness of the entropy-based information in the capa report.
We can divide the datasets into groups such as "File Operations," "Network Communication," "Process Injection," etc. For each rule, we can assign weights to each group. We can store rule match statistics for each group within the executable itself. This ensures that the information is readily available and accessible when generating reports or displaying rule match details. While testing capa on a file, if we can categorize the file into a group, then we can sort the rule matches based on their entropy for that group and display their interestingness.
that's an interesting idea @Aayush-Goel-04. I can see how entropy has been used in ML systems like this before, so it might apply to capa rules and features, too.
that being said, i'd suggest that we also consider the simplest, easy to explain approach: running capa against a bunch of files and recording the number of hits per rule. we can collect this data and distribute the results with subsequent releases of capa. once this works (or doesnt), then we can explore ways to enhance the results if necessary, such as with the entropy idea. thoughts @Aayush-Goel-04 ?
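The simple approach above can be sketched in a few lines: run `capa --json` over a corpus, then tally how many reports each rule appears in. (The exact report layout is an assumption here — this sketch assumes a top-level `"rules"` mapping keyed by rule name, one JSON file per sample.)

```python
import json
from collections import Counter
from pathlib import Path

def count_rule_hits(report_dir: str) -> Counter:
    """Tally how many capa JSON reports each rule matched in.

    Assumes one `capa --json` output file per sample in report_dir,
    with matched rules under a top-level "rules" mapping.
    """
    hits = Counter()
    for path in Path(report_dir).glob("*.json"):
        report = json.loads(path.read_text())
        for rule_name in report.get("rules", {}):
            hits[rule_name] += 1
    return hits
```

Dividing each count by the number of samples yields the per-rule match probability that later comments discuss distributing with releases.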
That can be a good start. Also, if capa allows third-party rules, we need to define a standard format or mechanism for third-party rules to include their own statistics or contribute to the overall statistics collection.
We can add a field in rule meta for the probability of occurrence of each rule. Also, testing all rules on a large dataset would require a lot of time and compute; to start, we can try this with a small set of rules and a sample dataset, and then run on a simple exe file. Any thoughts @williballenthin?
@williballenthin I ran some tests for this. Below are the results.
Order of Capabilities currently shown
Order of capabilities after probability is integrated.
The rules are ordered with the least probability at the top.
The file below contains the number of occurrences of each rule across all capa-testfiles: entropy.xlsx
@williballenthin There are two options. We can either add an entropy field in the meta of each rule, which will be used while rendering:
```yaml
rule:
  meta:
    name:
    namespace:
    authors:
    scope: file
    mbc:
    references:
    examples:
    entropy: 10
```
Or we store the results within the executable itself.
For third party rules
@williballenthin what are your thoughts on the above comments?
> Or we store the results within the executable itself.
I think I prefer this strategy, since I think it would be a burden to expect rule authors to collect the prevalence of their rule as soon as they author it. Instead, we can try to periodically collect prevalence information and package it alongside capa for the common usecase.
I expect that we'll be able to provide a prevalence table derived from VT; however, this data isn't approved for public release yet. Let's assume it will be available for when we merge the final representation of this data and use your example data in the meantime.
> For third party rules ... or set default value as 0 or 1 for all third party rules.
I think this makes sense. And, it may encourage people to contribute their rules to the common set so they can see prevalence information.
Thank you @Aayush-Goel-04 for taking the time to update the rendering based on the prevalence. I like how it puts the "more interesting" rules towards the top.
I think if we'd want to use this format, we should display the prevalence in a column so that users can see why the ordering is the way it is.
Alternatively, I would like to explore finding a cutoff between "common" and "uncommon" and highlighting the rules that are uncommon (via a different output color and/or perhaps a `*` next to their name). This way, users don't have to guess about how to interpret the prevalence numbers and can rely on capa's recommendations. It also lets us use the existing output format (which is ordered by namespace, which has nice properties, like grouping of similar things).
> Instead, we can try to periodically collect prevalence information and package it alongside capa for the common usecase.
I am aware of one approach, which is to embed the results directly into the executable's resources using either JSON or pickle file formats. However, I'm interested to know if there are any alternative approaches available.
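One way the JSON option could look: load a prevalence table (here a hypothetical `prevalence.json` written by a release script) and fall back to an empty table when it isn't shipped, so rendering degrades gracefully. The filename and layout are assumptions for illustration, not capa's actual packaging.

```python
import json
from pathlib import Path

def load_prevalence(path):
    """Load a rule-name -> match-probability table shipped with the tool.

    `prevalence.json` here is a hypothetical data file that a release
    script would write; if it isn't present (e.g. in a dev checkout),
    return an empty table so every rule renders as "unknown".
    """
    try:
        return json.loads(Path(path).read_text())
    except FileNotFoundError:
        return {}
```

JSON seems preferable to pickle here, since pickle ties the data to Python internals and can execute arbitrary code on load.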
for highlighting rules I have the following ideas:

- rare: (0, 0.1)
- uncommon: (0.1, 0.3)
- common: (0.3, 1)

The ranges can be decided later on, based on how the final data is calculated.
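These buckets translate directly into a small mapping function; the thresholds below are the proposed ones and would change once the final dataset is computed.

```python
def prevalence_bucket(probability):
    """Map a rule's match probability to a display bucket.

    Thresholds are the provisional ones proposed in this thread;
    None means no prevalence data has been collected yet.
    """
    if probability is None:
        return "unknown"
    if probability < 0.1:
        return "rare"
    if probability < 0.3:
        return "uncommon"
    return "common"
```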
The rare ones can have a `*` next to their name, as you said. What are your thoughts @williballenthin?
@williballenthin, below are sample screenshots of the rendering.
I think the rendering below looks better. `rare` refers to prob < 0.05, or fewer than 30 matches for a rule across all capa-test-files; `common` refers to prob > 0.05. After filtering based on probability, they are ordered by namespace and name.
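The filter-then-sort step might look like this sketch: split matches into `rare` and `common` sections at the proposed 0.05 threshold, then sort each by (namespace, name). Treating rules missing from the table as probability 0.0 is one possible choice, not capa's actual behavior.

```python
def order_matches(matches, prevalence, threshold=0.05):
    """Order matched rules: `rare` section first, then `common`,
    each sorted by (namespace, name).

    matches: list of (namespace, name) pairs.
    prevalence: rule name -> match probability; missing rules
    default to 0.0 (one possible handling of unknown rules).
    """
    rare = sorted(m for m in matches if prevalence.get(m[1], 0.0) < threshold)
    common = sorted(m for m in matches if prevalence.get(m[1], 0.0) >= threshold)
    return rare + common
```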
I think this is pretty neat! How would you propose to handle new rules with no prevalence data (yet)? Show them as `unknown`?
Their entropy value will be taken as zero, they will be ordered by namespace, and their prevalence will be shown as `unknown`.
Ok, I wonder about these alternatives:
> show two tables
Instead of this, I think it would be better to separate them with a line in the table.
good idea, that could work well
The `common` (known entropy) and `unknown` (no prevalence data) ones can also be separated, but then there would be no sense in sorting by name and namespace. I propose only two sections; coloring can be discussed.
What are your thoughts @williballenthin @mr-tz?
I like it! Minor adjustments could be:
@mr-tz
`rare`: blue, `common`: cyan (the default color for capabilities), `unknown`: no color. We can decide on a color for rare.
In my opinion, the format in the 2nd image looks better.
Agreed, one and two look good. Green may suggest "good" (vs. red is "bad") in some context so we may want to stick to other colors.
@mr-tz, then I think it would be better to stick with the current coloring (cyan). Since the `rare` ones appear as a separate section, they can have the same color as capabilities; in the common section we can leave `unknown` uncolored and color `common` (cyan) in case no `rare` ones are present.
i wonder if we should color the rule name the same as the prevalence column. as is, we use color to convey information in one column (prevalence) but in another column (name) it's just for highlighting. i think this is confusing.
alternatively, maybe we could use different colors for names/prevalence, but then we run the risk of introducing too many colors.
good points, Willi, I like different colors if we can find a good selection
we can improve the report by showing how commonly a rule matches globally/against benign samples/against malware. this context can help a user decide if a match is interesting or not. for example "open a file" matches everywhere, so its not usually "interesting" while "encrypt with FakeM" is quite uncommon and therefore "interesting".
in order to do this, we need to collect wide scale statistics on where each capa rule matches. we also need a way to store/provide this information - embed in the rules? distribute within the standalone exe? and how does this interact with third-party rules?