Add the ability to select which functions or processes you wish to extract capabilities from

yelhamer commented 3 months ago

This PR adds the ability to select which function/process capa should extract capabilities from. The proposed syntax is as follows:

$ capa malware.exe --functions 0x645fa0,0x543dd0,0x630ac0 # static analysis
$ capa malware.log --processes 3288,4321,3234 # dynamic analysis
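A minimal sketch of how the two comma-separated arguments could be parsed with argparse. Only the flag names `--functions` and `--processes` come from the proposal above; the helpers `parse_addresses` and `parse_pids` and the program name are hypothetical and are not capa's actual code.

```python
import argparse
from typing import Set


def parse_addresses(value: str) -> Set[int]:
    # hypothetical helper: "0x645fa0,0x543dd0" -> {0x645fa0, 0x543dd0}
    # int(x, 0) accepts both hex ("0x10") and decimal ("16") literals
    return {int(part, 0) for part in value.split(",") if part.strip()}


def parse_pids(value: str) -> Set[int]:
    # hypothetical helper: "3288,4321" -> {3288, 4321}
    return {int(part) for part in value.split(",") if part.strip()}


parser = argparse.ArgumentParser(prog="capa-sketch")
parser.add_argument("input")
parser.add_argument("--functions", type=parse_addresses, default=set())
parser.add_argument("--processes", type=parse_pids, default=set())

# parse the static analysis example from above
args = parser.parse_args(["malware.exe", "--functions", "0x645fa0,0x543dd0,0x630ac0"])
```

Using sets means duplicate entries on the command line collapse naturally, and an empty set can stand for "no filter requested".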

I haven't added a test case for dynamic analysis. I am planning to do so once the Drakvuf feature extractor gets merged (#2143) since that's a big motivation for this PR.

Nuance: I couldn't find the right names for some of the internal variables, so please feel free to set them as you wish.

Thanks :)

Checklist

[ ] No CHANGELOG update needed
[ ] No new tests needed
[ ] No documentation update needed

williballenthin commented 3 months ago

First, I think this is a very reasonable feature to add, especially with the Drakvuf sandbox support! I'm happy that we should be able to remove similar logic in show-features (here) with these changes. It seems like there are multiple places that this would fit in.

There are two things to discuss:

1. the command line argument syntax, and
2. the design of the filtering.

For (1), I don't imagine any major problems, though we may want to consider if these arguments are commonly used enough to be shown all the time, or only in an "expert" mode, or suggested when the Drakvuf sandbox is encountered, etc. That being said, let's be careful not to get derailed by tiny details and hold up the merge. If there's debate, the arguments could always be considered "experimental" for a bit until we stabilize them.

For (2), the question is: how do we filter the functions/scopes/features, particularly in a way that enables further extension, if necessary? That is, say we want to also filter on threads or basic blocks, could we build this easily?

The primary place to consider is the function signature to find_capabilities (and related routines). I am concerned about adding too many optional arguments that become difficult to reason about. I'd prefer the signature of this core routine to be as succinct as possible.

An alternative design is to introduce wrappers around the feature extractors that can filter the scopes/features. They would act just like the underlying feature extractor, but would yield only the entries that are requested. Then, the wrapped feature extractor can be passed around as-is today. Notably, we can trivially create wrappers for different scopes/features without threading optional arguments around. Also, they could potentially be combined (though I don't think we're likely to really need this functionality).

For example:

class StaticFeatureExtractorFilter:
    def __init__(self, inner: StaticFeatureExtractor):
        self.inner = inner

    def __getattr__(self, name):
        if hasattr(self, name):
            # if the filter has an override, use that
            return getattr(self, name)
        # otherwise use the inner feature extractor
        return getattr(self.inner, name)

class FunctionFilter(StaticFeatureExtractorFilter):
    def __init__(self, inner: StaticFeatureExtractor, functions: Set[Address]):
        super().__init__(inner)
        self.functions = functions

    def get_functions(self):
        yield from (f for f in self.inner.get_functions() if f.address in self.functions)

Then we can use the filter like so:

wanted_functions: Set[Address] = get_wanted_functions_from_cli(args)
if wanted_functions:
    # if the user wants to only show specific functions, we filter them down
    feature_extractor = FunctionFilter(feature_extractor, wanted_functions)
else:
    # otherwise, we use the full feature extractor
    pass

...

# use either the filtered extractor or the full extractor interchangeably
find_capabilities(feature_extractor, ...)

Of course, we can filter processes/threads/basic blocks/etc. in the same way.
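As a rough illustration of the same wrapper idea applied to processes, here is a self-contained sketch. `Process` and `ToyDynamicExtractor` are stand-ins invented for this example, and `ProcessFilter` is a hypothetical name; capa's real dynamic extractor and handle types differ.

```python
from dataclasses import dataclass
from typing import Iterator, List, Set


@dataclass(frozen=True)
class Process:
    # stand-in for a process handle; capa's real type differs
    pid: int
    name: str


class ToyDynamicExtractor:
    # stand-in for a dynamic feature extractor
    def __init__(self, processes: List[Process]):
        self._processes = processes

    def get_processes(self) -> Iterator[Process]:
        yield from self._processes


class ProcessFilter:
    # wraps an extractor and yields only the requested processes
    def __init__(self, inner: ToyDynamicExtractor, pids: Set[int]):
        self.inner = inner
        self.pids = pids

    def get_processes(self) -> Iterator[Process]:
        yield from (p for p in self.inner.get_processes() if p.pid in self.pids)


extractor = ToyDynamicExtractor(
    [Process(3288, "a.exe"), Process(4321, "b.exe"), Process(999, "c.exe")]
)
filtered = ProcessFilter(extractor, {3288, 4321})
```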

Would you be open to discuss alternative designs like this @yelhamer? Or any thoughts @mr-tz @mike-hunhoff

yelhamer commented 3 months ago

@williballenthin I really like the design you proposed and I'm willing to implement it. Would the filter classes reside in the base_extractor.py file?

Also, should we make the base "StaticFeatureExtractorFilter" a child of the "StaticFeatureExtractor" class? I think there are some cases (if I am not mistaken) where we use the instance of the extractor we're passing to determine whether the analysis is static or dynamic.

williballenthin commented 3 months ago

I think there are some cases (if I am not mistaken) where we use the instance of the extractor we're passing to determine whether the analysis is static or dynamic.

Good point. Also I think this will better satisfy mypy.

OTOH, I'm not sure that the hasattr checks will work as written (since the base class has empty implementations of these) so that will take some tweaking. Should still be possible.

We should also pass along the inner name appropriately, since I think the metadata structure includes the feature extractor name.
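One way the points above could fit together, as a sketch under stated assumptions: `__getattr__` is only consulted when normal attribute lookup fails, so once the filter subclasses a base class that already defines every method, the fallback never fires for those methods and delegation has to be written out explicitly. `BaseExtractor` and `ToyExtractor` are stand-ins, functions are represented as plain integer addresses, and the `name` attribute is a hypothetical way to forward the inner extractor's name.

```python
from typing import Iterator, List, Set


class BaseExtractor:
    # stand-in for the abstract base class, which defines every method
    def get_functions(self) -> Iterator[int]:
        raise NotImplementedError

    def get_base_address(self) -> int:
        raise NotImplementedError


class ToyExtractor(BaseExtractor):
    def __init__(self, functions: List[int]):
        self._functions = functions

    def get_functions(self) -> Iterator[int]:
        yield from self._functions

    def get_base_address(self) -> int:
        return 0x400000


class FunctionFilter(BaseExtractor):
    # subclassing keeps isinstance() checks working; because the base class
    # already defines each method, __getattr__ would never be reached for
    # them, so delegation is written out explicitly instead
    def __init__(self, inner: BaseExtractor, functions: Set[int]):
        self.inner = inner
        self.functions = functions
        # hypothetical: expose the inner extractor's name for metadata
        self.name = type(inner).__name__

    def get_base_address(self) -> int:
        return self.inner.get_base_address()

    def get_functions(self) -> Iterator[int]:
        yield from (f for f in self.inner.get_functions() if f in self.functions)


filtered = FunctionFilter(ToyExtractor([0x401000, 0x402000, 0x403000]), {0x402000})
```

The cost of this variant is that every method of the base class needs a forwarding definition in the filter, which is more code but easier for a type checker to follow.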

yelhamer commented 3 months ago

@williballenthin I have also thought of the following possible implementation using a function factory:

def FunctionFilter(extractor: StaticFeatureExtractor, functions: Set) -> StaticFeatureExtractor:
    from types import MethodType

    get_functions = extractor.get_functions  # fetch original get_functions()

    def filtered_get_functions(self):
        yield from (f for f in get_functions() if f.address in functions)

    extractor.get_functions = MethodType(filtered_get_functions, extractor)
    return extractor

Then we can do as you suggested:

wanted_functions: Set[Address] = get_wanted_functions_from_cli(args)
if wanted_functions:
    # if the user wants to only show specific functions, we filter them down
    feature_extractor = FunctionFilter(feature_extractor, wanted_functions)
else:
    # otherwise, we use the full feature extractor
    pass

...

# use either the filtered extractor or the full extractor interchangeably
find_capabilities(feature_extractor, ...)

Another remaining question is whether we want to record which filters an extractor has installed. If so, we might consider storing the set of desired functions as an attribute of the extractor and referencing it internally in the get_functions() routine (without needing any wrapping).
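A small sketch of that attribute-based alternative, for comparison. `ToyExtractor` and the attribute name `wanted_functions` are made up for illustration, and functions are represented as plain integer addresses.

```python
from typing import Iterator, List, Optional, Set


class ToyExtractor:
    def __init__(self, functions: List[int]):
        self._functions = functions
        # None means "no filter installed"
        self.wanted_functions: Optional[Set[int]] = None

    def get_functions(self) -> Iterator[int]:
        for f in self._functions:
            # the extractor itself consults the installed filter, if any
            if self.wanted_functions is None or f in self.wanted_functions:
                yield f


extractor = ToyExtractor([0x401000, 0x402000, 0x403000])
unfiltered = list(extractor.get_functions())

extractor.wanted_functions = {0x401000, 0x403000}
filtered = list(extractor.get_functions())
```

The installed filter is easy to inspect this way, but each extractor's get_functions() has to implement the check itself.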

williballenthin commented 3 months ago

I have also thought of the following possible implementation using a function factory

This looks like it would also work. I'm not sure of the pros/cons right now, so perhaps try one of the new implementations and let's see how it feels?

yelhamer commented 3 months ago

@williballenthin @mr-tz , I have made some adjustments right now, please feel free to review them.

Also, this PR would make the Drakvuf extractor (and other extractors) faster, but I was also thinking of applying the function/process filters to the extractors when they are constructed, so that non-essential functions/processes don't get loaded in the first place.

My main motivation is that the Drakvuf extractor consumes too much RAM upon initialization, and it would be nice not to create Pydantic models for the (winapi/native) calls that won't be used anyway. I am thinking of doing that (once both PRs I have open are merged) by passing the list of target processes to the feature extractor when it gets constructed (i.e., passing them through the get_extractor_from_cli() method down to the extractor.from_raw_report() method), and I was wondering whether the approach we have in this PR makes sense for that goal. In other words, do you think there could be another approach to the problem of this PR that would allow us to filter unnecessary processes/functions upon extractor construction? Or is it better that we keep the two separate?
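To make the construction-time idea concrete, here is a toy sketch. The record layout (dicts with "pid" and "api" keys), the `Call` dataclass, and the `pids` parameter are all assumptions for illustration; they do not reflect the real Drakvuf report format or the actual from_raw_report() signature.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Set


@dataclass
class Call:
    # stand-in for the per-call model (a Pydantic model in the real extractor)
    pid: int
    api: str


class ToyReportExtractor:
    def __init__(self, calls: List[Call]):
        self.calls = calls

    @classmethod
    def from_raw_report(
        cls, records: Iterable[Dict], pids: Optional[Set[int]] = None
    ) -> "ToyReportExtractor":
        calls = []
        for record in records:
            # check the raw record first so that no model is built
            # for processes the user did not ask for
            if pids is not None and record["pid"] not in pids:
                continue
            calls.append(Call(pid=record["pid"], api=record["api"]))
        return cls(calls)


raw = [
    {"pid": 3288, "api": "CreateFileW"},
    {"pid": 999, "api": "Sleep"},
    {"pid": 3288, "api": "WriteFile"},
]
extractor = ToyReportExtractor.from_raw_report(raw, pids={3288})
```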

Idk if what I said makes sense :sweat_smile: , if it doesn't then please let me know and I'll try to reword it.

williballenthin commented 3 months ago

I see what you're saying about extractors using the CLI target args so they don't have to do so much work.

I think it's ok for the current strategy to work alongside extractors using the CLI args to do less work. Though I'd like to see how many changes are needed to thread this new information to the Drakvuf extractor.

Alternatively, if only Drakvuf will usually use this targets functionality, I wonder if we can instead provide some utilities to pre-process the report so that it's optimized for capa.

mr-tz commented 1 month ago

I'll defer to @williballenthin for final review and to comment on the maturity of this to be merged.

yelhamer commented 1 month ago

@williballenthin I think I have addressed all of the comments. Let me know if there's anything else I need to do :)

williballenthin commented 1 month ago

ah, the CLA failure is because GH associated the wrong email with my suggestions. well, it's me, and i'm covered under the CLA.

williballenthin commented 1 month ago

let's go!

mandiant / capa

Add the ability to select which functions or processes you wish to extract capabilities from #2156
