Hello from CCCS! 🍁

cccs-rs commented 9 months ago

We're currently using this project as a plugin service in Assemblyline, our open source malware analysis platform, to perform reputation checks for network-based IOCs, and we were wondering if there are any tweaks or optimizations you can recommend for our use case.

We use it to check whether any extracted domains are typosquats of legitimate domains (based on a top domain list). We kind of do the reverse of what I believe the tool was intended for.

For example, we'll take a domain, say paypal0.com, run the tool to generate all variations of that domain, and then compare them with our top domain list. If there's a hit, we mark paypal0.com as a typosquat of the legitimate domain paypal.com. Source code
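The reverse check described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Assemblyline service code: `generate_variants` is a hypothetical stand-in for the tool's variation generator and covers just character omission and a few look-alike swaps, and `TOP_DOMAINS` is a made-up two-entry list.

```python
# Hypothetical look-alike swaps; the real tool covers many more algorithms.
SWAPS = {"0": "o", "1": "l", "o": "0", "l": "1"}

# Stand-in for a real top-domain list (assumption for this sketch).
TOP_DOMAINS = {"paypal.com", "google.com"}


def generate_variants(domain: str) -> set[str]:
    """Return a small set of variations of `domain` (illustrative only)."""
    name, _, tld = domain.rpartition(".")
    variants = set()
    for i, ch in enumerate(name):
        # Character omission: drop the character at position i.
        variants.add(name[:i] + name[i + 1:] + "." + tld)
        # Look-alike substitution: swap the character if a twin is known.
        if ch in SWAPS:
            variants.add(name[:i] + SWAPS[ch] + name[i + 1:] + "." + tld)
    variants.discard(domain)
    return variants


def find_legitimate_targets(domain: str, top_domains: set[str]) -> set[str]:
    """Return the top domains that `domain` may be typosquatting."""
    if domain in top_domains:
        return set()  # the domain is itself on the list
    return generate_variants(domain) & top_domains


print(find_legitimate_targets("paypal0.com", TOP_DOMAINS))
```

The cost of this approach grows with the number of variations generated per extracted domain, which is the scaling concern raised in this thread.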

We've explored the possibility of doing the pre-compute beforehand (which is more in tune with what the tool was meant for) but depending on the number of items in the list and the length of the domains + typosquat variations, it could take a very long time if we were doing this at image build time (especially if there's a possibility we may not even use some domains + typosquat variations).

What we've found is that this particular feature doesn't scale well with production workloads (where, let's say, a lot of domains are extracted). Understandably, that's because we aren't setting a limit on the number of possibilities to generate, since we want to be sure we leave no stone unturned.

If you have any recommendations on what we could do to improve this feature, it would be greatly appreciated! 😁

gallypette commented 7 months ago

Good idea, we are quite sad not to have had it first 😄 After a quick brainstorm with @DavidCruciani and @adulau, an idea would be to create Bloom filters using MISP warning-lists (https://github.com/MISP/misp-warninglists/blob/main/lists/tranco/list.json https://github.com/MISP/misp-warninglists/blob/main/lists/google-chrome-crux-1million/list.json)

@DavidCruciani launched the job for generating the list. Then we will put this into https://github.com/dcso/flor bloomfilters (or maybe https://github.com/hashlookup/private-search-set ?).

If you have any list of domains you are interested in, shoot ;)

cccs-rs commented 7 months ago

Thanks for the response! I may have to do some reading on BloomFilters and how they work.

I thought it might be of interest to specify the line that I think prevents us from scaling to production loads. As you can see, for every ambiguous domain we're computing every typosquat variation without putting a limit on the number of variants (which I think is probably what's slowing us down), then checking for any hits in the top domain set.

Do you think there are any tweaks we can make to the runAll() call to be more efficient while minimizing loss in detection, or is it your opinion that using Bloom filters for the lookups would be the optimization here?

adulau commented 7 months ago

@cccs-rs Please find the Bloomfilter generated from the MISP warning-list tranco 1 million domain list

The Bloomfilter is available at http://cra.circl.lu/trancosquatting.bloom (9G), and you can use any library that supports the DCSO format to read the file and run lookups, such as bloom, flor (Python), or fleur (C).

We have some ideas to make a service out of it so that the Bloom filter is loaded just once. Is there a way to add a cache service in Assemblyline?

gallypette commented 7 months ago

This filter contains all permutations for the tranco domain list.

Here is an example for paypa1.com:

jlouis@cpi:~$ echo "paypa1.com" | bloom check trancosquatting.bloom 
paypa1.com

This output means that paypa1.com may be in the permutations generated from the tranco domain list. You will notice that performance is poor because bloom has to load the 9G file on startup for this single query, but bloom supports lists of queries too.
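To illustrate why a hit only means "may be in the set", here is a minimal, self-contained Bloom filter sketch. It is a toy written for this explanation: it does not read or write the DCSO file format, and the size and hash-count values are arbitrary assumptions.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive k bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # A hit requires every derived bit to be set; unrelated items can
        # collide on all of them, which is where false positives come from.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Pre-compute step: store typosquat permutations of known-good domains.
permutations = BloomFilter()
for variant in ("paypa1.com", "paypal0.com", "paypall.com"):
    permutations.add(variant)

print("paypa1.com" in permutations)  # True: added items are always found
```

Because the filter stores only bits, a hit tells you the domain is probably a known permutation but not which legitimate domain it imitates; that mapping would need a separate lookup.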

As @adulau mentioned, one solution for Assemblyline would be to write a small service for the query part.

cccs-rs commented 7 months ago

Interesting. Since Assemblyline is all container-based, I don't think it would be impossible to create a dependency container (with persistent storage for caching) that downloads the Bloomfilter, loads it on startup, and responds to queries from the analytical service instances.

Supporting lists of queries will also help as we'll likely want to query more than 1 domain per task.
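A dependency container along these lines might look roughly like the sketch below, using only the Python standard library. Everything here is an assumption made for illustration: the JSON request shape, the port, and the `lookup` object, which stands in for whatever Bloom filter reader ends up being used (anything that supports the `in` operator works).

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_domains(lookup, domains):
    """Return the domains that the loaded filter reports as possible hits."""
    return [d for d in domains if d in lookup]


def make_handler(lookup):
    """Build a request handler bound to a filter loaded once at startup."""

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Hypothetical protocol: the body is a JSON list of domains.
            length = int(self.headers.get("Content-Length", 0))
            domains = json.loads(self.rfile.read(length) or b"[]")
            body = json.dumps({"hits": check_domains(lookup, domains)}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return Handler


def serve(lookup, host="0.0.0.0", port=8080):
    """Run the service; the caller loads `lookup` once before calling this."""
    HTTPServer((host, port), make_handler(lookup)).serve_forever()
```

Calling `serve(lookup)` after loading the filter keeps the expensive load out of the per-query path, and each request can carry all the domains extracted for one task.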

Lots to think on! 😁

typosquatter / ail-typo-squatting

Hello from CCCS! 🍁 #15