mlsecproject / combine

Tool to gather Threat Intelligence indicators from publicly available sources
https://www.mlsecproject.org/
GNU General Public License v3.0

use more of the feed #84

Closed: paulpc closed this issue 9 years ago

paulpc commented 9 years ago

Why not keep some of the metadata from the feeds and use it for enrichment? For example, the AlienVault feed has some interesting information about why an IP is listed, which would make for better context.

alexcpsec commented 9 years ago

Yes, I agree with that 100%. One of the objectives of the plugin model in #23 is to have precise parsing for each feed and to normalize the different types of information we get from each one.

There are some prerequisites to get there, but if you want to help us categorize the different kinds of metadata the feeds have, we would greatly appreciate it.

paulpc commented 9 years ago

On my way there. First, I'm working on pushing the results into CRITs (https://github.com/crits/crits).

alexcpsec commented 9 years ago

Nice! :+1: Let me know how that goes :)

alexcpsec commented 9 years ago

Hey @mgoffin, can you give us some pointers on the best way to integrate with CRITs?

mgoffin commented 9 years ago

Sure! The best thing for now would probably be to write a script using the CRITs API to consume the feed and ingest it into CRITs. Ultimately what would be more beneficial is to create a service which has the ability to pull down the feed, parse it, display results to a user, and let them "approve" which items to accept into the system. That will probably become the standard model for any service(s) that deal with feed ingestion.
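The script approach Mike describes boils down to building a payload per indicator and POSTing it to the CRITs authenticated API. A minimal sketch of that shape, using only the standard library; the host, endpoint path, and field names here are assumptions for illustration, so check the CRITs wiki for the actual API contract:

```python
import json
import urllib.parse
import urllib.request

# Hypothetical CRITs instance and endpoint; replace with your deployment.
CRITS_URL = "https://crits.example.org/api/v1/ips/"


def build_ip_payload(ip, source, username, api_key):
    """Assemble the form fields for one IP indicator (field names assumed)."""
    return {
        "api_key": api_key,
        "username": username,
        "source": source,
        "ip": ip,
        "ip_type": "Address - ipv4-addr",
    }


def submit_ip(ip, source, username, api_key):
    """POST a single indicator and return the parsed JSON response."""
    data = urllib.parse.urlencode(
        build_ip_payload(ip, source, username, api_key)
    ).encode()
    with urllib.request.urlopen(CRITS_URL, data=data) as resp:
        return json.load(resp)
```

The "service" model Mike mentions would wrap the same submission logic behind a review step instead of ingesting everything unconditionally.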

alexcpsec commented 9 years ago

Cool! I saw the services API on the Wiki, I guess that is what you mean, right?

Do you have a reference implementation I could look at, maybe? It sounds like a good idea to build this integration as more and more people use CRITs.

mgoffin commented 9 years ago

You'll want to check out the Authenticated API on the wiki. It gives some examples and such. It's not 100% complete, but you can read and write all of the different TLOs; you just can't do updates or removals.

alexcpsec commented 9 years ago

:+1: Thanks!

paulpc commented 9 years ago

@alexcpsec @mgoffin I'm working on uploading the IOCs via the web API. It's a bit too slow (a few hours for the 300K+ IPs), so I'll try multithreading, and if that works better, I'll submit the code for review.

mgoffin commented 9 years ago

I'll note that we haven't tried hammering the API like that before, so we don't have any useful benchmarks for what speeds we should be getting :)

paulpc commented 9 years ago

As for the original topic of richer context: before Combine, I wrote something to do this by extracting STIX-like fields from the sources. I did that by defining my sources in this format:

{
  "impact": "high", 
  "source": "malwareDomainList",
  "campaign":"testCampaign", 
  "confidence": "medium", 
  "format": "^\\\".*\\\"\\,\\\"(.*?)\\\"\\,\\\"(\\d+\\.\\d+\\.\\d+\\.\\d+|-)\\\"\\,\\\"(.*?)\\\"\\,\\\".*?\\\"\\,\\\".*?\\\"\\,\\\"(\\d+|-)\\\"", 
  "reference": "http://www.malwaredomainlist.com/updatescsv.php", 
  "fields": ["URI - URL", "Address - ipv4-addr", "URI - Domain Name","Address - asn"] 
}
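Applying a definition like the one above comes down to compiling the `format` regex and zipping the captured groups with `fields`. A rough sketch of that idea; the sample CSV line is made up for illustration, in the malwaredomainlist layout:

```python
import json
import re

# The source definition from above (trimmed to the relevant keys). The regex
# backslashes are JSON-escaped, so json.loads hands re.compile the pattern
# exactly as intended.
source_def = json.loads(r'''
{
  "source": "malwareDomainList",
  "format": "^\\\".*\\\"\\,\\\"(.*?)\\\"\\,\\\"(\\d+\\.\\d+\\.\\d+\\.\\d+|-)\\\"\\,\\\"(.*?)\\\"\\,\\\".*?\\\"\\,\\\".*?\\\"\\,\\\"(\\d+|-)\\\"",
  "fields": ["URI - URL", "Address - ipv4-addr", "URI - Domain Name", "Address - asn"]
}
''')


def parse_line(line, definition):
    """Map one feed line to STIX-like field names, or None if it doesn't match."""
    match = re.match(definition["format"], line)
    if match is None:
        return None
    return dict(zip(definition["fields"], match.groups()))


# Made-up line following the malwaredomainlist CSV layout:
sample = '"2014/10/21_10:00","evil.example.com/bad.php","203.0.113.7","evil.example.com","Bad Actor","desc","64500"'
ioc = parse_line(sample, source_def)
```

The per-source `impact`, `campaign`, and `confidence` values would then be attached to every IOC the definition yields.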

My intention was to design a relationship engine for all the IOCs and upload them into CRITs with their relationships intact, but I never got to it.

paulpc commented 9 years ago

@alexcpsec and @mgoffin, here's my single-threaded code: https://github.com/paulpc/combine. I'll wait until I can get better performance before I submit an official pull request.

alexcpsec commented 9 years ago

@paulpc Got the gist of it by looking at your code, nice work. To speed things up by making the requests parallel, I would suggest you have a look at the grequests package we are using on reaper.py.

paulpc commented 9 years ago

@alexcpsec, I'll give it a look. I implemented the multithreading manually and it was about 25% faster for 5,380 IPs/domains; I'm not sure it's worth the added code complexity yet.

Fetching inbound URLs
Fetching outbound URLs
Storing raw feeds in harvest.json
Loading raw feed data from harvest.json
Evaluating http://www.projecthoneypot.org/list_of_ips.php?rss=1
Parsing feed from http://www.projecthoneypot.org/list_of_ips.php?rss=1
Evaluating http://www.openbl.org/lists/base_30days.txt
Parsing feed from http://www.openbl.org/lists/base_30days.txt
Evaluating http://www.blocklist.de/lists/ssh.txt
Parsing feed from http://www.blocklist.de/lists/ssh.txt
Parsing feed from http://www.malwaregroup.com/ipaddresses
Parsing feed from http://malc0de.com/bl/IP_Blacklist.txt
Parsing feed from http://www.nothink.org/blacklist/blacklist_malware_dns.txt
Storing parsed data in crop.json
Reading processed data from crop.json
1413901165.67 *** trying single thread***
1413901165.67 reading configs
1413901165.67 going through list
don't yet know what to do with: None[ckaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa]successfully added 5380 IP addresses and 201 domains

1413901615.68 done in  450.012173176  seconds
make sure you have the following sources in CRITs: [u'www.projecthoneypot.org', u'www.openbl.org', u'www.blocklist.de', u'www.malwaregroup.com', u'malc0de.com', u'www.nothink.org']
1413901615.68 *** trying multi thread***
1413901615.68 reading configs
1413901615.69 initializing queue
1413901615.7 starting threads
don't yet know what to do with: None[ckaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa]

1413901945.92 done in  330.232031107  seconds
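The queue-plus-threads approach in the run above can be sketched roughly like this; `submit` is a stand-in for whatever call actually pushes one indicator to CRITs:

```python
import queue
import threading


def upload_parallel(indicators, submit, num_workers=8):
    """Drain a work queue with a pool of threads, calling submit() per item."""
    work = queue.Queue()
    for ioc in indicators:
        work.put(ioc)

    def worker():
        while True:
            try:
                ioc = work.get_nowait()
            except queue.Empty:
                return  # queue drained; this thread is done
            submit(ioc)
            work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Since each submission is a blocking HTTP call, the threads mostly overlap network wait time, which is where the speedup over the single-threaded run comes from.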
alexcpsec commented 9 years ago

So maybe 5-ish mins for 5500 indicators? That is not too bad. As to Mike's point, who knows how much CRITs can handle. :)

LMK if you want to merge back when you think you are ready. We might tinker with it in the near future or so to try to add grequests to it.

paulpc commented 9 years ago

Will do. I'm testing with a few more indicators (a couple more blocklist.de files), then I'll clean up my code and submit it. I uploaded my current code to my branch.


paulpc commented 9 years ago

Turns out the more indicators, the bigger the speed gains:

Fetching inbound URLs
Fetching outbound URLs
Storing raw feeds in harvest.json
Loading raw feed data from harvest.json
Evaluating http://www.projecthoneypot.org/list_of_ips.php?rss=1
Parsing feed from http://www.projecthoneypot.org/list_of_ips.php?rss=1
Evaluating http://www.openbl.org/lists/base_30days.txt
Parsing feed from http://www.openbl.org/lists/base_30days.txt
Evaluating http://www.blocklist.de/lists/ssh.txt
Parsing feed from http://www.blocklist.de/lists/ssh.txt
Evaluating http://www.blocklist.de/lists/apache.txt
Parsing feed from http://www.blocklist.de/lists/apache.txt
Evaluating http://www.blocklist.de/lists/asterisk.txt
Parsing feed from http://www.blocklist.de/lists/asterisk.txt
Evaluating http://www.blocklist.de/lists/bots.txt
Parsing feed from http://www.blocklist.de/lists/bots.txt
Parsing feed from http://www.malwaregroup.com/ipaddresses
Parsing feed from http://malc0de.com/bl/IP_Blacklist.txt
Parsing feed from http://www.nothink.org/blacklist/blacklist_malware_dns.txt
Storing parsed data in crop.json
Reading processed data from crop.json
1413904907.94 *** trying single thread***
1413904907.94 reading configs
1413904907.94 going through list
-- omitted parsing issues for brevity -- 
successfully added 21444 IP addresses and 201 domains
1413907121.27 done in  2213.32997704  seconds
make sure you have the following sources in CRITs: [u'www.projecthoneypot.org', u'www.openbl.org', u'www.blocklist.de', u'www.malwaregroup.com', u'malc0de.com', u'www.nothink.org']
1413907121.27 *** trying multi thread***
1413907121.27 reading configs
1413907121.27 initializing queue
1413907121.33 starting threads
-- omitted parsing issues for brevity -- 
1413908147.52 done in  1026.25563312  seconds

I'll get everything ready and submit a pull request.

alexcpsec commented 9 years ago

Looks good. Thanks!

krmaxwell commented 9 years ago

So TL;DR: this is multithreading the submission to CRITs and possibly grabbing some additional data from the feeds?

alexcpsec commented 9 years ago

No extra info from feeds in this submission, just CRITs. But the original discussion was about the extra info. :)




paulpc commented 9 years ago

Sorry, @technoskald! The discussion got derailed into CRITs. We can get back to the metadata when I have time to code some more. I might wait and see what comes out of the labeled-feeds branch. Do you know if the config reader library will read regexes out of a config file as-is, or try to interpret / clobber them?
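For what it's worth, whether regexes survive depends on the config format. JSON passes patterns through untouched once the backslashes are JSON-escaped, which is what the source-definition format earlier in the thread relies on (configparser-style readers, by contrast, can interpret characters like `%` unless interpolation is disabled). A quick check:

```python
import json
import re

# Backslashes only need JSON escaping; json.loads does no further
# interpretation, so the compiled pattern behaves exactly as written.
conf_text = '{"format": "^(\\\\d+\\\\.\\\\d+\\\\.\\\\d+\\\\.\\\\d+)$"}'
conf = json.loads(conf_text)

pattern = re.compile(conf["format"])
```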

krmaxwell commented 9 years ago

OK, so this is just about CRITs? Cool then. :+1: