mlsecproject / combine

Tool to gather Threat Intelligence indicators from publicly available sources
https://www.mlsecproject.org/
GNU General Public License v3.0

Correct handling of "source" #63

Closed alexcpsec closed 9 years ago

alexcpsec commented 10 years ago

Today, the "source" field holds the URL from which the indicator was gathered.

According to the docs (and in my opinion :P), it should be an identifying string that describes the source, and that string should be documented on the Wiki. It bothers me because I cannot match these sources up with the data we provided for the tiq-test samples, so it is an enhancement and a bug at the same time...

Perhaps the thresher_map should be the place for that, or somewhere equivalent in the plugin system from #23. Is there a short-term solution to this that does not require waiting for the plugin refactoring?

krmaxwell commented 10 years ago

(not ignoring - this one requires :thought_balloon: )

krmaxwell commented 10 years ago

Per @alexcpsec - turn inbound_urls.txt and outbound_urls.txt into proper config files, mapping each config name string to its URL.

gbrindisi commented 10 years ago

I've played a little with the config files and produced some proof-of-concept code in a local branch to address this issue.

Basically I've added the feeds to the config file, like:

[feeds.outbound]
feed_o_label1 = feed_url1
...
...
feed_o_labelN = feed_urlN

[feeds.inbound]
feed_i_label1 = feed_url1
...
...
feed_i_labelN = feed_urlN

Then reaper.py reads the feeds from the config file (sections feeds.outbound and feeds.inbound) and stores the harvested results by label (e.g. feed_o_label1).
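
As a rough illustration of the step described above (not the code from the labeled-feeds branch), the sketch below reads the two feed sections and keys the harvested results by label rather than by URL. It assumes Python 3's configparser; the function names load_feeds and harvest, the fetch callback, and the example.com URLs are hypothetical.

```python
import configparser

# Hypothetical sample mirroring the layout proposed above.
SAMPLE_CONFIG = """
[feeds.outbound]
feed_o_label1 = http://example.com/outbound.txt

[feeds.inbound]
feed_i_label1 = http://example.com/inbound.txt
"""


def load_feeds(config_text):
    """Return {direction: {label: url}} for feeds.inbound and feeds.outbound."""
    # Split keys from values on '=' only, so the ':' in a URL is never taken
    # as a delimiter; disable interpolation so '%' in a URL is left alone.
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.read_string(config_text)
    feeds = {}
    for direction in ("inbound", "outbound"):
        section = "feeds." + direction
        if parser.has_section(section):
            feeds[direction] = dict(parser.items(section))
        else:
            feeds[direction] = {}
    return feeds


def harvest(feeds, fetch):
    """Call fetch(url) for every feed and store the result under its label."""
    results = {}
    for direction, labelled in feeds.items():
        results[direction] = {}
        for label, url in labelled.items():
            results[direction][label] = fetch(url)
    return results
```

Passing the download function in as fetch keeps the sketch independent of any HTTP library; a real reaper would supply its own.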

Next I've improved thresher.py as well, so that it reads the associated parser function from the config file.

For example, in the config the user can now define the preferred parser function like so:

[feeds.parsers]
feed_whatever = whatever_parser

whatever_parser() is then used to parse the results labeled as feed_whatever.
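
One way the label-to-parser lookup could work is sketched below. This is an assumption about the mechanism, not the branch's actual code: it uses an explicit registry dict (PARSERS) to resolve the configured name, and the body of whatever_parser is a made-up placeholder. Python 3's configparser is assumed.

```python
import configparser


def whatever_parser(raw):
    """Hypothetical parser: one indicator per non-empty, non-comment line."""
    indicators = []
    for line in raw.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            indicators.append(stripped)
    return indicators


# Explicit registry of allowed parser functions (an assumption of this
# sketch). Only names listed here can be selected from the config file.
PARSERS = {"whatever_parser": whatever_parser}


def parser_for(label, config):
    """Look up the parser function configured for a feed label."""
    name = config.get("feeds.parsers", label, fallback=None)
    if name is None:
        raise KeyError("no parser configured for feed %r" % label)
    if name not in PARSERS:
        raise KeyError("unknown parser %r for feed %r" % (name, label))
    return PARSERS[name]
```

A registry avoids resolving arbitrary names from a config file, and a later plugin system could fill the same dict from other modules.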

This behaviour should be a good starting point for implementing a plugin system in which the parsers are read from other modules.

Please let me know if you like this approach. You can find the code at https://github.com/gbrindisi/combine/tree/labeled-feeds

Hopefully I'll be able to tidy up the code a bit more tomorrow.

alexcpsec commented 10 years ago

I was thinking about this today while I was working to merge your stuff and I share your thoughts on this.

I'll have a look at your stuff tonight and comment on some other suggestions.

krmaxwell commented 10 years ago

I like the idea of paving the way for plugins. The feed_i_whatever syntax feels a little ugly to me for some reason. Need to brain on this a bit.

gbrindisi commented 10 years ago

Just to give some perspective, this is the test configuration I've used (most of the entries are commented out):

[feeds.outbound]
#malwaregroup     = http://www.malwaregroup.com/ipaddresses
#malc0de          = http://malc0de.com/bl/IP_Blacklist.txt
#zeustracker      = https://zeustracker.abuse.ch/blocklist.php?download=ipblocklist
#spyeyetracker    = https://spyeyetracker.abuse.ch/blocklist.php?download=ipblocklist
#palevotracker    = https://palevotracker.abuse.ch/blocklists.php?download=ipblocklist
alienvault       = http://reputation.alienvault.com/reputation.data
#nothink-malware-dns = http://www.nothink.org/blacklist/blacklist_malware_dns.txt
#nothink-malware-http = http://www.nothink.org/blacklist/blacklist_malware_http.txt
#nothink-malware-irc = http://www.nothink.org/blacklist/blacklist_malware_irc.txt

[feeds.inbound]
#projecthoneypot = http://www.projecthoneypot.org/list_of_ips.php?rss=1

[feeds.parsers]
alienvault = process_alienvault

Also, I've used the feeds.XXX naming scheme for the config sections to see if it was feasible to abstract the standard inbound/outbound categorization. It's just an idea though.
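
To make the feeds.XXX idea concrete, here is a minimal sketch that discovers the categories from the section names instead of hard-coding inbound and outbound. It assumes Python 3's configparser and treats feeds.parsers as the one reserved section; the function name feed_categories is hypothetical. The sample is a shortened copy of the test configuration above.

```python
import configparser

# Shortened copy of the test configuration from this comment.
TEST_CONFIG = """
[feeds.outbound]
#zeustracker      = https://zeustracker.abuse.ch/blocklist.php?download=ipblocklist
alienvault       = http://reputation.alienvault.com/reputation.data

[feeds.inbound]
#projecthoneypot = http://www.projecthoneypot.org/list_of_ips.php?rss=1

[feeds.parsers]
alienvault = process_alienvault
"""


def feed_categories(config):
    """Map every feeds.<category> section to its {label: url} entries."""
    categories = {}
    for section in config.sections():
        prefix, _, category = section.partition(".")
        # feeds.parsers maps labels to parser names, not URLs, so skip it.
        if prefix == "feeds" and category and category != "parsers":
            categories[category] = dict(config.items(section))
    return categories


config = configparser.ConfigParser(delimiters=("=",), interpolation=None)
config.read_string(TEST_CONFIG)
categories = feed_categories(config)
```

Lines starting with # are skipped by configparser's default comment handling, so the commented-out feeds do not appear in the result and feeds.inbound comes back as an empty category.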

krmaxwell commented 10 years ago

I like having that confined to the categories / sections much better than having the names. But I am reminded that there are two hard things in computer science: cache invalidation, naming things, and off-by-one errors.

gbrindisi commented 10 years ago

I've made a pull request to discuss it better: https://github.com/mlsecproject/combine/pull/86