Labeled feeds - Githubissues

gbrindisi commented 9 years ago

Hi! I've managed to tidy up the code a bit.

I'm aware that the dot-based nomenclature I've chosen is not so pretty but I still find useful the overall functionality. Anyhow check it out and let me know if you would like something different.

The config I've used is the following:

[feeds.outbound]
#malwaregroup     = http://www.malwaregroup.com/ipaddresses
#malc0de          = http://malc0de.com/bl/IP_Blacklist.txt
#zeustracker      = https://zeustracker.abuse.ch/blocklist.php?download=ipblocklist
#spyeyetracker    = https://spyeyetracker.abuse.ch/blocklist.php?download=ipblocklist
#palevotracker    = https://palevotracker.abuse.ch/blocklists.#php?download=ipblocklist
alienvault       = http://reputation.alienvault.com/reputation.data
#nothink-malware-dns = http://www.nothink.org/blacklist/blacklist_malware_dns.txt
#nothink-malware-http = http://www.nothink.org/blacklist/blacklist_malware_http.txt
#nothink-malware-irc = http://www.nothink.org/blacklist/blacklist_malware_irc.txt

[feeds.inbound]
#projecthoneypot = http://www.projecthoneypot.org/list_of_ips.php?rss=1
#openbl = http://www.openbl.org/lists/base_30days.txt
#blocklist-ssh = http://www.blocklist.de/lists/ssh.txt
#blocklist-apache = http://www.blocklist.de/lists/apache.txt
#blocklist-asterisk = http://www.blocklist.de/lists/asterisk.txt
#blocklist-bots = http://www.blocklist.de/lists/bots.txt
#blocklist-courierimap = http://www.blocklist.de/lists/courierimap.txt
#blocklist-courierpop3 = http://www.blocklist.de/lists/courierpop3.txt
#blocklist-email = http://www.blocklist.de/lists/email.txt
#blocklist-ftp = http://www.blocklist.de/lists/ftp.txt
#blocklist-imap = http://www.blocklist.de/lists/imap.txt
#blocklist-ircbot = http://www.blocklist.de/lists/ircbot.txt
#blocklist-pop3 = http://www.blocklist.de/lists/pop3.txt
#blocklist-postfix = http://www.blocklist.de/lists/postfix.txt
#blocklist-proftpd = http://www.blocklist.de/lists/proftpd.txt
#blocklist-sip = http://www.blocklist.de/lists/sip.txt
#ciarmy = http://www.ciarmy.com/list/ci-badguys.txt
alienvault-inbound = http://reputation.alienvault.com/reputation.data
#drg-ssh = http://dragonresearchgroup.org/insight/sshpwauth.txt
#drg-vnc = http://dragonresearchgroup.org/insight/vncprobe.txt
#rulez = http://danger.rulez.sk/projects/bruteforceblocker/blist.php
#sans = https://isc.sans.edu/ipsascii.html
#nothink-ssh = http://www.nothink.org/blacklist/blacklist_ssh_day.txt
#packetmail = https://www.packetmail.net/iprep.txt
#autoshun = http://www.autoshun.org/files/shunlist.csv
#haleys = http://charles.the-haleys.org/ssh_dico_attack_hdeny_format.#php/hostsdeny#.txt
#virbl = http://virbl.org/download/virbl.dnsbl.bit.nl.txt
#botscout = http://botscout.com/last_caught_cache.htm

[feeds.parsers]
alienvault = process_alienvault
alienvault-inbound = process_alienvault
projecthoneypot = process_project_honeypot
rulez = process_rulez
sans = process_sans
packetmail = process_packetmail
autoshun = process_autoshun
haleys = process_haleys
drg-ssh = process_drg
drg-vnc = process_drg
malwaregroup = process_malwaregroup

krmaxwell commented 9 years ago

I have pulled this over into the gbrindisi-labeled-feeds branch and fixed the merge conflicts. Will try to test tonight.

krmaxwell commented 9 years ago

Alternately @gbrindisi if you can pull the current master into yours, the conflict is pretty easy to fix and it will update this PR.

gbrindisi commented 9 years ago

Done! ...I'm not totally sure I did it right :goat:

alexcpsec commented 9 years ago

So, I am not ignoring this, but I was thinking that this is an opportunity to begin to address the extra fields that would be required from a more robust TI feeds parsing engine (such as confidence, campaign, other notes).

I have some work to do around defining that on some other things I am working on the internal parts of the non-open source code I have, so I'd like to propose a direction for us to move forward by early next week.

@gbrindisi I think this is a step in the right direction, but I want to make sure we do not code ourselves into a corner :)

gbrindisi commented 9 years ago

Ok I understand :) Let me know what you decide and if you and @technoskald want I can help with the coding.

Feel free to mail or message me on slack (I'm lurking daily btw ;)).

krmaxwell commented 9 years ago

Just a quick poke here to see where we are on this :)

paulpc commented 9 years ago

I can help out with this - in the past, i did something like this for feed parsing, and had the engine just read the line regex from the conf file. I also used crits-specific indicator names, but that's obviously just semantics and easily changed:

[
{
  "impact": "high", 
  "source": "malwareDomainList",
  "campaign":"testCampaign", 
  "confidence": "medium", 
  "format": "^\\\".*\\\"\\,\\\"(.*?)\\\"\\,\\\"(\\d+\\.\\d+\\.\\d+\\.\\d+|-)\\\"\\,\\\"(.*?)\\\"\\,\\\".*?\\\"\\,\\\".*?\\\"\\,\\\"(\\d+|-)\\\"", 
  "reference": "http://www.malwaredomainlist.com/updatescsv.php", 
  "fields": ["URI - URL", "Address - ipv4-addr", "URI - Domain Name","Address - asn"] 
},
{
  "impact": "medium",
  "confidence": "medium",
  "campaign":"testCampaign",
  "format": "(.+)",
  "reference": "https://zeustracker.abuse.ch/blocklist.php?download=compromised", 
  "fields": ["URI - URL"], 
  "source": "ZeusTracker"
}
]

If you guys want to go somewhere like this, i can create a branch and see if I can bastardize the code to allow for this. @gbrindisi, where did you want to put the custom feed parsers? The conf can stay in the standard conf format, it doesn't have to be changed to json (i just did json at the time)

alexcpsec commented 9 years ago

I know this is from October, and I told I would think about this, but believe it or not I have not finished thinking about it yet. :confused:

I am trying to align this with some other ideas for projects I am entertaining right now (including something for presentations on BlackHat and DefCon 2015). I will have a "recommendation" you you guys to give input on before the end of the holidays. :santa: :christmas_tree:

paulpc commented 9 years ago

That sounds a lot like 'here, you guys write code for my blackhat prez.'

We should probably think about relationships between indicators, contextual information, bla bla bla. Maybe looking at them from a STIX/CYBOX standpoint would help, as long as we can generate the relationships between indicators (e.g. not just everything is connected to the ZeusTracker feed, but x.x.x.x was seen by zeustracker in conjunction with www.pornmalware.com, which was seen by alienware along with y.y.y.y, and the c2 communication was using ZZZ user agent)

alexcpsec commented 9 years ago

@paulpc I just realized that did sound wrong. That was not what I meant, and I am sorry if it sounded that way. I am aiming for a minimum set of fields and parameters that would give anyone analyzing the data the ability pivot and aggregate the data in multiple ways. Having this kind of flexibility would help me on some of the things I want to work on and I am sure would help others as well.

So, I think a bare minimum would be:

Source: feed or reference where the indicator came from
type: IPv4, FQDN, user-agent, MD5 hash, etc, etc
category: name of the malware / exploit kit / dropper family, if available (ideally respecting different versions having different names)
campaign: public campaign or private org incident related to the indicator, if available
impact & confidence: could be read from the feeds or configured as a default on the configuration for the feeds (as in I think feed XXX is a "medium" confidence, regardless what they say)

I am not sure how you would do the relationship matching without a storage back-end such as CRITs, so maybe that is what you mean. I want to make sure we are feeding something such as CRITs enough data so what you described is possible to be done there with queries on different fields.

What are other fields you would like to see in this?

paulpc commented 9 years ago

@alexcpsec, why constrain ourselves to a few indicator types? We could use the openIOC (http://schemas.mandiant.com/) or the STIX(http://stixproject.github.io/documentation/idioms/) dictionaries. We don't have to go all out and output STIX or openIOC XML, but we don't need to reinvent the language either and should provide an easy conversion mechanism (maybe if we wrote a STIX output function, it would help materialize the ideas, but i realize it's obscene scope creep).

As for relationship matching, it would obviously be easy to do it in a backend system a la CRITs, Avalanche / Soltra Edge, commercial threatX. But since we're ingesting a bunch of OS-INT feeds in combine, we could do a very rudimentary relationship model here based on what we see in which feeds, any common elements, and any metadata present with those feeds. I had ideas to do some post hoc relationship building in CRITs, but you would lose the point-in-time aspect.

For example, malwaredomainslist, has some registrar info, domain, ip, ASN. Alienware has category and some confidence information. We can maybe connect all those before inserting them in an intelligent analysis system (CRITs) to help further analysis and documentation of point-in-time relationships.

This is all a moot point if all the end-user is doing is putting the IPs in a firewall blocklist.

krmaxwell commented 9 years ago

First, I don't think correlation / relationship matching is in scope for Combine. It grabs the data, does minimal normalization, then outputs it in forms other stuff can consume. And we've definitely always considered the STIX stack to be on the roadmap; see issue #33 for example.

The use case here, as I understand it, is to grab a bit of extra metadata and make that available to users. Some of them use STIX or OpenIOC and some don't (although anyone putting it in a firewall blocklist is omg doing it wrong). For this issue, we should probably just make sure we consider the metadata we should grab and add that to the data model. The list from @alexcpsec above is a good start.

alexcpsec commented 9 years ago

Here is what I am thinking:

Immediate action:

[ ] Define a minimal schema that makes use of varied metadata from the feeds - this is a no brainer, we already discussed this at #84 and it is hinted at #23 way back on the first version
[ ] Extract the additional metadata from the feeds - again referencing #84 , and @paulpc 's suggestion of the regex and capture points is good, but I'd rather have something more involved (I mean actual Python functions) if the extraction is easier with that. I think a proof of concept with a few feeds for that idea would be awesome. Also, check out the work that @btv is suggesting at #98 to clean up the parsing.

I would not, however, extract data that we could optionally enrich later on winnower (such as ASN). I appreciate that the enrichment code could be faster as it is now, but that is what that is for.

The point-in-time aspect could be mitigated by having 2 different timestamps on the entries:

A timestamp extracted from the feed (if the feed says when the indicator entered)
A timestamp from when the feed was scraped.

I'd try to stay away from the relationship matching or correlation just now, at least until these more immediate things can be done. Combine is not trying to be fully fledged TI Platform, and maybe these would be functionalities for a separate project ("hay silo", anyone?). Or even leave this to CRITs, MISP or even CIF.

What do you people think?

paulpc commented 9 years ago

I understand the correlation and relationships might be out of the purview. The ASN example wasn't meant as a replacement for the current utility, but just another datapoint, and since Combine is not a TI platform, no reason to worry about my point-in-time complications.

The reason why i went with regex is b/c i can do meta programming in the config file - for new feeds i don't have to change my code to allow for column names, mapping to existing columns, et cetera, but in stead I can define everything in the regex and fields. Downside (and it's a pretty big one) is that i am expecting normalized input and I am doing minimum error checking.

@btv's idea is awesome! To take it a bit further, since JSON is our bread and butter for transporting observables between the modules, why not doing something like csv.DictReader for the csv-formatted fields and get a dictionary object out of it automatically. It would be parsed by the csv library, so, it might do some of the heavy loading. Unfortunately, I thing it will drive us further into conversations about feed-format-specific parsers and complicating the retrieving and parsing algorithms.

It seems we're heading towards creating/adopting a normalized observable-descriptive language and coming up with transform tables from all the feeds ingested. If so, JSON / XML / CSV output would be pretty trivial regardless of how complicated we decide to get with the metadata.

alexcpsec commented 9 years ago

It seems we're heading towards creating/adopting a normalized observable-descriptive language and coming up with transform tables from all the feeds ingested. If so, JSON / XML / CSV output would be pretty trivial regardless of how complicated we decide to get with the metadata.

Yes, I think we should get this done right first before making the scope bigger.

gbrindisi commented 9 years ago

@paulpc

 @gbrindisi, where did you want to put the custom feed parsers?

The parsers are just functions in thresher.py so they could be just moved outside in a seprate module to allow easy customization.

paulpc commented 9 years ago

Gianluca, I poked around and got something similar on my version. I'll put in a pr On Jan 8, 2015 10:00 AM, "Gianluca Brindisi" notifications@github.com wrote:

@paulpc https://github.com/paulpc

@gbrindisi, where did you want to put the custom feed parsers?

The parsers are just functions in thresher.py so they could be just moved outside in a seprate module to allow easy customization.

— Reply to this email directly or view it on GitHub https://github.com/mlsecproject/combine/pull/86#issuecomment-69200287.

krmaxwell commented 9 years ago

That would be super cool and relevant to #23

alexcpsec commented 9 years ago

I am closing this PR unmerged and we should focus on #110 that has a more complete implementation of these ideas we started discussing here.

I really want to thank @gbrindisi for kicking off this here and providing us with the cornerstone for this discussion. Please help us with #110 as well. :)

gbrindisi commented 9 years ago

@alexcpsec you are welcome! :)

mlsecproject / combine

Labeled feeds #86