mlsecproject / combine

Tool to gather Threat Intelligence indicators from publicly available sources
https://www.mlsecproject.org/
GNU General Public License v3.0
652 stars 179 forks

Move over to plugins and more #121

Closed sooshie closed 9 years ago

sooshie commented 9 years ago

Plugins, code cleanup, and a move to key:value pairs for all information passed. Did away with the inbound and outbound files (all of that is now handled via plugins and the JSON docs). Added a clean-mx plugin to demonstrate the new HASH and URL types, which join the existing IPv4 and FQDN types. Also cleaned up enrichment (added DNS resolution) and expanded the CSV output (enriched).

I don't have a CRITs setup to test against, so I haven't touched that stuff, nor did I touch the tiq_output.

closes #23 closes #102 closes #101 closes #100 closes #79 closes #84 closes #63 closes #37

krmaxwell commented 9 years ago

Well, I know what I'm doing on my 3-day weekend! This looks great, thanks!

alexcpsec commented 9 years ago

It seems to be missing some prerequisites. What is this uniaccept thing about? A quick Google search points to this (https://github.com/icann/uniaccept-python) and suggests it should be part of dnspython.

Can you help?

$ ./combine.py
Traceback (most recent call last):
  File "./combine.py", line 10, in <module>
    from reaper import reap
  File "/Users/alexcp/src/combine/reaper.py", line 10, in <module>
    import uniaccept
ImportError: No module named uniaccept
alexcpsec commented 9 years ago

You mention it on the README file, but it is not clear how to install it.

alexcpsec commented 9 years ago

I added this to requirements.txt; it seemed to do the trick:

-e git+https://github.com/icann/uniaccept-python.git@2fd43061c729fdd834b93ee64ea33695266ddae0#egg=uniaccept-master
alexcpsec commented 9 years ago

@sooshie A minor annoyance is this "double logging" at baler and winnower:

2015-02-21 15:31:39,661 - combine.baler - INFO - Reading processed data from crop.json
[2015-02-21 23:31:39.661727] INFO: combine.baler: Reading processed data from crop.json
2015-02-21 15:31:46,640 - combine.baler - INFO - Output regular data as CSV to harvest.csv
[2015-02-21 23:31:46.641092] INFO: combine.baler: Output regular data as CSV to harvest.csv

I'm not familiar with the logbook package, so I'm not sure what is going on here.
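The two lines in each pair above have different formats (one stdlib-logging style, one logbook style), which suggests two logging systems are both handling the same records. A minimal stdlib-only sketch of the symptom and the usual fix; this is an illustration of the likely cause, not combine's actual logging setup:

```python
import logging
from io import StringIO

# A logger with two handlers attached emits every record twice, once per
# handler -- the same "double logging" seen above.
buf = StringIO()
log = logging.getLogger("combine.baler")
log.setLevel(logging.INFO)
log.addHandler(logging.StreamHandler(buf))
log.addHandler(logging.StreamHandler(buf))  # second handler -> duplicates

log.info("Reading processed data from crop.json")
lines = buf.getvalue().splitlines()
print(len(lines))  # the single .info() call produced two lines

# Fix: make sure only one handler (stdlib's or logbook's, not both) is active.
for extra in list(log.handlers)[1:]:
    log.removeHandler(extra)
buf.truncate(0)
buf.seek(0)
log.info("Output regular data as CSV to harvest.csv")
print(len(buf.getvalue().splitlines()))  # now one line per call
```

The first debugging step would be to check which handlers end up attached when both the stdlib `logging` module and logbook are imported.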

krmaxwell commented 9 years ago

Since uniaccept-python is BSD-licensed and no longer maintained, we might as well bring it directly into the repository (subject to license compliance, which will be straightforward for us).

alexcpsec commented 9 years ago

I am tracking this on branch sooshie-master

krmaxwell commented 9 years ago

Also: can you talk a little about replacing grequests with multiprocessing? I'm having some trouble on a test system with it.

krmaxwell commented 9 years ago

Specifically, hanging:

(venv)kmaxwell@leibniz:~/src/combine(sooshie-master*)$ python reaper.py
[2015-02-21 23:41:31.652238] INFO: reaper: Loading Plugins
[2015-02-21 23:41:31.685688] INFO: reaper: Processing: sans
[2015-02-21 23:41:31.692073] INFO: reaper: Processing: projecthoneypot
[2015-02-21 23:41:31.693348] INFO: reaper: Processing: malc0de
[2015-02-21 23:41:31.694537] INFO: reaper: Processing: clean-mx
[2015-02-21 23:41:31.695642] INFO: reaper: Processing: autoshun
[2015-02-21 23:41:31.696724] INFO: reaper: Processing: dragonresearchgroup
[2015-02-21 23:41:31.698627] INFO: reaper: Processing: nothink
[2015-02-21 23:41:31.702412] INFO: reaper: Processing: openbl
[2015-02-21 23:41:31.704070] INFO: reaper: Processing: botscout
[2015-02-21 23:41:31.705934] INFO: reaper: Processing: rulez
[2015-02-21 23:41:31.707801] INFO: reaper: Processing: spyeyetracker
[2015-02-21 23:41:31.710219] INFO: reaper: Processing: zeustracker
[2015-02-21 23:41:31.713404] INFO: reaper: Processing: packetmail
[2015-02-21 23:41:31.721279] INFO: reaper: Processing: malwaregroup
[2015-02-21 23:41:31.722663] INFO: reaper: Processing: ciarmy
[2015-02-21 23:41:31.724185] INFO: reaper: Processing: virbl
[2015-02-21 23:41:31.725990] INFO: reaper: Processing: alienvault
[2015-02-21 23:41:31.728221] INFO: reaper: Processing: palevotracker
[2015-02-21 23:41:31.732035] INFO: reaper: Processing: the-haleys
[2015-02-21 23:41:31.734558] INFO: reaper: Processing: blocklist.de
Process Process-38:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "reaper.py", line 24, in get_file
    r = requests.get(url)
  File "/home/kmaxwell/src/combine/venv/local/lib/python2.7/site-packages/requests/api.py", line 65, in get
    return request('get', url, **kwargs)
  File "/home/kmaxwell/src/combine/venv/local/lib/python2.7/site-packages/requests/api.py", line 49, in request
    response = session.request(method=method, url=url, **kwargs)
  File "/home/kmaxwell/src/combine/venv/local/lib/python2.7/site-packages/requests/sessions.py", line 461, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/kmaxwell/src/combine/venv/local/lib/python2.7/site-packages/requests/sessions.py", line 573, in send
    r = adapter.send(request, **kwargs)
  File "/home/kmaxwell/src/combine/venv/local/lib/python2.7/site-packages/requests/adapters.py", line 415, in send
    raise ConnectionError(err, request=request)
ConnectionError: ('Connection aborted.', gaierror(-2, 'Name or service not known'))
Process Process-32:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "reaper.py", line 24, in get_file
    r = requests.get(url)
  File "/home/kmaxwell/src/combine/venv/local/lib/python2.7/site-packages/requests/api.py", line 65, in get
    return request('get', url, **kwargs)
  File "/home/kmaxwell/src/combine/venv/local/lib/python2.7/site-packages/requests/api.py", line 49, in request
    response = session.request(method=method, url=url, **kwargs)
  File "/home/kmaxwell/src/combine/venv/local/lib/python2.7/site-packages/requests/sessions.py", line 461, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/kmaxwell/src/combine/venv/local/lib/python2.7/site-packages/requests/sessions.py", line 573, in send
    r = adapter.send(request, **kwargs)
  File "/home/kmaxwell/src/combine/venv/local/lib/python2.7/site-packages/requests/adapters.py", line 415, in send
    raise ConnectionError(err, request=request)
ConnectionError: ('Connection aborted.', gaierror(-2, 'Name or service not known'))
Process Process-37:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "reaper.py", line 24, in get_file
    r = requests.get(url)
  File "/home/kmaxwell/src/combine/venv/local/lib/python2.7/site-packages/requests/api.py", line 65, in get
    return request('get', url, **kwargs)
  File "/home/kmaxwell/src/combine/venv/local/lib/python2.7/site-packages/requests/api.py", line 49, in request
    response = session.request(method=method, url=url, **kwargs)
  File "/home/kmaxwell/src/combine/venv/local/lib/python2.7/site-packages/requests/sessions.py", line 461, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/kmaxwell/src/combine/venv/local/lib/python2.7/site-packages/requests/sessions.py", line 573, in send
    r = adapter.send(request, **kwargs)
  File "/home/kmaxwell/src/combine/venv/local/lib/python2.7/site-packages/requests/adapters.py", line 415, in send
    raise ConnectionError(err, request=request)
ConnectionError: ('Connection aborted.', gaierror(-2, 'Name or service not known'))
sooshie commented 9 years ago

Wasn't able to reproduce the hanging. I switched over because grequests doesn't play well with Python 2.7.9. I'm using version 2.5.1 of requests. I have seen it take a while, since no timeout is specified and it takes a few (~10) seconds to recombine all the results. Putting a timeout value in reaper is probably a decent idea. What versions of Python and requests are you using?

Never paid attention to the double logging, but it's an easy fix. I'll clean it up as well.

uniaccept is used for TLD validation; it seems to be a bit cleaner than a regex.

alexcpsec commented 9 years ago

Updated sooshie-master as well with the new commits.

I am using Python 2.7.8 with requests 2.5.1 and it works fine. I did not do a "fresh install" like in the install guide, so I would probably try that next.

I am OK with leaving uniaccept as it is.

krmaxwell commented 9 years ago

@alexcpsec: Do you want to keep pulling from their repo? I think it's easier to include the library in ours, and the license they use means no CLA-type thing is required.

I still need to get to the bottom of why reaper.py breaks for me in this branch.

alexcpsec commented 9 years ago

@krmaxwell I think not bundling it is "cleaner", but I would defer to whatever the standard is on Python.

In fact, I was thinking of unbundling the DNSDB code, so we could make sure users can update it independently.

@sooshie would you like to weigh in?

sooshie commented 9 years ago

I like the idea of not bundling. Perhaps we could fork it and keep a copy here?

Good call on the DNSDB code as well. One of the other things I'd like to do is do enrichment plugins as well.

krmaxwell commented 9 years ago

I like the idea of forking into our own repo and pulling from there (for both things).

alexcpsec commented 9 years ago

I forked them both, but the DNSDB one does not look like a package, so we may need to copy stuff anyway (or do some crazy sub-repo thing)

krmaxwell commented 9 years ago

The great thing about a repository of our own is that we can add code to it, including the code necessary to turn it into a package. :trollface:

krmaxwell commented 9 years ago

Even more broken for me. (Snipped irrelevancies.)

(venv)kmaxwell@newton:~/src/combine(sooshie-master)$ pip install -r requirements.txt 
Obtaining uniaccept-master from git+https://github.com/icann/uniaccept-python.git@2fd43061c729fdd834b93ee64ea33695266ddae0#egg=uniaccept-master (from -r requirements.txt (line 15))
  Cloning https://github.com/icann/uniaccept-python.git (to 2fd43061c729fdd834b93ee64ea33695266ddae0) to ./venv/src/uniaccept-master
  Could not find a tag or branch '2fd43061c729fdd834b93ee64ea33695266ddae0', assuming commit.
  Running setup.py (path:/home/kmaxwell/src/combine/venv/src/uniaccept-master/setup.py) egg_info for package uniaccept-master
    Traceback (most recent call last):
      File "<string>", line 17, in <module>
      File "/home/kmaxwell/src/combine/venv/src/uniaccept-master/setup.py", line 49, in <module>
        main()
      File "/home/kmaxwell/src/combine/venv/src/uniaccept-master/setup.py", line 20, in main
        from uniaccept import __version__
      File "uniaccept/__init__.py", line 1, in <module>
        from uniaccept.core import *
      File "uniaccept/core.py", line 50, in <module>
        import dns.resolver
    ImportError: No module named dns.resolver

----------------------------------------
Cleaning up...
Command python setup.py egg_info failed with error code 1 in /home/kmaxwell/src/combine/venv/src/uniaccept-master
Storing debug log for failure in /home/kmaxwell/.pip/pip.log
krmaxwell commented 9 years ago

This is odd, because dnspython provides it, and it's listed in requirements.txt before uniaccept. In fact, I even see in the logs:

Downloading/unpacking dnspython==1.12.0 (from -r requirements.txt (line 14))
  Downloading dnspython-1.12.0.zip (230kB): 230kB downloaded
  Running setup.py (path:/home/kmaxwell/src/combine/venv/build/dnspython/setup.py) egg_info for package dnspython

Even more oddly, if I manually run pip install dnspython and then re-run pip install -r requirements.txt, it works. And it's the same version, but this time it fully installs:

(venv)kmaxwell@newton:~/src/combine(sooshie-master)$ pip install dnspython
Downloading/unpacking dnspython
  Downloading dnspython-1.12.0.zip (230kB): 230kB downloaded
  Running setup.py (path:/home/kmaxwell/src/combine/venv/build/dnspython/setup.py) egg_info for package dnspython

Installing collected packages: dnspython
  Running setup.py install for dnspython

Successfully installed dnspython
Cleaning up...

Any ideas what's wrong with the setup? It looks like requirements.txt should work.

sooshie commented 9 years ago

I thought maybe it was the order of requirements.txt, but that doesn't make any sense. Really stupid idea time: what if it was moved higher in requirements.txt? I'm clearly grasping at straws.

krmaxwell commented 9 years ago

Tried moving it to the top, that didn't help.

This is dependency hell at its worst: pip won't actually install anything until it has built egg_info for everything, but uniaccept's egg_info won't build because its dependency hasn't been installed yet. I'm researching.

krmaxwell commented 9 years ago

For now, I've manually installed dnspython first, then run pip install, but we really need to figure this out before merging.

sooshie commented 9 years ago

Agreed, totally weird. Although it seems like more of a Python/pip issue than an inherent code issue. Either way, this is now running fine for me at $dayjob.

alexcpsec commented 9 years ago

Alternatives to uniaccept? What are we using it for? Could it be replaced by tldextract?

sooshie commented 9 years ago

It's used for TLD validation (instead of a regex, it downloads a list from ICANN), which seems a bit more accurate and easier to maintain long term. We could probably write our own wrapper around tldextract to do the extraction and then the verification. I just went the lazy route since I knew uniaccept worked.
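A minimal sketch of the kind of validation described above. In practice the valid-TLD set would be downloaded from IANA/ICANN and cached; the hardcoded set here is a stand-in for illustration, and a plain `rsplit` stands in for tldextract so the sketch stays self-contained:

```python
# Stand-in for the list uniaccept downloads from ICANN
# (https://data.iana.org/TLD/tlds-alpha-by-domain.txt).
VALID_TLDS = {"com", "net", "org", "de", "uk", "info"}

def has_valid_tld(fqdn):
    """Return True if the last label of fqdn is a known TLD."""
    fqdn = fqdn.rstrip(".").lower()
    if "." not in fqdn:
        return False  # bare hostname, no TLD to validate
    tld = fqdn.rsplit(".", 1)[1]
    return tld in VALID_TLDS

print(has_valid_tld("example.com"))      # True
print(has_valid_tld("example.invalid"))  # False
print(has_valid_tld("localhost"))        # False (no dot)
```

The list-based check is the reason this approach ages better than a regex: refreshing the cached TLD list tracks new gTLDs with no code changes.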

alexcpsec commented 9 years ago

An error happens on CRITs export. The way the data is handled in the output functions in baler needs to be updated to use the new data structure.

[2015-03-08 16:04:58.240750] INFO: thresher: Parsing feed from http://www.openbl.org/lists/base_30days.txt
[2015-03-08 16:04:58.271716] INFO: thresher: Storing parsed data in crop.json
[2015-03-08 16:05:32.525378] INFO: baler: Reading processed data from crop.json
Exception in thread Thread-10:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
    self.run()
  File "/usr/lib64/python2.7/threading.py", line 764, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/data/sooshie-master/baler.py", line 168, in bale_CRITs_indicator
    if indicator[1] == 'IPv4':
KeyError: 1
sooshie commented 9 years ago

Good call. I'll try to get it cleaned up, although I don't have a CRITs instance to test against.

krmaxwell commented 9 years ago

If @alexcpsec doesn't, you should talk to @sroberts. I know he has one.

sroberts commented 9 years ago

@sooshie It's actually pretty easy to get up and running at this point, the shell script is great for that.

sooshie commented 9 years ago

OK, got it fixed. I still haven't tested it against a CRITs instance (because $dayjob has other priorities currently), but it was an easy fix. The function was still using the CSV fields instead of the JSON I used for the re-plumbing. Somebody might check that I used source and reference correctly. Other than that, no errors on running it.
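The KeyError above is consistent with indexing the new dict records positionally (`indicator[1]`) as if they were CSV rows. A minimal sketch of the CSV-to-JSON fix; the field names (`indicator`, `type`, `source`, `reference`) are assumptions for illustration, not necessarily combine's actual schema:

```python
def bale_indicator(indicator):
    """Route one parsed record by type.

    Old CSV-style access was positional: indicator[1] == 'IPv4', which
    raises KeyError on a dict. The key:value records need named access.
    Field names here are illustrative assumptions.
    """
    kind = indicator.get('type')
    if kind == 'IPv4':
        return ('ip', indicator['indicator'])
    elif kind == 'FQDN':
        return ('domain', indicator['indicator'])
    return ('other', indicator.get('indicator'))

rec = {'indicator': '203.0.113.7', 'type': 'IPv4',
       'source': 'openbl',
       'reference': 'http://www.openbl.org/lists/base_30days.txt'}
print(bale_indicator(rec))  # ('ip', '203.0.113.7')
```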

sooshie commented 9 years ago

Don't forget my initial note (all the way at the top): I haven't touched the tiq part of the code and likely won't for the foreseeable future. But it looks like it shouldn't need it, since it relies on functions I've already fixed.

krmaxwell commented 9 years ago

There are two fixes to the dnspython/uniaccept issue, the "right" way and the "quick" way.

Right: Fix uniaccept so that it's a true package. I don't know if they'd accept a PR, but since it's BSD licensed we can do just about anything we need to do. This probably involves configuring Combine with setuptools as well.

Quick: Import our fork of uniaccept as a git subtree and update the install instructions in the README.

For now I have gone with the "quick" option so we can get this done. We really need to do it the "right" way at some point. But hey, technical debt is our friend!
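The "quick" git-subtree mechanics look roughly like the following; a local stand-in repository substitutes for our actual uniaccept fork (whose URL, prefix, and branch here are illustrative assumptions):

```shell
set -e
# Demonstrate vendoring a repo as a git subtree with a local stand-in;
# in practice the source would be our fork of icann/uniaccept-python.
work=$(mktemp -d)

# Stand-in for the uniaccept fork:
git init -q "$work/uniaccept-python"
cd "$work/uniaccept-python"
git config user.email you@example.com && git config user.name you
echo "def verifytld(): pass" > core.py
git add core.py && git commit -qm "import uniaccept"
branch=$(git symbolic-ref --short HEAD)

# Stand-in for the combine repo, vendoring the fork under lib/uniaccept:
git init -q "$work/combine"
cd "$work/combine"
git config user.email you@example.com && git config user.name you
echo combine > README.md && git add README.md && git commit -qm "init"
git subtree add --prefix=lib/uniaccept "$work/uniaccept-python" "$branch" --squash

ls lib/uniaccept   # core.py is now tracked inside combine's own tree
```

Later, `git subtree pull --prefix=lib/uniaccept <fork-url> <branch> --squash` would pull upstream changes into the vendored copy.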

krmaxwell commented 9 years ago

techdebt from @alexcpsec

alexcpsec commented 9 years ago

Another one for the list. It seems the worker does not finish processing when there is an ERROR, and then the program just waits forever.

[2015-03-10 23:12:10.017235] DEBUG: reaper: Added: http://reputation.alienvault.com/reputation.data
[2015-03-10 23:12:17.232505] ERROR: reaper: Requests Error: HTTPConnectionPool(host='support.clean-mx.de', port=80): Read timed out. (read timeout=7.0)

^CTraceback (most recent call last):
  File "combine.py", line 40, in <module>
    reap('harvest.json')
  File "/Users/alexcp/src/combine/reaper.py", line 80, in reap
    responses = [q.get() for q in queues]
  File "/usr/local/Cellar/python/2.7.8_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/queues.py", line 117, in get
    res = self._recv()
KeyboardInterrupt
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/local/Cellar/python/2.7.8_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/usr/local/Cellar/python/2.7.8_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/util.py", line 325, in _exit_function
    p.join()
  File "/usr/local/Cellar/python/2.7.8_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 145, in join
    res = self._popen.wait(timeout)
  File "/usr/local/Cellar/python/2.7.8_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/forking.py", line 154, in wait
    return self.poll(0)
  File "/usr/local/Cellar/python/2.7.8_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/forking.py", line 135, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
krmaxwell commented 9 years ago

OK, good, it's not just me then. (I let it run for over 45 minutes :( )

We can also disable PalevoTracker (404s) and SpyeyeTracker (no longer active).

alexcpsec commented 9 years ago

Typo on Palevo. Should be blocklists.php (for some reason).

SpyEye is dead, can be removed.

krmaxwell commented 9 years ago

(venv)kmaxwell@leibniz:~/src/combine(sooshie-master)$ python reaper.py
[2015-03-11 03:16:21.246795] INFO: reaper: Loading Plugins
[2015-03-11 03:16:21.285367] INFO: reaper: Processing: sans
[snip]
[2015-03-11 03:16:33.723057] ERROR: reaper: Requests Error: HTTPConnectionPool(host='support.clean-mx.de', port=80): Read timed out. (read timeout=7.0)
Process Process-26:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "reaper.py", line 32, in get_file
    q.task_done()
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 330, in task_done
    raise ValueError('task_done() called too many times')
ValueError: task_done() called too many times
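"task_done() called too many times" usually means the same queue item is acknowledged on more than one code path, for example in both the normal path and the error path. A minimal sketch of the pattern, using the threading queue for illustration (the task_done contract is the same as multiprocessing's JoinableQueue): call task_done() exactly once per get(), in a finally block.

```python
import queue
import threading

def worker(q, results):
    while True:
        url = q.get()
        try:
            if url is None:          # shutdown sentinel
                break
            if "bad" in url:         # stand-in for a requests timeout
                raise ConnectionError("Read timed out")
            results.append(url)
        except Exception:
            pass                     # log and move on; do NOT task_done here
        finally:
            q.task_done()            # exactly one ack per get(), every path

q = queue.Queue()
results = []
t = threading.Thread(target=worker, args=(q, results))
t.start()
for url in ["http://a.example/x", "http://bad.example/y", None]:
    q.put(url)
q.join()   # returns, because every item was acknowledged exactly once
t.join()
print(results)  # ['http://a.example/x']
```

If task_done() were also called inside the except branch, the failing feed would be acknowledged twice and raise exactly the ValueError above.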
krmaxwell commented 9 years ago

FWIW this runs successfully for me now (see the sooshie-master branch in this repo). There are a few minor things to tweak here but we're just about there.

krmaxwell commented 9 years ago

OK, this is in dev now and we should continue to iterate on issues from there. We will document those separately.