mlsecproject / combine

Tool to gather Threat Intelligence indicators from publicly available sources
https://www.mlsecproject.org/
GNU General Public License v3.0
653 stars 179 forks source link

Speed up enrichment #42

Closed krmaxwell closed 9 years ago

krmaxwell commented 10 years ago

We perform this process each time an indicator appears rather than using the data for the same core indicator from earlier in the same run. In other works, if four different feeds each list 8.8.8.8, we perform that lookup four times. This is inefficient.

Instead, we should create a list of all the unique indicators in the current dataset, enrich those, and then map back to the original data.

krmaxwell commented 10 years ago

Side note: this especially matters for the DNSDB type enrichment because of the latency involved in network lookups.

alexcpsec commented 10 years ago

There is probably a good "dict-like" implementation for this. In R, I have a fancy structure that does these lookups very very fast (for ASN ranges), so we should be able to find something similar for Python

alexcpsec commented 10 years ago

And by "I have a fancy structure", I mean I use http://datatable.r-forge.r-project.org/

krmaxwell commented 9 years ago

Leaving aside the DNSDB stuff, I profiled the existing enrichment.

(venv)kmaxwell@newton:~/src/combine$ python -m cProfile -s cumulative winnower.py
2014-12-23 19:46:18,537 - combine.winnower - INFO - Enriching IPv4 indicators: TRUE
2014-12-23 19:46:18,537 - combine.winnower - INFO - Enriching DNS indicators: FALSE
2014-12-23 19:46:18,537 - combine.winnower - INFO - Setting up DNSDB client
HTTP Error 403: forbidden
2014-12-23 19:46:19,271 - combine.winnower - INFO - Invalid DNSDB configuration found
2014-12-23 19:46:20,081 - combine.winnower - INFO - Loading GeoIP data
2014-12-23 19:46:24,819 - combine.winnower - INFO - Beginning winnowing process
2014-12-23 19:46:24,820 - combine.winnower - INFO - Enriching <SNIPPED FOR BREVITY>
^C         116947266 function calls (116946322 primitive calls) in 47.539 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001   47.539   47.539 winnower.py:2(<module>)
        1    0.004    0.004   47.509   47.509 winnower.py:104(winnow)
      313    0.002    0.000   41.064    0.131 winnower.py:54(enrich_IPv4)
      313   10.432    0.033   40.989    0.131 winnower.py:33(org_by_addr)
 36343752   25.172    0.000   30.555    0.000 __init__.py:1364(__contains__)
 74007381    5.641    0.000    5.641    0.000 {isinstance}

This makes it pretty clear to me that the real problem here lies in org_by_addr(), which is the AS lookup. That makes sense to me because this code is terrible and we should immediately :gun: whoever wrote it -- wait, belay that.

Anyway, the code in question is in fact an O(n) search through a list of those nasty IPRange data structures. Using a proper data structure here should take us to something much more O(1)-like.

alexcpsec commented 9 years ago

I have something on my R codebase that is O(log n), but I am sure there should be a magical dict that should help with the range searches.

alexcpsec commented 9 years ago

Fixed by #103