Closed krmaxwell closed 9 years ago
Side note: this especially matters for the DNSDB type enrichment because of the latency involved in network lookups.
There is probably a good "dict-like" implementation for this. In R, I have a fancy structure that does these lookups very very fast (for ASN ranges), so we should be able to find something similar for Python
And by "I have a fancy structure", I mean I use http://datatable.r-forge.r-project.org/
Leaving aside the DNSDB stuff, I profiled the existing enrichment.
(venv)kmaxwell@newton:~/src/combine$ python -m cProfile -s cumulative winnower.py
2014-12-23 19:46:18,537 - combine.winnower - INFO - Enriching IPv4 indicators: TRUE
2014-12-23 19:46:18,537 - combine.winnower - INFO - Enriching DNS indicators: FALSE
2014-12-23 19:46:18,537 - combine.winnower - INFO - Setting up DNSDB client
HTTP Error 403: forbidden
2014-12-23 19:46:19,271 - combine.winnower - INFO - Invalid DNSDB configuration found
2014-12-23 19:46:20,081 - combine.winnower - INFO - Loading GeoIP data
2014-12-23 19:46:24,819 - combine.winnower - INFO - Beginning winnowing process
2014-12-23 19:46:24,820 - combine.winnower - INFO - Enriching <SNIPPED FOR BREVITY>
^C 116947266 function calls (116946322 primitive calls) in 47.539 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.001 0.001 47.539 47.539 winnower.py:2(<module>)
1 0.004 0.004 47.509 47.509 winnower.py:104(winnow)
313 0.002 0.000 41.064 0.131 winnower.py:54(enrich_IPv4)
313 10.432 0.033 40.989 0.131 winnower.py:33(org_by_addr)
36343752 25.172 0.000 30.555 0.000 __init__.py:1364(__contains__)
74007381 5.641 0.000 5.641 0.000 {isinstance}
This makes it pretty clear to me that the real problem here lies in org_by_addr()
, which is the AS lookup. That makes sense to me because this code is terrible and we should immediately :gun: whoever wrote it -- wait, belay that.
Anyway, the code in question is in fact an O(n)
search through a list of those nasty IPRange data structures. Using a proper data structure here should take us to something much more O(1)
-like.
I have something on my R codebase that is O(log n)
, but I am sure there should be a magical dict that should help with the range searches.
Fixed by #103
We perform this process each time an indicator appears rather than using the data for the same core indicator from earlier in the same run. In other works, if four different feeds each list 8.8.8.8, we perform that lookup four times. This is inefficient.
Instead, we should create a list of all the unique indicators in the current dataset, enrich those, and then map back to the original data.