Open ollipasanen opened 4 years ago
Yup! You're completely correct here. It could be optimized in various ways. The current version is a direct conversion of a Python project with very similar (but much, much slower) functionality and a just-as-naive lookup strategy (iterating through the entire list with no particular optimizations).
The current memory footprint is some tens of megabytes and the load time is extremely fast. I'm very much OK with spending that amount of memory and hitting disk just once on load. Spending some more memory is fine too.
There's currently a big imbalance between the sizes of the input files, so the resulting data structures and lookups don't benefit much from this kind of naive parallelization.
In commit 29ec5f8826508724b8dce05292e92b11c63ab8d1 I started work on splitting IPs and nets so lookups can be made faster, because treating everything as a net is slower and the more specific data types are easier to optimize. Should continue on that. =)
This happens to be my first attempt at writing a Rust program, so all improvements are very much welcome (also to anything other than better lookup algorithms). Pull requests welcome too, naturally.
Speed up improvement thoughts and ideas
I ran the benchmark and the query speed currently seems to be about 250 queries/s. That's not super slow, but querying big lists (10k+ entries) already becomes a bit sluggish. I have some ideas for improving performance.
There are four main types of queries in the code:
The first one (exact IP lookups) is currently done by comparing the query IP to all known IPs in all lists in parallel, which is O(n). You could pre-sort the ipsets when updating the repo (treating IPs as plain unsigned ints) and then binary search them in O(log n) without increasing the memory footprint. Considering how big some of the lists are, this might be a very good addition.
With a disk kv-storage this query would be ~O(1) (see below).
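A minimal sketch of the sorted-ipset binary search (the function names here are mine, not from the repo):

```rust
use std::net::Ipv4Addr;

/// Build the lookup structure once, when the lists are (re)loaded:
/// treat each IPv4 address as a u32 and sort.
fn build_sorted_ipset(ips: &[Ipv4Addr]) -> Vec<u32> {
    let mut sorted: Vec<u32> = ips.iter().copied().map(u32::from).collect();
    sorted.sort_unstable();
    sorted
}

/// O(log n) membership test instead of scanning the whole list.
fn ipset_contains(sorted: &[u32], ip: Ipv4Addr) -> bool {
    sorted.binary_search(&u32::from(ip)).is_ok()
}

fn main() {
    let ips = vec![
        Ipv4Addr::new(10, 0, 0, 1),
        Ipv4Addr::new(192, 0, 2, 7),
        Ipv4Addr::new(198, 51, 100, 23),
    ];
    let sorted = build_sorted_ipset(&ips);
    assert!(ipset_contains(&sorted, Ipv4Addr::new(192, 0, 2, 7)));
    assert!(!ipset_contains(&sorted, Ipv4Addr::new(192, 0, 2, 8)));
}
```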
The second type (checking whether an IP falls inside any block in the netsets) is also O(n): all netsets are scanned to see whether the IP is contained in any range. This could be answered in constant time from a hashmap, but since hashmaps have a large memory footprint, a disk key-value storage would probably be needed. Iterating over all possible prefix lengths (0–32) and checking whether the (network address, prefix length) pair exists in the kv-storage tells you whether the IP is in a block. For example, for the IP `1.2.3.4` you would check:

- Is `(1.2.3.4, 32)` in the storage?
- Is `(1.2.3.4, 31)` in the storage?
- Is `(1.2.3.4, 30)` in the storage?
- ...
- Is `(0.0.0.0, 0)` in the storage?

If the kv-storage is fast, this is effectively constant time (at most 33 constant-time queries).
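A rough in-memory sketch of that probe loop, with a `HashMap` standing in for the kv-storage (the key layout and names are mine):

```rust
use std::collections::HashMap;
use std::net::Ipv4Addr;

/// Mask an IP down to the network address of the given prefix length.
fn network(ip: u32, prefix: u8) -> u32 {
    if prefix == 0 { 0 } else { ip & (u32::MAX << (32 - prefix)) }
}

/// Check whether `ip` falls inside any stored block by probing all 33
/// possible (network address, prefix length) keys. With a constant-time
/// kv store this is at most 33 * O(1) per query.
fn in_any_block(blocks: &HashMap<(u32, u8), ()>, ip: Ipv4Addr) -> bool {
    let ip = u32::from(ip);
    (0..=32u8).any(|prefix| blocks.contains_key(&(network(ip, prefix), prefix)))
}

fn main() {
    // Stand-in for the on-disk kv-storage: key = (network address, prefix length).
    let mut blocks = HashMap::new();
    blocks.insert((u32::from(Ipv4Addr::new(1, 2, 3, 0)), 24), ());

    assert!(in_any_block(&blocks, Ipv4Addr::new(1, 2, 3, 4)));
    assert!(!in_any_block(&blocks, Ipv4Addr::new(1, 2, 4, 4)));
}
```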
Another possibility is to use an interval/segment tree to store all known IP blocks and then do a stabbing query to see whether the single IP falls in any stored range. This needs more memory (O(n) for an interval tree, up to O(n log n) for a segment tree), but queries should also be O(log n).
For the third query type, the same tricks as above could be used.
For the fourth query type, binary search would also work. If you sort all ipsets, then, treating the prefix as a (min, max) integer range, you can search the sorted ipset for the first IP >= min and check whether it's also <= max. Again an O(n) => O(log n) improvement.
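A sketch of that range check against a sorted ipset (again, names are mine):

```rust
use std::net::Ipv4Addr;

/// The (min, max) integer range covered by `net/prefix`.
fn net_range(net: Ipv4Addr, prefix: u8) -> (u32, u32) {
    let mask = if prefix == 0 { 0 } else { u32::MAX << (32 - prefix) };
    let min = u32::from(net) & mask;
    (min, min | !mask)
}

/// Does the sorted ipset contain at least one IP inside `net/prefix`?
/// Binary search for the first IP >= min, then check that it is <= max.
fn net_overlaps_ipset(sorted: &[u32], net: Ipv4Addr, prefix: u8) -> bool {
    let (min, max) = net_range(net, prefix);
    let idx = sorted.partition_point(|&ip| ip < min);
    idx < sorted.len() && sorted[idx] <= max
}

fn main() {
    // Sorted ipset with a single entry: 1.2.3.200
    let sorted = vec![u32::from(Ipv4Addr::new(1, 2, 3, 200))];
    assert!(net_overlaps_ipset(&sorted, Ipv4Addr::new(1, 2, 3, 0), 24));
    assert!(!net_overlaps_ipset(&sorted, Ipv4Addr::new(1, 2, 4, 0), 24));
}
```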
Disk-storage
Another idea I had was to use a very efficient on-disk key-value storage such as CDB. This DB would be updated when pulling the repo. Keys would be IPs and (ip, block) pairs. It would provide fast queries for the first three use cases without increasing the memory footprint, but requires more effort and disk space.
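Whichever kv library ends up being used (CDB or something else), the key layout could be as simple as fixed-width big-endian bytes; this is just an illustration, not an existing format in the repo:

```rust
use std::net::Ipv4Addr;

/// Key for a plain IP entry: 4 big-endian bytes.
fn ip_key(ip: Ipv4Addr) -> [u8; 4] {
    u32::from(ip).to_be_bytes()
}

/// Key for an (ip, block) pair: 4 big-endian bytes plus 1 prefix-length byte.
/// The value stored under the key could be e.g. the names of the lists the
/// entry came from.
fn net_key(net: Ipv4Addr, prefix: u8) -> [u8; 5] {
    let ip = u32::from(net).to_be_bytes();
    [ip[0], ip[1], ip[2], ip[3], prefix]
}

fn main() {
    println!("{:?}", ip_key(Ipv4Addr::new(1, 2, 3, 4)));      // [1, 2, 3, 4]
    println!("{:?}", net_key(Ipv4Addr::new(1, 2, 3, 0), 24)); // [1, 2, 3, 0, 24]
}
```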
Maybe also a disk/memory query cache that gets reset when pulling the repo?
Prefilter queries (e.g. Bloom filter)
Before even starting to evaluate the checks above, you could limit which ipsets and netsets need to be checked by precomputing facts like: for a given ip/net prefix 000010101..., no set contains any IPs or nets starting with that prefix. These prefilters could be computed when loading the files or updating the DB and stored on disk. Quite a lot of effort for not that big a perf boost, but yeah..
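A simple deterministic variant of this idea, using a per-set bitmap over /16 prefixes instead of a probabilistic Bloom filter (the struct and names are mine, just to show the shape):

```rust
use std::net::Ipv4Addr;

/// Prefilter for one ipset/netset: a bitmap over the top 16 bits of the
/// IPv4 address space (65536 bits = 8 KiB per set). If the bit for a
/// query's /16 prefix is unset, nothing in this set can match and the
/// full lookup for this set can be skipped.
struct PrefixFilter {
    bits: Vec<u64>,
}

impl PrefixFilter {
    fn new() -> Self {
        PrefixFilter { bits: vec![0u64; 65536 / 64] }
    }

    /// Mark the /16 prefix of an entry. (A net wider than /16 would need
    /// every /16 it covers marked; omitted here for brevity.)
    fn insert(&mut self, ip: Ipv4Addr) {
        let p = (u32::from(ip) >> 16) as usize;
        self.bits[p / 64] |= 1u64 << (p % 64);
    }

    /// False means "definitely not in this set"; true means "do the real check".
    fn might_contain(&self, ip: Ipv4Addr) -> bool {
        let p = (u32::from(ip) >> 16) as usize;
        self.bits[p / 64] & (1u64 << (p % 64)) != 0
    }
}

fn main() {
    let mut filter = PrefixFilter::new();
    filter.insert(Ipv4Addr::new(1, 2, 3, 4));

    assert!(filter.might_contain(Ipv4Addr::new(1, 2, 9, 9)));  // same /16: still check the set
    assert!(!filter.might_contain(Ipv4Addr::new(8, 8, 8, 8))); // different /16: skip this set
}
```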
Just some ideas :)