kfeldmann / cidrmerge

Merge and de-dupe overlapping and adjacent IP address ranges (CIDRs).
BSD 3-Clause "New" or "Revised" License

Fails with ip lists > 4k lines #4

Open HawaiiWes68 opened 3 years ago

HawaiiWes68 commented 3 years ago

I was hoping to use this for IP blocking, taking lists of IPs by country and merging them all together, but it seems that it gets to a point where it uses 99.9% of the available processor and 250 MB of RAM, and then basically stops. My guess is the objects/arrays fill up to the max memory/processor point, and then it basically has nowhere to go. I could try to break the IPs into smaller sets (1k lines?), then merge them back into each other one at a time, but even then it's going to reach a point where the total is > 4k lines and break. Any chance you'd want to rewrite this so that it works on larger data sets?

kfeldmann commented 3 years ago

Thank you for your feedback. I have not tested with a list of that size, but I think I will now, since you've piqued my curiosity.

One thing to keep in mind is that cidrmerge will take longer to process a larger list. It compares each CIDR to every other CIDR in the list, and repeats the loop until no more changes can be made. It might be that cidrmerge has not crashed, but is taking longer than you expect. You can turn on debugging output by setting the environment variable DEBUG. With debugging on, there should be a constant stream of output while cidrmerge is still working. If cidrmerge "locks up", you will notice that the debugging output stops being printed.

Example: cat biglist | DEBUG=1 ./cidrmerge

I'll try with a larger list and let you know what I find.
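The repeat-until-stable, all-pairs approach described above can be sketched in Python. This is a hypothetical simplification using the standard ipaddress module, not cidrmerge's actual code:

```python
import ipaddress

def merge_cidrs(cidrs):
    """Compare every CIDR with every other CIDR, merge what can be
    merged, and repeat the whole loop until a full pass changes nothing."""
    nets = [ipaddress.ip_network(c) for c in cidrs]
    changed = True
    while changed:
        changed = False
        for i in range(len(nets)):
            for j in range(len(nets)):
                if i == j or nets[i] is None or nets[j] is None:
                    continue
                a, b = nets[i], nets[j]
                if b.subnet_of(a):
                    # b is covered by a (or equal to it): drop b
                    nets[j] = None
                    changed = True
                elif a.prefixlen == b.prefixlen and a.supernet() == b.supernet():
                    # adjacent sibling blocks: replace both with their supernet
                    nets[i] = a.supernet()
                    nets[j] = None
                    changed = True
        nets = [n for n in nets if n is not None]
    return sorted(str(n) for n in nets)
```

The nested loops are why runtime grows quadratically with list size while memory stays modest: no extra data structure is built beyond the list itself.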

HawaiiWes68 commented 3 years ago

Hey-

For what it's worth, I would think that compiling all the IPs into one file first and doing a natural sort (natsort) would mean it does not have to compare every IP to every other one; or at least one set could be finished off when processing reaches the next range of numbers. I did let it sit for a long time crunching on the data, and my server is brand new. Also, I would like it to merge IPs when two are less than X numbers apart, since they are likely in the same set anyway. I'm pretty sure that would reduce the by-country IP sets by 70% or more. I certainly don't want or need it listing a ton of /24s; maybe only include /16s and larger sets.

Basically I'm using this to firewall-block most of the planet, which is non-stop attacking every other IP on the planet. I just want to include English-speaking countries and Mexico, since my software/service will only be used in those areas.

And I'm not suggesting you make these changes; I'll likely have to build this myself at some point. But I think the first suggestion would help cidrmerge: you could clear out the arrays as you go, reducing the memory required, and maybe even the processing required.
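The sort-first optimization suggested here is sound: once the list is sorted, each network only needs to be compared with its immediate neighbors. As an illustration (not cidrmerge's code), Python's standard ipaddress.collapse_addresses implements exactly this sort-then-sweep merge:

```python
import ipaddress

def merge_sorted(cidr_lines):
    """Sort-based merge: after sorting, each network is only compared
    with its neighbors in a single sweep, instead of with every other
    entry in the list."""
    nets = [ipaddress.ip_network(line.strip())
            for line in cidr_lines if line.strip()]
    # collapse_addresses sorts internally, then merges overlapping and
    # adjacent networks in one pass over the sorted list
    return [str(n) for n in ipaddress.collapse_addresses(nets)]
```

This brings the cost down to roughly O(n log n) for the sort plus a linear merge pass, which is what makes million-line lists practical.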

Thanks,


kfeldmann commented 3 years ago

I was able to merge 10,000 distinct IPs into 2,000 cidrs in just under 14 minutes on an AWS t2.micro (which also had other tasks running during that time).

Here's how I tested. I wrote a script to generate IP addresses:

#!/bin/sh

for A in $(seq 1 10)
do
    for B in $(seq 0 9)
    do
        for C in $(seq 0 9)
        do
            for D in $(seq 0 9)
            do
                echo "${A}.${B}.${C}.${D}/32"
            done
        done
    done
done

This generates 10,000 distinct addresses, in runs of 10 consecutive addresses at a time. The data looks like this:

1.0.0.0/32
1.0.0.1/32
1.0.0.2/32
1.0.0.3/32
1.0.0.4/32
1.0.0.5/32
1.0.0.6/32
1.0.0.7/32
1.0.0.8/32
1.0.0.9/32
1.0.1.0/32
1.0.1.1/32
1.0.1.2/32
...

Cidrmerge is able to merge them like this:

1.0.0.0/29
1.0.0.8/31
1.0.1.0/29
1.0.1.8/31
...
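That split is exactly what the arithmetic predicts: each run of 10 consecutive addresses is 8 + 2, so the minimal cover is one /29 plus one /31. As an independent cross-check (using Python's ipaddress module, not cidrmerge itself):

```python
import ipaddress

# One run of 10 consecutive host addresses, as produced by the generator
run = [ipaddress.ip_network(f"1.0.0.{d}/32") for d in range(10)]

# Collapse them into the fewest possible CIDRs: 10 = 8 + 2 -> /29 + /31
merged = [str(n) for n in ipaddress.collapse_addresses(run)]
print(merged)  # ['1.0.0.0/29', '1.0.0.8/31']
```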

The CPU usage was pegged on one core the whole time, as you might expect. The memory usage was actually quite low: I didn't watch the whole run, but the resident set size climbed slowly to about 9 MB while I was watching, so memory doesn't seem to be an issue at all.

Thank you for your optimization recommendations. I'll keep those for the future. So far, the focus has been to make sure that cidrmerge is thorough and finds every possible merge.

I understand your use-case, but it's not (at this time) a goal of cidrmerge. The focus has been for cidrmerge to be accurate. The output should represent exactly the same IP address space as the input (de-duped and expressed in as few cidrs as possible).

I'm thinking that since I was able to merge 10,000 IPs, this issue (fails with >4k lines) can be closed. Do you have any specific input data that produces different results than mine (hangs, crashes, etc.)?

Thank you again for providing feedback and advice.

Heidistein commented 1 year ago

I have changed the merge loop a bit. I will create a pull request.

I was able to churn through a 1.4M-line list in just under 6 minutes, reducing it to 104K lines. I also need it for IPv6, so I will look into that too.

Heidistein commented 1 year ago

Right. I am sorry: it became a complete rewrite. Please see what you want to reuse:

https://gist.github.com/Heidistein/114baada5dd02cc174398d77418c2afd

If you {apt,yum} install bgpq4, you can create a test list (add -6 for IPv6): bgpq4 AS-FRYS-IX-CONNECTED -F '%n/%l\n' > tmp/bgp/frysix

Other large (guaranteed fragmented) AS-sets are AS-NL-IX-RS, AS-AMS-IX-RS, or any other route-server AS macro.

kfeldmann commented 1 year ago

Awesome. Thank you for sharing. I will dig into this when I have time. Thanks also for the tip about creating large test sets using bgp data.