lamehost / aggregate-prefixes

Fast IPv4 and IPv6 prefix aggregator written in pure Python (no dependency required)
MIT License

(talk) performance #5

Closed elandorr closed 1 year ago

elandorr commented 1 year ago

Since I'm dealing with large (100k+) sets, I was looking to gain speed.

PyPy works for this module out of the box and is roughly 4x faster at best (it varies, of course, but seems to stabilize at a reliable 2x). That's a big recommendation from me.

But what I found insane is iprange; here's a comparison:

Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.07
Maximum resident set size (kbytes): 2492

compared to my optimized PyPy build with aggregate-prefixes:

Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.64 (22x)
Maximum resident set size (kbytes): 153872 (62x)

compared to my optimized CPython build with the same:

Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.28 (61x)
Maximum resident set size (kbytes): 100668 (40x)

Is this just a limitation of Python, or does iprange have some magical algorithm?

What's especially interesting to me is how it manages with very minimal memory. During the cron'd processing I try to avoid large spikes, and memory usage seems to increase rapidly the bigger your sets get.

aggregate-prefixes is my new standard tool, but I felt it might be worth looking at the algorithms. Even just for fun if you're bored.
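For context on why a tool like iprange can be so much leaner: it is written in C, but the bigger win is likely algorithmic. Range-merge aggregators typically convert every prefix to a pair of integers, sort once, and merge in a single linear pass, never building per-prefix objects. A minimal sketch of that approach in pure Python (all names here are illustrative, not from either tool):

```python
# Sketch of the integer range-merge technique: each IPv4 prefix becomes
# an inclusive (start, end) integer pair; one sort plus one linear pass
# merges all overlapping or adjacent ranges.
import socket
import struct

def prefix_to_range(prefix: str) -> tuple[int, int]:
    """Convert 'a.b.c.d/len' to an inclusive (start, end) integer range."""
    addr, _, plen = prefix.partition("/")
    length = int(plen) if plen else 32
    start = struct.unpack("!I", socket.inet_aton(addr))[0]
    mask = (0xFFFFFFFF << (32 - length)) & 0xFFFFFFFF
    start &= mask                       # zero out host bits
    return start, start | (mask ^ 0xFFFFFFFF)

def merge_ranges(prefixes):
    """Merge overlapping/adjacent ranges in one pass over the sorted list."""
    ranges = sorted(prefix_to_range(p) for p in prefixes)
    merged = []
    for start, end in ranges:
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous range: extend it in place.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

For example, `merge_ranges(["10.0.0.0/25", "10.0.0.128/25"])` collapses the two /25s into one range covering 10.0.0.0/24. Memory stays proportional to the number of ranges (two ints each), which may explain the gap against object-per-prefix approaches.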

elandorr commented 1 year ago

As I had to test it, here for everyone else:

A step = ~10k lines with large overlaps. The final result is 100k ±15%. So batching ~100k at once is the sweet spot with such a dataset, before it goes berserk.

Beyond that, get your swap ready to rumble.
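The batching idea above can be wrapped generically: aggregate each chunk, keep only the (much smaller) aggregated output, and run one final pass over the union. A stdlib-only sketch, where `aggregate` is a placeholder stand-in for whatever aggregator you actually use:

```python
# Batch-wise aggregation to cap peak memory: aggregate every chunk of N
# input lines, then aggregate the union of the partial results.
from itertools import islice

def chunked(iterable, size):
    """Yield successive lists of up to `size` items."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

def aggregate(prefixes):
    # Placeholder aggregator: just dedupe and sort. Swap in the real one
    # (e.g. aggregate-prefixes) here.
    return sorted(set(prefixes))

def aggregate_in_batches(lines, batch_size=100_000):
    partial = []
    for chunk in chunked(lines, batch_size):
        partial.extend(aggregate(chunk))  # keep only aggregated output
    return aggregate(partial)             # final pass over partial results
```

Peak memory is then bounded by one batch plus the running partial result, at the cost of one extra aggregation pass at the end.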

lamehost commented 1 year ago

I've looked into memory consumption with memray and here are the results:

$ memray tree memray-aggregate_prefixes.864478.bin

Allocation metadata
-------------------
Command line arguments: '/home/marco/aggregate-prefixes/venv/lib/python3.10/site-packages/memray/__main__.py run -m aggregate_prefixes'
Peak memory size: 360.593MB
Number of allocations: 14660

Biggest 10 allocations:
-----------------------
📂 331.659MB (100.00 %) <ROOT>  
└── [[4 frames hidden in 2 file(s)]]
    └── 📂 331.659MB (100.00 %) <module>  /home/marco/aggregate-prefixes/aggregate_prefixes/__main__.py:152
        ├── 📂 286.659MB (86.43 %) main  /home/marco/aggregate-prefixes/aggregate_prefixes/__main__.py:147
        │   ├── [[1 frames hidden in 1 file(s)]]
        │   │   └── 📂 159.000MB (47.94 %) ip_network  /usr/lib/python3.10/ipaddress.py:74
        │   │       ├── [[2 frames hidden in 1 file(s)]]
        │   │       │   └── 📄 131.000MB (39.50 %) _ip_int_from_string  /usr/lib/python3.10/ipaddress.py:1195
        │   │       └── 📄 28.000MB (8.44 %) __init__  /usr/lib/python3.10/ipaddress.py:1505
        │   ├── 📂 75.000MB (22.61 %) aggregate_prefixes  /home/marco/aggregate-prefixes/aggregate_prefixes/aggregate_prefixes.py:199
        │   │   ├── [[3 frames hidden in 3 file(s)]]
        │   │   │   └── 📄 69.000MB (20.80 %) __get__  /usr/lib/python3.10/functools.py:983
        │   │   └── [[1 frames hidden in 1 file(s)]]
        │   │       └── 📄 6.000MB (1.81 %) __add__  /usr/lib/python3.10/ipaddress.py:599
        │   ├── 📂 47.000MB (14.17 %) aggregate_prefixes  /home/marco/aggregate-prefixes/aggregate_prefixes/aggregate_prefixes.py:200
        │   │   ├── [[3 frames hidden in 3 file(s)]]
        │   │   │   └── 📄 23.000MB (6.93 %) __get__  /usr/lib/python3.10/functools.py:983
        │   │   ├── [[3 frames hidden in 2 file(s)]]
        │   │   │   └── 📄 14.000MB (4.22 %) _string_from_ip_int  /usr/lib/python3.10/ipaddress.py:1246
        │   │   └── [[4 frames hidden in 2 file(s)]]
        │   │       └── 📄 10.000MB (3.02 %) _ip_int_from_string  /usr/lib/python3.10/ipaddress.py:1195
        │   └── 📄 5.659MB (1.71 %) aggregate_prefixes  /home/marco/aggregate-prefixes/aggregate_prefixes/aggregate_prefixes.py:179
        ├── 📄 26.000MB (7.84 %) main  /home/marco/aggregate-prefixes/aggregate_prefixes/__main__.py:118
        └── 📄 19.000MB (5.73 %) main  /home/marco/aggregate-prefixes/aggregate_prefixes/__main__.py:117

Not surprisingly, ipaddress generates most of the load, as aggregate-prefixes relies heavily on it. It is also interesting to see how it spends memory to save CPU cycles by caching with functools.

lamehost commented 1 year ago

Anyway, thanks for raising this. I would never have handled ed1018d without you.

elandorr commented 1 year ago

> It is also interesting to see how it trades CPU cycles for memory by using functools.

Interesting. I don't know enough and haven't read the ipaddress code to judge, but pure caching shouldn't cause it to go exponential. We only need the full list once, and depending on the sorting algorithm maybe n times, but in either case O(n) would seem logical. iprange can do it too, with a thousandth of the memory, after all.

wc -l monster2.lst
26210140 monster2.lst

\time -v iprange monster2.lst >/dev/null
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:06.06
Maximum resident set size (kbytes): 411112

iprange is perfectly linear. Note that this is just the same source list multiplied, so the end result is the same. It's even somewhat surprising that there is a linear increase, given that it should never need more memory than one full list requires, as it could be deduped after every copy.
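The "deduped after every copy" idea is cheap to do on the input side: stream the file and keep only unique lines in a set, so memory tracks the number of unique prefixes rather than total input lines. A small sketch (function name is mine):

```python
# Streaming dedup before aggregation: for a multiplied list like
# monster2.lst, memory stays proportional to the unique prefix count,
# not the 26M raw lines.
def unique_prefixes(path: str) -> set[str]:
    seen = set()
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line:
                seen.add(line)
    return seen
```

Feeding the deduped set into the aggregator instead of the raw file should flatten the linear growth for this particular test case.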

(optimized custom PyPy build with LTO and native)
\time -v aggregate-prefixes monster2.lst >/dev/null
Elapsed (wall clock) time (h:mm:ss or m:ss): 9:19.84
Maximum resident set size (kbytes): 8880792

It ate every last byte and inhaled the swap like the Cookie Monster. Interestingly enough, it avoided tickling the OOM killer, which really surprises me. My experience with the OOM algorithm is that it kills first and asks questions later, but Python avoided that. Not sure anything else got killed; I hope not. It was a somewhat dumb thing to do, though: we had a full-blown downtime during that run. It froze everything so hard that not even a ping made it through.

I'm probably just overthinking this and should stop wasting your time. nft already takes a while to eat ~120k, not to mention that even just dumping counters takes forever. (Why on earth would it seemingly go through everything just to dump a few counts? Seems odd coming from high-end kernel geniuses.) More prefixes make no sense for normal people. We can batch the process at maximum niceness and accept a few hundred MB of RAM temporarily. I just come from a different kind of ideology and have always admired embedded people and the demoscene, so my first thought was: "how do I get this down?" It's 2022 and a popular tool to dd a USB key eats 200M, even if I refuse to accept that.

I also don't know enough IP math to even try to improve such algorithms. You are the IP expert here, and I'm grateful you shared a useful tool that handles the edge cases. We won't rewrite Python's ipaddress. Glad I can deal with IPv4 basics to run what we need. About IPv6, a short summary: https://dmca.fileditch.com/ipv6.mp4.

> I would have never handled ed1018d without you

Neat!

Happy Christmas! Hope you have some peace amongst the scheduled consumerism.

lamehost commented 1 year ago

Hello,

Thank you for the insights.

I have drafted something that doesn't use ipaddress and relies on self-made code instead. The results are good, but not great:

$ python3 -m memray tree memray-aggregate_prefixes.905222.bin 

Allocation metadata
-------------------
Command line arguments: '/home/marco/aggregate-prefixes/venv/lib/python3.10/site-packages/memray/__main__.py run -m aggregate_prefixes'
Peak memory size: 303.620MB
Number of allocations: 6935

Biggest 10 allocations:
-----------------------
📂 282.023MB (100.00 %) <ROOT>  
└── [[4 frames hidden in 2 file(s)]]
    └── 📂 282.023MB (100.00 %) <module>  /home/marco/aggregate-prefixes/aggregate_prefixes/__main__.py:152
        ├── 📂 237.023MB (84.04 %) main  /home/marco/aggregate-prefixes/aggregate_prefixes/__main__.py:147
        │   ├── 📂 195.000MB (69.14 %) aggregate_prefixes  /home/marco/aggregate-prefixes/aggregate_prefixes/aggregate_prefixes.py:271
        │   │   ├── 📄 60.000MB (21.27 %) __init__  /home/marco/aggregate-prefixes/aggregate_prefixes/aggregate_prefixes.py:53
        │   │   └── 📂 108.000MB (38.29 %) __init__  /home/marco/aggregate-prefixes/aggregate_prefixes/aggregate_prefixes.py:46
        │   │       ├── 📄 40.000MB (14.18 %) parse_ipv4  /home/marco/aggregate-prefixes/aggregate_prefixes/aggregate_prefixes.py:82
        │   │       ├── 📄 31.000MB (10.99 %) parse_ipv4  /home/marco/aggregate-prefixes/aggregate_prefixes/aggregate_prefixes.py:75
        │   │       ├── 📄 23.000MB (8.16 %) parse_ipv4  /home/marco/aggregate-prefixes/aggregate_prefixes/aggregate_prefixes.py:76
        │   │       ├── 📄 7.000MB (2.48 %) parse_ipv4  /home/marco/aggregate-prefixes/aggregate_prefixes/aggregate_prefixes.py:81
        │   │       └── 📄 7.000MB (2.48 %) parse_ipv4  /home/marco/aggregate-prefixes/aggregate_prefixes/aggregate_prefixes.py:80
        │   └── [[1 frames hidden in 1 file(s)]]
        │       └── 📄 42.023MB (14.90 %) <lambda>  /home/marco/aggregate-prefixes/aggregate_prefixes/aggregate_prefixes.py:287
        ├── 📄 28.000MB (9.93 %) main  /home/marco/aggregate-prefixes/aggregate_prefixes/__main__.py:118
        └── 📄 17.000MB (6.03 %) main  /home/marco/aggregate-prefixes/aggregate_prefixes/__main__.py:117

It's definitely an improvement, but not life-changing, and the trade-off is bad anyway. I like that aggregate-prefixes returns standard python3 objects, and I don't think we should give that up for memory consumption.
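The draft's `parse_ipv4` isn't shown in the thread; a minimal int-based parser along those lines might look like the sketch below (names and structure are my assumption, not the actual diff). Converting back to `ipaddress` objects only at the boundary keeps the public API returning standard python3 types:

```python
# Int-based IPv4 parsing: work with plain (int, prefixlen) pairs
# internally, and build ipaddress objects only for the final output.
import ipaddress

def parse_ipv4(prefix: str) -> tuple[int, int]:
    """'a.b.c.d/len' -> (network address as int, prefix length)."""
    addr, _, plen = prefix.partition("/")
    a, b, c, d = (int(octet) for octet in addr.split("."))
    value = (a << 24) | (b << 16) | (c << 8) | d
    length = int(plen) if plen else 32
    value &= (0xFFFFFFFF << (32 - length)) & 0xFFFFFFFF  # zero host bits
    return value, length

def to_network(value: int, length: int) -> ipaddress.IPv4Network:
    # ipaddress accepts an (int, prefixlen) 2-tuple, skipping the
    # expensive string re-parsing seen in the memray trace.
    return ipaddress.IPv4Network((value, length))
```

This keeps peak memory to two small ints per prefix during aggregation while callers still receive familiar `IPv4Network` objects at the end.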