elandorr closed this issue 1 year ago
As I had to test it, here for everyone else:
One step = ~10k lines with large overlaps; the final result is ~100k ±15%. So batching ~100k at once is the sweet spot with such a dataset, before it goes berserk.
Beyond that, get your swap ready to rumble.
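For anyone wanting to automate the batching, here is a minimal stdlib-only sketch. ipaddress.collapse_addresses stands in for the aggregation step (the real tool handles more edge cases); the chunk size and function names are made up for illustration:

```python
import ipaddress
from itertools import islice

def aggregate(prefixes):
    """Collapse a batch of IPv4 prefixes (stdlib stand-in for aggregate-prefixes)."""
    nets = (ipaddress.ip_network(p) for p in prefixes)
    return [str(n) for n in ipaddress.collapse_addresses(nets)]

def aggregate_in_batches(lines, batch_size=100_000):
    """Aggregate each ~100k-line batch first, then aggregate the combined result."""
    it = iter(lines)
    partial = []
    while batch := list(islice(it, batch_size)):
        partial.extend(aggregate(batch))
    return aggregate(partial)
```

Each batch shrinks to its aggregate before the next one is read, so peak memory tracks the batch size plus the partial results, not the whole input.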
I've looked into memory consumption with memray, and here are the results:
$ memray tree memray-aggregate_prefixes.864478.bin
Allocation metadata
-------------------
Command line arguments: '/home/marco/aggregate-prefixes/venv/lib/python3.10/site-packages/memray/__main__.py run -m aggregate_prefixes'
Peak memory size: 360.593MB
Number of allocations: 14660
Biggest 10 allocations:
-----------------------
📂 331.659MB (100.00 %) <ROOT>
└── [[4 frames hidden in 2 file(s)]]
└── 📂 331.659MB (100.00 %) <module> /home/marco/aggregate-prefixes/aggregate_prefixes/__main__.py:152
├── 📂 286.659MB (86.43 %) main /home/marco/aggregate-prefixes/aggregate_prefixes/__main__.py:147
│ ├── [[1 frames hidden in 1 file(s)]]
│ │ └── 📂 159.000MB (47.94 %) ip_network /usr/lib/python3.10/ipaddress.py:74
│ │ ├── [[2 frames hidden in 1 file(s)]]
│ │ │ └── 📄 131.000MB (39.50 %) _ip_int_from_string /usr/lib/python3.10/ipaddress.py:1195
│ │ └── 📄 28.000MB (8.44 %) __init__ /usr/lib/python3.10/ipaddress.py:1505
│ ├── 📂 75.000MB (22.61 %) aggregate_prefixes /home/marco/aggregate-prefixes/aggregate_prefixes/aggregate_prefixes.py:199
│ │ ├── [[3 frames hidden in 3 file(s)]]
│ │ │ └── 📄 69.000MB (20.80 %) __get__ /usr/lib/python3.10/functools.py:983
│ │ └── [[1 frames hidden in 1 file(s)]]
│ │ └── 📄 6.000MB (1.81 %) __add__ /usr/lib/python3.10/ipaddress.py:599
│ ├── 📂 47.000MB (14.17 %) aggregate_prefixes /home/marco/aggregate-prefixes/aggregate_prefixes/aggregate_prefixes.py:200
│ │ ├── [[3 frames hidden in 3 file(s)]]
│ │ │ └── 📄 23.000MB (6.93 %) __get__ /usr/lib/python3.10/functools.py:983
│ │ ├── [[3 frames hidden in 2 file(s)]]
│ │ │ └── 📄 14.000MB (4.22 %) _string_from_ip_int /usr/lib/python3.10/ipaddress.py:1246
│ │ └── [[4 frames hidden in 2 file(s)]]
│ │ └── 📄 10.000MB (3.02 %) _ip_int_from_string /usr/lib/python3.10/ipaddress.py:1195
│ └── 📄 5.659MB (1.71 %) aggregate_prefixes /home/marco/aggregate-prefixes/aggregate_prefixes/aggregate_prefixes.py:179
├── 📄 26.000MB (7.84 %) main /home/marco/aggregate-prefixes/aggregate_prefixes/__main__.py:118
└── 📄 19.000MB (5.73 %) main /home/marco/aggregate-prefixes/aggregate_prefixes/__main__.py:117
Not surprisingly, ipaddress generates most of the load, as aggregate-prefixes relies heavily on it. It is also interesting to see how it trades CPU cycles for memory by using functools.
Anyway, thanks for raising it. I would have never handled ed1018d without you
> It is also interesting to see how it trades CPU cycles for memory by using functools.
Interesting. I don't know enough and haven't read the ipaddress code to judge, but pure caching shouldn't cause it to go exponential. We only need the full list once, and depending on the sorting algorithm maybe n times, but in either case O(n) would seem logical. iprange manages with a thousand times less memory, after all.
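For context: the functools frames in the trace are most likely functools.cached_property, which ipaddress uses for derived values like the broadcast address. The cache lives on each instance, so memory grows with the number of network objects, not exponentially. A tiny illustration with a made-up class:

```python
from functools import cached_property

class Net:
    """Toy stand-in: each instance caches its computed value in __dict__."""
    def __init__(self, n):
        self.n = n

    @cached_property
    def expensive(self):
        # Computed once per instance, then stored on the instance forever,
        # trading memory for CPU on repeated access.
        return self.n * 2

net = Net(21)
assert "expensive" not in net.__dict__   # nothing cached yet
assert net.expensive == 42               # first access computes...
assert "expensive" in net.__dict__       # ...and stores it on the instance
```

With millions of network objects alive at once, all those per-instance caches add up, which would explain memory scaling with object count.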
$ wc -l monster2.lst
26210140 monster2.lst
$ \time -v iprange monster2.lst >/dev/null
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:06.06
Maximum resident set size (kbytes): 411112
iprange is perfectly linear. Do note this is just the same source list multiplied, so the end result is the same. It's even somewhat surprising there is a linear increase, given that it should never need more than one full list requires, since the input could be deduplicated after every copy.
(optimized custom PyPy build with LTO and native)
$ \time -v aggregate-prefixes monster2.lst >/dev/null
Elapsed (wall clock) time (h:mm:ss or m:ss): 9:19.84
Maximum resident set size (kbytes): 8880792
It ate every last byte and inhaled the swap like the Cookie Monster. Interestingly enough, it avoided tickling the OOM killer, which really surprises me. My experience with the OOM algorithm is that it kills first and asks questions later, but Python avoided that. Not sure anything else got killed; I hope not. Running it was somewhat dumb, though: we had a full-blown downtime during that run. It froze everything so hard that not even a ping made it through.
I'm probably just overthinking this and should stop wasting your time. nft already takes a while to eat ~120k, not to mention that even just dumping counters takes forever. (Why on earth would it seemingly go through everything just to dump a few counts? Seems odd coming from high-end kernel geniuses.) More prefixes make no sense for normal people. We can batch the process at maximum niceness and accept a few hundred MB of RAM temporarily. I just come from a different kind of ideology and have always admired embedded people and the demoscene, so my first thought was, "how do I get this down?" It's 2022 and a popular tool to dd a USB key eats 200 MB, even if I refuse to accept that.
I also don't know IP math well enough to even try to improve such algorithms. You are the IP expert here, and I'm grateful you shared a useful tool that handles the edge cases. We won't rewrite Python's ipaddress. Glad I can deal with IPv4 basics to run what we need. About IPv6, a short résumé: https://dmca.fileditch.com/ipv6.mp4.
> I would have never handled ed1018d without you
Neat!
Happy Christmas! Hope you have some peace amongst the scheduled consumerism.
Hello,
Thank you for the insights.
I have drafted something that doesn't use ipaddress and relies on self-made code instead. The results are good, but not great:
$ python3 -m memray tree memray-aggregate_prefixes.905222.bin
Allocation metadata
-------------------
Command line arguments: '/home/marco/aggregate-prefixes/venv/lib/python3.10/site-packages/memray/__main__.py run -m aggregate_prefixes'
Peak memory size: 303.620MB
Number of allocations: 6935
Biggest 10 allocations:
-----------------------
📂 282.023MB (100.00 %) <ROOT>
└── [[4 frames hidden in 2 file(s)]]
└── 📂 282.023MB (100.00 %) <module> /home/marco/aggregate-prefixes/aggregate_prefixes/__main__.py:152
├── 📂 237.023MB (84.04 %) main /home/marco/aggregate-prefixes/aggregate_prefixes/__main__.py:147
│ ├── 📂 195.000MB (69.14 %) aggregate_prefixes /home/marco/aggregate-prefixes/aggregate_prefixes/aggregate_prefixes.py:271
│ │ ├── 📄 60.000MB (21.27 %) __init__ /home/marco/aggregate-prefixes/aggregate_prefixes/aggregate_prefixes.py:53
│ │ └── 📂 108.000MB (38.29 %) __init__ /home/marco/aggregate-prefixes/aggregate_prefixes/aggregate_prefixes.py:46
│ │ ├── 📄 40.000MB (14.18 %) parse_ipv4 /home/marco/aggregate-prefixes/aggregate_prefixes/aggregate_prefixes.py:82
│ │ ├── 📄 31.000MB (10.99 %) parse_ipv4 /home/marco/aggregate-prefixes/aggregate_prefixes/aggregate_prefixes.py:75
│ │ ├── 📄 23.000MB (8.16 %) parse_ipv4 /home/marco/aggregate-prefixes/aggregate_prefixes/aggregate_prefixes.py:76
│ │ ├── 📄 7.000MB (2.48 %) parse_ipv4 /home/marco/aggregate-prefixes/aggregate_prefixes/aggregate_prefixes.py:81
│ │ └── 📄 7.000MB (2.48 %) parse_ipv4 /home/marco/aggregate-prefixes/aggregate_prefixes/aggregate_prefixes.py:80
│ └── [[1 frames hidden in 1 file(s)]]
│ └── 📄 42.023MB (14.90 %) <lambda> /home/marco/aggregate-prefixes/aggregate_prefixes/aggregate_prefixes.py:287
├── 📄 28.000MB (9.93 %) main /home/marco/aggregate-prefixes/aggregate_prefixes/__main__.py:118
└── 📄 17.000MB (6.03 %) main /home/marco/aggregate-prefixes/aggregate_prefixes/__main__.py:117
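A self-made dotted-quad parser of the kind the trace shows boils down to a few integer operations. This is purely illustrative and not the actual draft:

```python
def parse_ipv4(prefix):
    """Parse 'a.b.c.d/len' into (int_address, prefix_length) without ipaddress."""
    addr, _, length = prefix.partition("/")
    octets = [int(o) for o in addr.split(".")]
    if len(octets) != 4 or not all(0 <= o <= 255 for o in octets):
        raise ValueError(f"invalid IPv4 address: {addr!r}")
    a, b, c, d = octets
    # Pack the four octets into one 32-bit integer; default to a /32 host route.
    return (a << 24) | (b << 16) | (c << 8) | d, int(length) if length else 32
```

Working on plain ints like this avoids the per-object overhead of IPv4Network instances, which is where the earlier trace spent most of its memory.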
It's definitely an improvement, but not life-changing, and the trade-off is bad anyway. I like that aggregate-prefixes returns standard Python 3 objects, and I don't think we should give that up for memory consumption.
Since I'm dealing with large (100k+) sets, I was looking to gain speed.
PyPy works for this module out of the box and is roughly 4x faster at best (it varies, of course; it seems to stabilize at a reliable 2x). That's a big recommendation from me.
But what I found insane is iprange; here's a comparison:
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.07
Maximum resident set size (kbytes): 2492
compared to my optimized PyPy build with aggregate-prefixes:
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.64 (22x)
Maximum resident set size (kbytes): 153872 (62x)
compared to my optimized CPython build with the same:
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.28 (61x)
Maximum resident set size (kbytes): 100668 (40x)
Is this just a limitation of Python, or does iprange have some magical algorithm?
What's especially interesting to me is how it manages with very minimal memory. During the cron'd processing I try to avoid large spikes, and memory usage seems to increase rapidly the bigger the sets get.
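For what it's worth, my understanding of iprange's approach (hedged; I haven't read its source closely) is: convert every prefix to an integer range, sort once, sweep and merge overlapping or adjacent ranges, then emit minimal CIDRs. That is O(n log n) time and O(n) memory with no per-object overhead. A Python sketch of that shape for IPv4:

```python
import ipaddress

def merge_prefixes(prefixes):
    """Sort-and-sweep interval merge, then emit minimal CIDR blocks."""
    # 1. Convert each prefix to an inclusive (start, end) integer range.
    ranges = []
    for p in prefixes:
        net = ipaddress.ip_network(p)
        ranges.append((int(net.network_address), int(net.broadcast_address)))
    ranges.sort()
    # 2. Single sweep: merge ranges that overlap or are adjacent.
    merged = []
    for start, end in ranges:
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    # 3. Emit each merged range as minimal CIDR blocks.
    out = []
    for start, end in merged:
        out.extend(str(n) for n in ipaddress.summarize_address_range(
            ipaddress.ip_address(start), ipaddress.ip_address(end)))
    return out
```

A C implementation of the same sweep works on a flat array of 64-bit pairs, which would explain both the speed and the tiny resident set.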
aggregate-prefixes is my new standard tool, but I felt it might be worth looking at the algorithms. Even just for fun, if you're bored.