python / cpython

The Python programming language
https://www.python.org
Other
61.89k stars 29.77k forks source link

Better exclude_address function in ipaddress.py #97610

Open ankenyr opened 1 year ago

ankenyr commented 1 year ago

Feature or enhancement

Increase the speed of exclude_address by treating IPs as integer ranges making exclude_address O(1).

Pitch

Currently exclude_address will take a subnet and split it by calling subnets()

import ipaddress
>>> ip = ipaddress.IPv4Network('10.0.0.0/8')
>>> [i for i in ip.subnets()]
[IPv4Network('10.0.0.0/9'), IPv4Network('10.128.0.0/9')]

each time it is split, it checks which one contains the subnet being excluded and splits again. This ends up being a lot of work.

Instead I propose we can use the fact that IPs are represented as integers behind the scene

For a given subnet, it can be thought of as an integer range between its network and broadcast address

>>> ip.network_address._ip
167772160
>>> ip.broadcast_address._ip
184549375

Given these can be treated as an integer range, excluding is just removing the range integers you wish to exclude

>>> [i for i in ipaddress.summarize_address_range(
    ipaddress.IPv4Address(
        ip.network_address._ip),
        ipaddress.IPv4Address(exclude_ip.network_address._ip-1
))]
[IPv4Network('10.0.0.0/22')]

>>> [i for i in ipaddress.summarize_address_range(
    ipaddress.IPv4Address(
        exclude_ip.broadcast_address._ip)+1,
        ipaddress.IPv4Address(ip.broadcast_address._ip
))]
[IPv4Network('10.0.5.0/24'), IPv4Network('10.0.6.0/23'), IPv4Network('10.0.8.0/21'), IPv4Network('10.0.16.0/20'), IPv4Network('10.0.32.0/19'), IPv4Network('10.0.64.0/18'), IPv4Network('10.0.128.0/17'), IPv4Network('10.1.0.0/16'), IPv4Network('10.2.0.0/15'), IPv4Network('10.4.0.0/14'), IPv4Network('10.8.0.0/13'), IPv4Network('10.16.0.0/12'), IPv4Network('10.32.0.0/11'), IPv4Network('10.64.0.0/10'), IPv4Network('10.128.0.0/9')]
>>> sorted([i for i in ip.address_exclude(
    ipaddress.IPv4Network('10.0.4.0/24'
))])
[IPv4Network('10.0.0.0/22'), IPv4Network('10.0.5.0/24'), IPv4Network('10.0.6.0/23'), IPv4Network('10.0.8.0/21'), IPv4Network('10.0.16.0/20'), IPv4Network('10.0.32.0/19'), IPv4Network('10.0.64.0/18'), IPv4Network('10.0.128.0/17'), IPv4Network('10.1.0.0/16'), IPv4Network('10.2.0.0/15'), IPv4Network('10.4.0.0/14'), IPv4Network('10.8.0.0/13'), IPv4Network('10.16.0.0/12'), IPv4Network('10.32.0.0/11'), IPv4Network('10.64.0.0/10'), IPv4Network('10.128.0.0/9')]

Previous discussion

https://discuss.python.org/t/ipaddress-py-exclude-address-speed-up/19445

ankenyr commented 1 year ago

Hi, just checking in on this. If this sounds good I am happy to create a PR.

StephDC commented 2 months ago

Anyone interested in doing some benchmarks for these two approaches?

I was worried about that the summarize_address_range to be the most expensive operation and eats up all remaining speed improvement it brings. IMO hiding the splitting loop part into this function call that uses loops internally might be detrimental to the readability to those who are not so familiar with IP, unless there are more significant benefits.

ankenyr commented 2 months ago

Happy to do any work, I see @iritkatriel added the label of stdlib to this. @iritkatriel from the link to the previous discussion Guido mentions that there should be a discussion on the approach. https://discuss.python.org/t/ipaddress-py-exclude-address-speed-up/19445/7

Can you let me know if my outline in the first message is enough and if I can go ahead with a PR? If not please advise on the next steps.

@StephDC I have implemented this in my own library and have found improvements. Happy to do some benchmarks though.

StephDC commented 2 months ago

It is great that you have implemented it and found improvements. However, can you show us some real numbers so we can all agree that there are some improvements that worth the merge?

It would be helpful to include the following information:

Execution time for the same function with (columns)

  1. the current implementation
  2. your implementation

for at least the following examples: (rows)

  1. removing a whole block, e.g. 127.0.0.0/8 - 127.0.0.0/9
  2. removing a block of different size in the middle, e.g. 127.0.0.0/8 - 127.128.0.0/10
  3. removing a single IP, e.g. 127.0.0.0/8 - 127.1.191.82/32

And don't forget to check that both function return the same result.