Scan whole maxmind database

jbbqqf commented 8 years ago

Hi,

I'm currently working on a Python project involving geolocation. Unfortunately, the reader offers no way to scan a whole database in order to reconstruct it in a custom external format.

The reference Perl implementation does support this, via iterate_search_tree (https://github.com/maxmind/MaxMind-DB-Reader-perl/blob/3fade689fa12708981fe70c5419a12a55561508a/lib/MaxMind/DB/Reader/PP.pm).

I tried to implement a Python version of this function, but I'm not sure whether my results are consistent with what I should get. Here is the code:

def iterate_search_tree(
        self,
        data_callback=lambda: None,
        ip_version=6,
        ):
    if ip_version == 6:
        max_bit_depth = 128
    elif ip_version == 4:
        max_bit_depth = 32
    else:
        raise ValueError('Invalid ip version')

    start_node = self._start_node(max_bit_depth)

    self._iterate_search_tree(
        data_callback=data_callback,
        start_node=start_node,
        bit_depth=1,
        max_bit_depth=max_bit_depth,
        decimal_ip=0,
    )

def _iterate_search_tree(
        self,
        data_callback=lambda: None,
        start_node=0,
        bit_depth=1,
        max_bit_depth=128,
        decimal_ip=0,
        ):

    for bit in [0, 1]:
        node = self._read_node(start_node + bit_depth, bit)

        if bit:
            decimal_ip = decimal_ip | 1 << (max_bit_depth - bit_depth)

        if node < self._metadata.node_count:
            if bit_depth > max_bit_depth:
            self._iterate_search_tree(data_callback=data_callback,
                                      start_node=start_node,
                                      bit_depth=bit_depth+1,
                                      max_bit_depth=max_bit_depth,
                                      decimal_ip=decimal_ip)
        elif node > self._metadata.node_count:
            mask = bit_depth + bit
            data = self._resolve_data_pointer(node)
            data_callback(decimal_ip, mask, data)
        else:
            pass

When I call iterate_search_tree from a trivial script that counts entries via data_callback, I get implausible results: a total of more than 2M IPv4/IPv6 entries, whereas an equivalent script using the reference Perl implementation counts 780K of them (IPv4 and IPv6 combined) for the same database file. I have spent time trying to figure out what is going wrong, but I haven't managed to debug it.

I was wondering whether you have already considered including this kind of feature in your package. If so, could you check what's wrong with my code and potentially include it in your reader?

Thanks,

oschwald commented 8 years ago

I glanced at this and it seemed reasonable. To debug it, I'd probably start with one of the test databases and compare the output. It should be pretty straightforward to see what is going on. The Go reader also provides this functionality if looking at another implementation would be helpful.
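For comparison while debugging, here is a minimal sketch of the traversal on a toy in-memory tree rather than a real database. It assumes the record convention from the MaxMind DB format (a record below node_count points to another node, a record equal to node_count means no data, and a larger record points into the data section). The names iterate_tree, nodes and bit_width are hypothetical and not part of the reader. Two details that may be worth checking against the function above: the sketch descends into the node that the record points to, and it reports the prefix length as the depth of the record.

```python
def iterate_tree(nodes, bit_width):
    """Yield (prefix_value, prefix_length, record) for every data record.

    `nodes` is a hypothetical in-memory stand-in for the search tree: a list
    of (left_record, right_record) pairs, with node 0 as the root.
    """
    node_count = len(nodes)
    stack = [(0, 0, 0)]  # (node number, prefix bits so far, depth)
    while stack:
        node, prefix, depth = stack.pop()
        for bit in (0, 1):
            record = nodes[node][bit]
            # Compute a fresh value per branch so that the bit set for the
            # right branch never leaks into the left one.
            child_prefix = prefix | (bit << (bit_width - depth - 1))
            child_depth = depth + 1
            if record < node_count:
                # Pointer to another node: descend into that node.
                stack.append((record, child_prefix, child_depth))
            elif record > node_count:
                # Pointer into the data section: report the network.
                yield child_prefix, child_depth, record
            # record == node_count means "no data", so nothing is emitted.


# Toy 4-bit tree: node 0 -> (node 1, empty), node 1 -> (data 3, data 4).
toy_nodes = [(1, 2), (3, 4)]
results = list(iterate_tree(toy_nodes, bit_width=4))
```

Running the same kind of walk over one of the small test databases and diffing it against the Perl output should show quickly where the two implementations diverge.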

paravoid commented 5 years ago

So I wanted to get a mapping of autonomous system numbers to names, and surprisingly this wasn't available in an easy to consume form (e.g. CSV) anywhere.

So I gave this a stab, starting from the Perl code and then optimizing it and making the interface a bit more Pythonic. This is it:

from ipaddress import IPv4Network, IPv6Network

def __iter__(self):
    if self._metadata.ip_version == 4:
        start_node = self._start_node(32)
        start_network = IPv4Network((0, 0))
    else:
        start_node = self._start_node(128)
        start_network = IPv6Network((0, 0))

    search_nodes = [(start_node, start_network)]
    while search_nodes:
        node, network = search_nodes.pop()

        if network.version == 6:
            naddr = network.network_address
            if naddr.ipv4_mapped or naddr.sixtofour:
                # skip IPv4-Mapped IPv6 and 6to4 mapped addresses, as these are
                # already included in the IPv4 part of the tree below
                continue
            elif int(naddr) < 2 ** 32 and network.prefixlen == 96:
                # once in the IPv4 part of the tree, switch to IPv4Network
                ipnum = int(naddr)
                mask = network.prefixlen - 128 + 32
                network = IPv4Network((ipnum, mask))

        subnets = list(network.subnets())
        for bit in (0, 1):
            next_node = self._read_node(node, bit)
            subnet = subnets[bit]

            if next_node > self._metadata.node_count:
                data = self._resolve_data_pointer(next_node)
                yield (subnet, data)
            elif next_node < self._metadata.node_count:
                search_nodes.append((next_node, subnet))

Notes:

I've tested this only with the GeoIP2-ISP database, with both Python 3.7 and 2.7.
This is meant to be included as a method of the Reader class; it's a generator, so to use it one can just iterate over the Reader object, like for network, data in isp_reader: […], where network is an IPv4Network or IPv6Network and data is the data for that network.
While I originally implemented this using recursion, I've opted to replace that with a list of two-tuples used as a stack, for two reasons: a) to avoid hitting (or fiddling with) Python's recursion limits, and b) to avoid using yield from (which is a newer construct) or ugly for/yield iterations, and more generally, to avoid generator recursion.
The order now is LIFO (i.e. pseudo-depth first), because that's simpler and more efficient; if a different order is desired, one could use collections.deque to implement it differently and still efficiently. I don't think order should matter, or that the API should make any guarantees about that, though.
The code currently prunes the search tree at ::ffff:0:0/96 (IPv4-mapped) and 2002::/16 (6to4), as these are just pointers to the IPv4 space in the MaxMind databases. This is an optimization to avoid iterating over the IPv4 space three times, but I could see how it could be considered technically wrong too, as these IPv6 subnets do exist in the database.
More broadly, if I'm reading the spec right, how IPv4 addresses are represented in IPv6 databases is an implementation detail left to the vendor, and thus the whole if network.version == 6 block is currently MaxMind implementation-specific and possibly doesn't belong in this library. However, besides the speed optimization mentioned above, it also makes the API much more sensible, by returning IPv4Networks for IPv4 networks, instead of e.g. IPv6Network("::100:100/120") for 1.0.1.0/24.

Note: the above code snippet is Copyright 2018 Faidon Liambotis and dual-licensed under 1) the Apache License, Version 2.0 (SPDX identifier Apache-2.0) as published by the Apache Software Foundation: https://www.apache.org/licenses/LICENSE-2.0, 2) the 0-clause BSD license (SPDX identifier 0BSD), as published e.g. by the Open Source Initiative: https://opensource.org/licenses/0BSD.
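The IPv4 conversion described in the last note can be illustrated in isolation. The following is a small sketch using only the standard ipaddress module; the helper name to_ipv4_if_embedded is hypothetical, and the assumption that IPv4 data lives under ::/96 reflects how MaxMind databases are laid out, not a guarantee of the format.

```python
from ipaddress import IPv4Network, IPv6Network


def to_ipv4_if_embedded(network):
    """Return an IPv4Network for an IPv6 network inside ::/96.

    Any other network is returned unchanged. Assumes the MaxMind
    convention of storing the IPv4 tree under ::/96.
    """
    if (
        network.version == 6
        and network.prefixlen >= 96
        and int(network.network_address) < 2 ** 32
    ):
        # The low 32 bits are the IPv4 address; the prefix shrinks by 96.
        address = int(network.network_address)
        return IPv4Network((address, network.prefixlen - 96))
    return network


converted = to_ipv4_if_embedded(IPv6Network("::100:100/120"))
untouched = to_ipv4_if_embedded(IPv6Network("2001:db8::/32"))
```

Here converted is the IPv4 form of the example network from the note, while a network outside ::/96 passes through as is.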

oschwald commented 5 years ago

At a high level, this looks great! The one thing that I am not sure about is the pruning of the IPv4-mapped IPv6 networks. Although the MaxMind databases do map these (and 2001::/32 to IPv4), it is not part of the format and may not be the right thing to do on all databases. Perhaps this could be made optional somehow.

Also, if we do add this to the reader, I think we will want to update the C extension to also support it. libmaxminddb should have the required functionality already. The Perl XS reader supports iterating over the databases.

volans- commented 1 year ago

Is there any news on this feature? I was looking for a way to extract a mapping of ASN -> list of prefixes, and because AFAIK it's not available in any of the offered DBs, I ended up here. I can confirm that the patch proposed above by paravoid still works with the current version on PyPI. On my laptop, it took 2.5 minutes and 750MB of RAM to build a mapping of ASN -> set(prefixes) for the whole GeoIP2-ISP.mmdb DB.
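A mapping like the one described here could be built roughly as follows. This is a sketch, not the script that was actually used: prefixes_by_asn is a hypothetical helper, it accepts any iterable of (network, data) pairs such as the generator proposed above, and it assumes the record key autonomous_system_number, which is the field name used in MaxMind's ISP and ASN databases. The sample data uses documentation prefixes and private-use ASNs so the example runs without a database file.

```python
from collections import defaultdict
from ipaddress import ip_network


def prefixes_by_asn(records):
    """Group networks by ASN from an iterable of (network, data) pairs."""
    mapping = defaultdict(set)
    for network, data in records:
        # Assumed field name; records without an ASN are skipped.
        asn = data.get("autonomous_system_number")
        if asn is not None:
            mapping[asn].add(network)
    return dict(mapping)


# Stand-in for iterating over a reader: made-up sample records.
sample = [
    (ip_network("192.0.2.0/24"), {"autonomous_system_number": 64496}),
    (ip_network("198.51.100.0/24"), {"autonomous_system_number": 64496}),
    (ip_network("2001:db8::/32"), {"autonomous_system_number": 64497}),
    (ip_network("203.0.113.0/24"), {"isp": "Example ISP"}),
]
asn_map = prefixes_by_asn(sample)
```

Keeping a set per ASN holds every network object in memory at once, which is consistent with the memory footprint mentioned above; writing rows out as they are produced would be an alternative if only a CSV is needed.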

oschwald commented 1 year ago

There is no news on this feature. To be included in the reader, we would still need to address everything in my previous comment and add appropriate tests. With regard to skipping the aliased nodes, it would probably be better if it were implemented more like the Go reader and also made optional (but defaulting to skipping makes sense).

oschwald commented 11 months ago

Thanks for the request. We have implemented this and it will be in our next release.

maxmind / MaxMind-DB-Reader-python

Scan whole maxmind database #23