maxmind / MaxMind-DB-Reader-python

Python MaxMind DB reader extension
https://maxminddb.readthedocs.org/
Apache License 2.0

Scan whole maxmind database #23

Closed jbbqqf closed 11 months ago

jbbqqf commented 8 years ago

Hi,

I'm currently working on a Python project involving geolocation. Unfortunately, the reader offers no way to scan a whole database in order to reconstruct it in a custom external format.

The reference Perl implementation does support this, via iterate_search_tree (https://github.com/maxmind/MaxMind-DB-Reader-perl/blob/3fade689fa12708981fe70c5419a12a55561508a/lib/MaxMind/DB/Reader/PP.pm).

I tried to implement a Python version of this function, but I'm not sure my results are consistent with what I should get. Here is the code:

def iterate_search_tree(
        self,
        data_callback=lambda: None,
        ip_version=6,
        ):
    if ip_version == 6:
        max_bit_depth = 128
    elif ip_version == 4:
        max_bit_depth = 32
    else:
        raise ValueError('Invalid ip version')

    start_node = self._start_node(max_bit_depth)

    self._iterate_search_tree(
        data_callback=data_callback,
        start_node=start_node,
        bit_depth=1,
        max_bit_depth=max_bit_depth,
        decimal_ip=0,
    )

def _iterate_search_tree(
        self,
        data_callback=lambda: None,
        start_node=0,
        bit_depth=1,
        max_bit_depth=128,
        decimal_ip=0,
        ):

    for bit in [0, 1]:
        node = self._read_node(start_node + bit_depth, bit)

        if bit:
            decimal_ip = decimal_ip | 1 << (max_bit_depth - bit_depth)

        if node < self._metadata.node_count:
            if bit_depth > max_bit_depth:
            self._iterate_search_tree(data_callback=data_callback,
                                      start_node=start_node,
                                      bit_depth=bit_depth+1,
                                      max_bit_depth=max_bit_depth,
                                      decimal_ip=decimal_ip)
        elif node > self._metadata.node_count:
            mask = bit_depth + bit
            data = self._resolve_data_pointer(node)
            data_callback(decimal_ip, mask, data)
        else:
            pass

When I call iterate_search_tree from a trivial script that counts entries via data_callback, I get implausible results: a total of 2M+ IPv4/IPv6 entries, whereas an equivalent script using the reference Perl implementation counts 780K of them (both IPv4 and IPv6) for the same database file. I spent time trying to figure out what is going wrong, but I have not managed to debug it.

I was wondering whether you have already considered including this kind of feature in your package. If so, could you check what's wrong in my code and potentially include it in your reader?

Thanks,

oschwald commented 8 years ago

I glanced at this and it seemed reasonable. To debug it, I'd probably start with one of the test databases and compare the output. It should be pretty straightforward to see what is going on. The Go reader also provides this functionality if looking at another implementation would be helpful.
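For comparing output against a known-good result, a minimal sketch of the traversal on a hand-built toy tree may be useful. Everything here is hypothetical: `TOY_NODES`, `TOY_DATA`, `NODE_COUNT`, and `iterate_tree` are illustrative names and are not part of the reader. It assumes the convention used by the snippets in this thread: a record value below the node count points to another node, a value equal to it means "no data", and a larger value points to data. Two things that may be worth checking against the snippet above: the recursion here follows the record that was just read, and each branch derives its own prefix from the parent's prefix instead of modifying a shared variable inside the loop.

```python
# Hypothetical toy tree: each node is a (left, right) pair of record values.
NODE_COUNT = 3
TOY_NODES = [
    (1, 2),  # node 0: both records point to further nodes
    (4, 3),  # node 1: left -> data 4, right -> empty (== NODE_COUNT)
    (5, 6),  # node 2: both records point to data
]
TOY_DATA = {4: "A", 5: "B", 6: "C"}


def iterate_tree(node=0, prefix=0, depth=0, max_depth=4):
    """Yield (prefix, prefix_length, data) for every data record."""
    for bit in (0, 1):
        record = TOY_NODES[node][bit]
        # Derive the child's prefix from the parent's prefix, so the
        # value computed for bit 1 never leaks into another branch.
        child_prefix = prefix | (bit << (max_depth - depth - 1))
        child_depth = depth + 1
        if record < NODE_COUNT:
            # Recurse into the node that was just read.
            yield from iterate_tree(record, child_prefix, child_depth, max_depth)
        elif record > NODE_COUNT:
            yield child_prefix, child_depth, TOY_DATA[record]
        # record == NODE_COUNT: empty branch, nothing to report


for prefix, length, data in iterate_tree():
    print(f"{prefix:04b}/{length} -> {data}")
```

With a tree this small, the expected entries can be worked out by hand and compared with what a candidate implementation reports.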

paravoid commented 5 years ago

So I wanted to get a mapping of autonomous system numbers to names, and surprisingly this wasn't available in an easy-to-consume form (e.g. CSV) anywhere.

So I gave this a stab, starting from the Perl code and then optimizing it and making the interface a bit more Pythonic. This is it:

from ipaddress import IPv4Network, IPv6Network

def __iter__(self):
    if self._metadata.ip_version == 4:
        start_node = self._start_node(32)
        start_network = IPv4Network((0, 0))
    else:
        start_node = self._start_node(128)
        start_network = IPv6Network((0, 0))

    search_nodes = [(start_node, start_network)]
    while search_nodes:
        node, network = search_nodes.pop()

        if network.version == 6:
            naddr = network.network_address
            if naddr.ipv4_mapped or naddr.sixtofour:
                # skip IPv4-Mapped IPv6 and 6to4 mapped addresses, as these are
                # already included in the IPv4 part of the tree below
                continue
            elif int(naddr) < 2 ** 32 and network.prefixlen == 96:
                # once in the IPv4 part of the tree, switch to IPv4Network
                ipnum = int(naddr)
                mask = network.prefixlen - 128 + 32
                network = IPv4Network((ipnum, mask))

        subnets = list(network.subnets())
        for bit in (0, 1):
            next_node = self._read_node(node, bit)
            subnet = subnets[bit]

            if next_node > self._metadata.node_count:
                data = self._resolve_data_pointer(next_node)
                yield (subnet, data)
            elif next_node < self._metadata.node_count:
                search_nodes.append((next_node, subnet))

Note: the above code snippet is Copyright 2018 Faidon Liambotis and dual-licensed under 1) the Apache License, Version 2.0 (SPDX identifier Apache-2.0) as published by the Apache Software Foundation: https://www.apache.org/licenses/LICENSE-2.0, 2) the 0-clause BSD license (SPDX identifier 0BSD), as published e.g. by the Open Source Initiative: https://opensource.org/licenses/0BSD.

oschwald commented 5 years ago

At a high level, this looks great! The one thing that I am not sure about is the pruning of the IPv4-mapped IPv6 networks. Although the MaxMind databases do map these (and 2001::/32 to IPv4), it is not part of the format and may not be the right thing to do on all databases. Perhaps this could be made optional somehow.
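One way the pruning could be made optional is sketched below, under stated assumptions. The names `ALIASED_IPV4_NETWORKS`, `is_ipv4_alias`, `filter_networks`, and the `skip_aliased` flag are hypothetical and not part of the reader. The alias ranges are the ones mentioned in this thread (IPv4-mapped, 6to4, and 2001::/32); a database that does not alias them would want `skip_aliased=False`. Note that this filters the yielded pairs after the fact rather than pruning during the tree walk, which is simpler but does not save traversal time.

```python
from ipaddress import IPv6Network, ip_network

# Assumption: the IPv6 ranges that alias the IPv4 tree, as named in this thread.
ALIASED_IPV4_NETWORKS = (
    IPv6Network("::ffff:0:0/96"),  # IPv4-mapped IPv6
    IPv6Network("2002::/16"),      # 6to4
    IPv6Network("2001::/32"),      # the 2001::/32 range mentioned above
)


def is_ipv4_alias(network):
    """Return True if an IPv6 network lies inside one of the alias ranges."""
    if network.version != 6:
        return False
    return any(network.subnet_of(alias) for alias in ALIASED_IPV4_NETWORKS)


def filter_networks(pairs, skip_aliased=True):
    """Yield (network, data) pairs, optionally dropping aliased networks."""
    for network, data in pairs:
        if skip_aliased and is_ipv4_alias(network):
            continue
        yield network, data


sample = [
    (ip_network("::ffff:102:300/120"), "mapped"),
    (ip_network("2002:102:300::/40"), "6to4"),
    (ip_network("2001:db8::/32"), "native v6"),
    (ip_network("1.2.3.0/24"), "v4"),
]
kept = [data for _, data in filter_networks(sample)]
```

`IPv6Network.subnet_of` requires Python 3.7 or later.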

Also, if we do add this to the reader, I think we will want to update the C extension to also support it. libmaxminddb should have the required functionality already. The Perl XS reader supports iterating over the databases.

volans- commented 1 year ago

Is there any news on this feature? I was looking for a way to extract a mapping of ASN -> list of prefixes, and because AFAIK it's not available in any of the offered DBs, I ended up here. I can confirm that the patch proposed above by paravoid still works with the current version on PyPI: on my laptop, it took 2.5 minutes and 750MB of RAM to build a mapping ASN -> set(prefixes) for the whole GeoIP2-ISP.mmdb DB.
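A minimal sketch of that grouping step, assuming an iterable of (network, data) pairs such as the one yielded by the patch above. The function name `prefixes_by_asn` is hypothetical, and the record key `autonomous_system_number` is an assumption about the database's record layout; the sample data uses documentation-reserved ASNs and prefixes, not real database contents.

```python
from collections import defaultdict
from ipaddress import ip_network


def prefixes_by_asn(records):
    """Group networks by ASN from an iterable of (network, data) pairs."""
    mapping = defaultdict(set)
    for network, data in records:
        # Assumed key name; records without an ASN are skipped.
        asn = data.get("autonomous_system_number")
        if asn is not None:
            mapping[asn].add(network)
    return dict(mapping)


# Made-up sample records standing in for the iterator's output.
sample_records = [
    (ip_network("192.0.2.0/24"), {"autonomous_system_number": 64496}),
    (ip_network("198.51.100.0/24"), {"autonomous_system_number": 64497}),
    (ip_network("2001:db8::/32"), {"autonomous_system_number": 64496}),
    (ip_network("203.0.113.0/24"), {"isp": "Example ISP"}),
]
asn_prefixes = prefixes_by_asn(sample_records)
```

Keeping the networks in a set per ASN means the memory footprint grows with the number of distinct prefixes, which is consistent with a whole-database run needing a noticeable amount of RAM.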

oschwald commented 1 year ago

There is no news on this feature. To be included in the reader, we would still need to address everything in my previous comment and add appropriate tests. With regard to skipping the aliased nodes, it would probably be better if it were implemented more like the Go reader and made optional (though defaulting to skipping makes sense).

oschwald commented 11 months ago

Thanks for the request. We have implemented this and it will be in our next release.