is there an out-of-the-box way to create an associative address trie that supports both IPv4 and IPv6?

SamerN88 commented 1 year ago

I'm trying to create an address trie that maps CIDRs to strings, and can do an IP lookup that gets multiple results.

I want to be able to efficiently search for a given IP address (both IPv4 and IPv6 supported), and get back the values associated with ALL the CIDRs that contain the IP address. I see there's an IPv4AddressTrie and an IPv6AddressTrie, as well as an abstract AddressTrie, but I'm not sure if there's an out-of-the-box way to do this. I would imagine something like:

AddressTrie<IPAddress, Value> trie = new AddressTrie<>();

IPAddress ip1 = new IPAddressString("1.2.3.4").getAddress();
IPAddress ip2 = new IPAddressString("1:2:3:4:5:6:7:8").getAddress();

List<Value> matchedValues1 = trie.prefixOfValues(ip1).collect(Collectors.toList());
List<Value> matchedValues2 = trie.prefixOfValues(ip2).collect(Collectors.toList());

I know this isn't exactly how AddressTrie is instantiated, but I'm just illustrating my goal here. Please let me know if there is already a solution implemented for this. Much appreciated.

seancfoley commented 1 year ago

The short answer is a qualified yes, although I don't think I have such a type in the library. I will have to check. I may have created something like that for a test.

While all the subnets/addresses in a trie must have the same bit-size, at least for the tries in this library, it is fairly trivial to create a binary tree that is a wrapper of two tries, one for IPv4 and one for IPv6. So any operation just defers to one or the other, depending on whether the operation argument is IPv4 or IPv6. I think I may have done something like this before, so I will take a look.

SamerN88 commented 1 year ago

it is fairly trivial to create a binary tree that is a wrapper of two tries, one for IPv4 and one for IPv6. So any operation just defers to one or the other, depending on whether the operation argument is IPv4 or IPv6.

That sounds like a straightforward enough solution, thanks. Is there any reason why one should prefer this over just using an IPv6AddressTrie and then converting all IPv4 addresses to IPv6 as they're passed in?

Also, when getting the values of all CIDRs that contain a given IP, what is the best way to convert the result to a list of values? Here's what I mean:

IPv6AddressAssociativeTrie<Value> trie = new IPv6AddressAssociativeTrie<>();
IPv6Address ip = new IPAddressString("1:2:3:4:5:6:7:8").getAddress();

trie.elementsContaining(ip)  // how to convert this to a List<Value>?

The only thing I could come up with was this, as .stream().collect() doesn't seem compatible:

List<Value> values = new ArrayList<>();
trie.forEach(addr -> values.addAll(trie.get(addr)));

seancfoley commented 1 year ago

Converting all IPv4 addresses to IPv6 IPv4-mapped addresses is another option, and it would work pretty well.

Yeah, I've had plans to change the return value of elementsContaining to a linked list. The Go library does that, in fact.

You can use this code to traverse the result:

        IPv4AddressAssociativeTrie.IPv4AssociativeTrieNode<Info> ipv4AssociativeTrieNode = trie.elementsContaining(ipv4Address);
        if (ipv4AssociativeTrieNode != null) {
            System.out.println("\nfound: " + ipv4AssociativeTrieNode.toTreeString(true, false));
            // ipv4AssociativeTrieNode is a linked list, go to the end
            while(true) {
                IPv4AssociativeTrieNode<Info> nextNode = ipv4AssociativeTrieNode.getLowerSubNode();
                if(nextNode == null) {
                    nextNode = ipv4AssociativeTrieNode.getUpperSubNode();
                    if(nextNode == null) {
                        break;
                    }
                }
                ipv4AssociativeTrieNode = nextNode;
            }
            // ipv4AssociativeTrieNode is at the end
}

To go in the reverse direction, use getParent().

No, the nodes of such a list or of a tree are not convertible to a stream. Nor the address keys of each node.

But that block of code you posted would work (trie.forEach), because it does provide an iterator, and it is "iterable", so the "forEach" method of Iterable is available. There are a number of iterators in fact.

If there is a way to convert iterators to streams (I am guessing there is, maybe check stackoverflow), then you can probably use a collect method when you have the resulting stream.

seancfoley commented 1 year ago

Actually, one way to get a stream is to create a new trie using asNewTrie https://seancfoley.github.io/IPAddress/IPAddress/apidocs/inet/ipaddr/format/util/AddressTrie.TrieNode.html#asNewTrie--

Then you can covert to a set or a map using asSet (or asMap for associative tries), then when you have a map or set there are methods to produce streams.

But this is a bit expensive because the asNewTrie method creates a new trie (which is actually a list when it came from elementContains).

seancfoley commented 1 year ago

I did not notice there is actually a spliterator method. So you can create a stream with java.util.stream.StreamSupport.stream(spliterator, false);

SamerN88 commented 1 year ago

Thanks a lot for the info. Could we assume that using forEach is not more expensive than spliterator?

Also, a question about efficiency: How exactly does elementsContaining get all the CIDRs that contain a given IP? I assume it would exploit the trie structure (i.e. does not just iterate over all nodes linearly).

seancfoley commented 1 year ago

Yes, elementsContaining is very efficient, it does a single trie lookup, and as it navigates a branch of the trie, it simply saves the branch. The trie lookup is efficient, it is a binary search.

After that, you basically have a linked list encoded as a trie (it's a list because each trie node has at most one subnode). It is the branch path that was traversed on the "elementsContaining" call.

At that point, you just have a list.

Anything I say now is really not specific to this library, it applies to lists/streams/iterators in general, and data structures in general.

An iterator will simply follow the list. A spliterator partitions the set of elements to be iterated, so that you can iterate the partitions at the same time, concurrently. In this case, the spliterator is not more efficient because it is not easy to split a linked list (you have to actually traverse the list to split into partitions). An array list is easy to partition, a linked list is not. The spliterator would be more efficient if it were a real binary tree and not just a single branch, because then it is easy to partition.

Also note that the list is typically short. On average it is log2(size of trie). So for a trie with 1000 addresses, the list is usually 10 nodes long, which is short.

The most efficient code for short lists is the block of code I wrote for you, or the use of an iterator. Using streams and spliterators is never efficient for small collections, they have too much overhead when setting up and cleaning up. Streams and spliterators are only more efficient when you have large collections and the spliterator is able to split up the collection well enough that you can traverse the stream in parallel with multiple threads. Also, when you have lots of individual operations on the collection, streams are useful for writing code that is concise and clear.

So, in this case, while using streams and spliterators can provide concise and pretty code, they are less efficient. These lists are nowhere near long enough for streams or spliterators to be efficient, and in addition, linked lists are not good candidates for parallel streams because they are difficult to split into partitions.

Using an iterator, or traversing the list with the block of code I provided, is by far the most efficient.

SamerN88 commented 1 year ago

This was all very useful.

The solution I landed on was to use an IPv6AddressAssociativeTrie under the hood of my custom data structure. Then inside my custom .add() method, every IP/CIDR passed in is converted to IPv6 (i.e. IPv4 addresses are mapped to IPv6). After running some tests it seems to be working exactly as I intended.

For my method that should return the values associated with all CIDRs that contain a given IP, I used your elementsContaining method as you've confirmed it is very efficient by exploiting the trie structure. Then, I use forEach to collect the values in a list (which is returned), as you've explained that forEach is by far the most efficient way due to it using an iterator.

Thus far I have not noticed any caveats with my implementation; if you can think of any caveats, I'd love to hear them! Thanks a lot for you help.

seancfoley commented 1 year ago

There are no caveats, really.

The only possible caveat is if you had the unusual situation in which you wanted to distinguish an IPv4-mapped IPv6 address from the associated IPv4 address.

In other words, if you wanted to distinguish "::ffff:102:304" from 1.2.3.4, your implementation could not do so.

However, for all intensive purposes, those two are the same address, so that is not much of a concern. An IPv4-mapped IPv6 address is the same as its associated IPv4 address, and in fact some programming languages treat them as the same address as well (like, for instance, the Go and Java languages).

seancfoley commented 8 months ago

I kept this open since I intended to add a dual IPv4/v6 trie type. That is now a part of version 5.5.0 with the type DualIPv4v6Tries. Closing.

SamerN88 commented 8 months ago

That's awesome, thank you!

seancfoley / IPAddress

is there an out-of-the-box way to create an associative address trie that supports both IPv4 and IPv6? #103