This change corrects the search algorithm used by the uuid-annotator's routeview IP prefix to AS index to reliably find the longest prefix match. The routeview IP prefix to AS dataset provides network annotations for M-Lab's measurement platform.
Previously, the implementation erroneously sorted all network prefixes together and used a binary search to identify a prefix that contained a given IP. Unfortunately, this is incorrect. Network prefixes in the dataset are not mutually exclusive nor the same size. In particular, short prefixes (e.g. 12.0.0.0/9) may contain many (partial) longer prefixes (e.g. 12.189.156.0/24) and be associated with different ASNs. As a result, some searches never find the correct prefix, either because the search lands among longer prefixes that do not contain the given IP (~10% of the time), or (possibly) because an IP is misattributed to a shorter prefix when a longer prefix would have been a better match (unknown %).
Now, network prefixes are grouped into sets of the same block size. Network blocks of the same size cannot overlap. The search checks the index from longest to the shortest block (e.g. 32 to 8), searching for a network prefix of that block size that contains the given IP. If one is found, the search stops. Otherwise the search continues with the next shortest block size. This guarantees that the longest prefix will be found before shorter prefixes. And, if no prefix is found over the complete index, then we know that all possible networks were searched.
This change corrects the search algorithm used by the uuid-annotator's routeview IP prefix to AS index to reliably find the longest prefix match. The routeview IP prefix to AS dataset provides network annotations for M-Lab's measurement platform.
Previously, the implementation erroneously sorted all network prefixes together and used a binary search to identify a prefix that contained a given IP. Unfortunately, this is incorrect. Network prefixes in the dataset are not mutually exclusive nor the same size. In particular, short prefixes (e.g. 12.0.0.0/9) may contain many (partial) longer prefixes (e.g. 12.189.156.0/24) and be associated with different ASNs. As a result, some searches never find the correct prefix, either because the search lands among longer prefixes that do not contain the given IP (~10% of the time), or (possibly) because an IP is misattributed to a shorter prefix when a longer prefix would have been a better match (unknown %).
Now, network prefixes are grouped into sets of the same block size. Network blocks of the same size cannot overlap. The search checks the index from longest to the shortest block (e.g. 32 to 8), searching for a network prefix of that block size that contains the given IP. If one is found, the search stops. Otherwise the search continues with the next shortest block size. This guarantees that the longest prefix will be found before shorter prefixes. And, if no prefix is found over the complete index, then we know that all possible networks were searched.
Resolves
This change is