I think if the annotation logic had a unique annotation for addresses that are not found in the RV dataset, then we could easily find out if there are "leaks" in the annotation logic. IOW, an unannotated address would imply a bug in the annotation logic. Hope this makes sense.
Yes, @SaiedKazemi - this was the intention of the `Missing` bool: to help differentiate between affirmative searches that found no matches and failures to annotate anything. What we find in the data is that both IP-not-found-in-RV and IP-found-in-RV-but-not-annotated are currently reported as `Missing=true`. So, something is missing from the logic to provide the distinction you're describing (or some other bug is preventing lookups that should be found).
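As a reference point for this distinction, here is a minimal sketch (hypothetical types and field names, not the actual uuid-annotator schema or logic) of how a lookup result can keep "searched and found nothing" separate from "lookup failed":

```go
// Hypothetical sketch only: illustrates the Missing-vs-failure distinction
// discussed above; not the real uuid-annotator types or logic.
package main

import (
	"errors"
	"fmt"
)

// Network mimics the idea of the client.Network annotation.
type Network struct {
	ASNumber uint32
	Missing  bool // true only when RouteViews was searched and no prefix matched
}

// annotate keeps an affirmative "no match" (Missing=true, nil error) separate
// from a failure to run the lookup at all (non-nil error).
func annotate(lookup func(ip string) (asn uint32, found bool, err error), ip string) (Network, error) {
	asn, found, err := lookup(ip)
	if err != nil {
		return Network{}, fmt.Errorf("lookup failed for %s: %w", ip, err)
	}
	if !found {
		return Network{Missing: true}, nil
	}
	return Network{ASNumber: asn}, nil
}

func main() {
	noMatch := func(string) (uint32, bool, error) { return 0, false, nil }
	broken := func(string) (uint32, bool, error) { return 0, false, errors.New("dataset not loaded") }
	fmt.Println(annotate(noMatch, "192.0.2.1")) // {0 true} <nil>
	fmt.Println(annotate(broken, "192.0.2.1"))  // {0 false} lookup failed for 192.0.2.1: dataset not loaded
}
```

With something like this in place, `Missing=true` would always mean the address genuinely has no RouteViews match, and any failure to run the lookup would surface as an error instead.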
Looking more closely at the distribution of these missing annotations:
```sql
WITH counts AS (
  SELECT server.Site as site, count(*) as total
  FROM `measurement-lab.ndt.ndt7`
  WHERE client.Network.Missing = true and date = "2023-01-21"
  GROUP BY site
  ORDER BY total DESC
)
SELECT site, total, total / (SUM(total) OVER()) ratio
FROM counts
ORDER BY ratio desc
```
The top ten sites with missing network annotations:
Row | site | total | ratio
---|---|---|---
1 | yul06 | 111134 | 0.16018137765728213
2 | yul03 | 110083 | 0.15866653406380216
3 | yul05 | 109992 | 0.15853537253477581
4 | yul04 | 109433 | 0.15772966599932833
5 | sin01 | 18400 | 0.026520572902028104
6 | yul02 | 17157 | 0.0247289928956574
7 | yul07 | 8949 | 0.012898511244578777
8 | del02 | 5258 | 0.0075785419738512915
9 | dfw02 | 5179 | 0.0074646764706306273
10 | dfw08 | 4995 | 0.007199470
~67% from YUL sites.
Checking to see if there is a common network origin (looking only at IPv4 addresses):
```sql
WITH counts AS (
  SELECT
    -- the client's /8 prefix (aliased "site" to reuse the shape of the query above)
    NET.IP_TO_STRING(NET.IP_TRUNC(NET.IP_FROM_STRING(raw.ClientIP), 8)) as site,
    count(*) as total, COUNTIF(client.Network.Missing = true) as missing_total
  FROM `measurement-lab.ndt.ndt7`
  WHERE date = "2023-01-21" AND server.Site LIKE '%yul%' AND raw.ClientIP NOT LIKE '%:%'
  GROUP BY site
  ORDER BY total DESC
)
SELECT site, total, missing_total, missing_total / (SUM(total) OVER()) ratio
FROM counts
ORDER BY ratio desc
```
~90% of missing annotations come from a network within 12.0.0.0/8, which is owned by AT&T and should be announced. A manual spot check of a recent routeviews-rv2-20230101-1200.pfx2as file shows that 12.0.0.0/8 is present (with many 12.* subnetworks), yet these addresses are evidently not found.
This may be a deeper bug in the uuid-annotator... but strangely scoped to certain networks over others.
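For anyone who wants to reproduce that spot check, here is a rough sketch (my own, not uuid-annotator code) that scans a local pfx2as file, assumed to use the tab-separated `prefix`, `prefix-length`, `ASN` line format, and prints every prefix containing a given address:

```go
// Rough sketch of the manual spot check: scan a local pfx2as file, assumed
// to contain tab-separated "prefix<TAB>prefix-length<TAB>ASN" lines, and
// print every prefix that contains a given address. File name and test
// address are examples.
package main

import (
	"bufio"
	"fmt"
	"net/netip"
	"os"
	"strings"
)

func main() {
	ip := netip.MustParseAddr("12.189.157.1") // example client address

	f, err := os.Open("routeviews-rv2-20230101-1200.pfx2as")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text()) // prefix, prefix-length, ASN
		if len(fields) != 3 {
			continue
		}
		pfx, err := netip.ParsePrefix(fields[0] + "/" + fields[1])
		if err != nil {
			continue
		}
		if pfx.Contains(ip) {
			fmt.Printf("%s -> AS%s\n", pfx, fields[2])
		}
	}
	if err := sc.Err(); err != nil {
		panic(err)
	}
}
```

If this prints 12.0.0.0/8 (or a longer 12.* prefix) for an address the pipeline marked `Missing=true`, that address is a false negative of the lookup rather than a gap in the dataset.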
@stephen-soltesz is the /8 the longest matching prefix for those IPs?
@phillipa - I did not confirm, but I would predict not. I used /8 as a quick check to see if there was a visible high-level pattern, maybe evenly distributed (e.g. random), or if the addresses were from squatted networks. This was a surprise.
I think you'd be hard pressed to find a squatted /8.
FWIW 4.0.0.0/8 and 212.0.0.0/8 have a fair number of missing annotations. Going to see if there are some common /16s or /24s.
```sql
SELECT NET.IP_TO_STRING(NET.IP_TRUNC(NET.IP_FROM_STRING(raw.ClientIP), 8)),
  SAFE_DIVIDE(COUNTIF(client.Network.Missing = True), count(*)) as ratio_missing,
  count(*) as total_tests
FROM `measurement-lab.ndt.ndt7`
WHERE date = "2023-01-21"
  AND raw.ClientIP NOT LIKE '%:%'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1000;
```
```sql
SELECT NET.IP_TO_STRING(NET.IP_TRUNC(NET.IP_FROM_STRING(raw.ClientIP), 16)),
  SAFE_DIVIDE(COUNTIF(client.Network.Missing = True), count(*)) as ratio_missing,
  count(*) as total_tests
FROM `measurement-lab.ndt.ndt7`
WHERE date = "2023-01-21"
  AND raw.ClientIP NOT LIKE '%:%'
  AND (raw.ClientIP LIKE '12.%' OR raw.ClientIP like '4.%' OR raw.ClientIP like '212.%')
GROUP BY 1
ORDER BY 3 DESC, 2 DESC
LIMIT 1000;
```
Looks like that one /16 (12.189.0.0/16) accounts for a lot of the missing annotations. According to RIPEstat, its longest covering prefix is the /9 that AT&T announces. https://stat.ripe.net/ui2013/12.189.0.0#tabId=at-a-glance
The 4.78.0.0/16 prefix is covered by a /9 announced by Level3.
You might look at the cases where the prefix does get resolved to see if there are some dynamics that might explain it.
I have confirmed that the bug is in the uuid-annotator's lookup. For example, in the case of 12.189.157.x, it correctly finds that there are no /24 prefixes containing this address but fails to find the next longest prefix (/23 to /8) that does contain it. These are false-negative errors. Based on our annotations, this occurs in about 10% of rows. When the lookup finds a match, the IP is in fact in the identified network. These are true-positives. There are no false-positives. The set of true-negatives is obscured by the false-negatives. (A generic longest-prefix-match sketch follows the table below.)
I'm preparing a fix so the next longest prefixes are searched correctly.
```sql
SELECT
  NET.IP_TO_STRING(NET.IP_TRUNC(NET.IP_FROM_STRING(raw.ClientIP), 24)) as network,
  SAFE_DIVIDE(COUNTIF(client.Network.Missing = True), count(*)) as ratio_missing,
  count(*) as total_tests
FROM `measurement-lab.ndt.ndt7`
WHERE date = "2023-01-21"
  AND raw.ClientIP NOT LIKE '%:%'
  AND raw.ClientIP LIKE '12.189.%'
GROUP BY network
ORDER BY total_tests DESC
```
Row | network | ratio_missing | total_tests
---|---|---|---
1 | 12.189.157.0 | 1.0 | 468007
2 | 12.189.112.0 | 1.0 | 4
3 | 12.189.234.0 | 1.0 | 3
4 | 12.189.88.0 | 1.0 | 2
5 | 12.189.113.0 | 1.0 | 2
6 | 12.189.47.0 | 1.0 | 2
7 | 12.189.124.0 | 1.0 | 2
8 | 12.189.136.0 | 1.0 | 1
9 | 12.189.247.0 | 1.0 | 1
10 | 12.189.40.0 | 1.0 | 1
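For context on the lookup behavior described above: a correct longest-prefix match over a flat list of CIDR prefixes keeps the most specific prefix that contains the address, so when no /24 matches it still falls back to any containing /23 through /8. A generic sketch of that behavior (the prefix list is illustrative, not the actual routeview contents or the uuid-annotator/PR code):

```go
// Generic longest-prefix-match sketch: among all prefixes that contain the
// address, keep the one with the largest prefix length. Illustrative only.
package main

import (
	"fmt"
	"net/netip"
)

// longestMatch returns the most specific prefix containing ip,
// or ok=false if none contains it.
func longestMatch(prefixes []netip.Prefix, ip netip.Addr) (best netip.Prefix, ok bool) {
	for _, p := range prefixes {
		if p.Contains(ip) && (!ok || p.Bits() > best.Bits()) {
			best, ok = p, true
		}
	}
	return best, ok
}

func main() {
	// Hypothetical prefix list: no /24 covers the address, but a /8 and a /9 do.
	prefixes := []netip.Prefix{
		netip.MustParsePrefix("12.0.0.0/8"),
		netip.MustParsePrefix("12.128.0.0/9"),
		netip.MustParsePrefix("12.1.2.0/24"), // unrelated /24
	}
	ip := netip.MustParseAddr("12.189.157.1")
	if p, ok := longestMatch(prefixes, ip); ok {
		fmt.Println("longest match:", p) // 12.128.0.0/9
	} else {
		fmt.Println("no match (would be reported as Missing=true)")
	}
}
```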
Any idea why the error is there for this prefix but not others?
@phillipa please see the PR description of https://github.com/m-lab/uuid-annotator/pull/49 -- false-positives are possible in principle, so unfortunately this is worse than I thought. I will run a trial next week to estimate the percentage of annotations that might be incorrectly labeled. Since there are fewer short prefixes, I'm optimistic that misattributions will be proportionately rarer, but I can't be sure without a test.
You mean the existing annotations could be wrong, correct? The method in the PR seems like it should work fine.
I'm not sure of the details of your implementation, but if this does end up slow you might consider stepping down to binary operations for more of the comparisons. Routers do this stuff basically at line rate. There are optimization opportunities if needed.
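As an illustration of the binary-operations point (a generic sketch, not a comment on the specific code in the PR): for IPv4, a prefix membership test reduces to a single mask-and-compare on 32-bit integers, which is essentially what routers do at line rate.

```go
// Mask-and-compare prefix test for IPv4: ip is inside network/bits iff
// (ip & mask) == (network & mask). Illustrative sketch only.
package main

import "fmt"

// ipv4 packs four octets into a uint32.
func ipv4(a, b, c, d uint32) uint32 { return a<<24 | b<<16 | c<<8 | d }

// contains reports whether ip falls inside network/bits (0 <= bits <= 32).
func contains(network uint32, bits uint, ip uint32) bool {
	mask := ^uint32(0) << (32 - bits) // e.g. bits=9 -> 0xFF800000
	return ip&mask == network&mask
}

func main() {
	ip := ipv4(12, 189, 157, 1)
	fmt.Println(contains(ipv4(12, 128, 0, 0), 9, ip)) // true: inside 12.128.0.0/9
	fmt.Println(contains(ipv4(12, 1, 2, 0), 24, ip))  // false: outside 12.1.2.0/24
}
```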
Yes, I believe so, but the extent is unknown (the trial I propose would give an estimate). Additionally, I've realized that the method in the uuid-annotator was a reimplementation of the method in the original annotation service - https://github.com/m-lab/annotation-service/blob/master/asn/asn-annotator.go - so I'll need to take a closer look there to see if this is limited to the uuid-annotator or possibly the complete history of M-Lab's data.
Right. If the IP to AS mapping is buggy you will likely need to rerun it everywhere it was used.
Continuing investigation: annotations generated by the annotation-service are not affected by this bug. The annotation-service parsed the routeview prefixes in order to generate a sequence of network ranges that provided complete coverage and ASN attribution for all sub-ranges. For example, a prefix containing a more-specific sub-prefix attributed to a different ASN becomes a sequence of three ranges: the addresses before the sub-prefix, the sub-prefix itself, and the addresses after it.
This set of ranges was searched using a binary search. The uuid-annotator did not preserve the range parsing and instead uses only the original CIDR prefixes, which leads to the possibility of missing or misattributed annotations.
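Here is a sketch of that range-based approach with hand-built data (the annotation-service derives its ranges from routeviews; the prefixes and ASNs below are illustrative, using private-use/documentation values): nested prefixes are flattened into non-overlapping ranges, and a lookup becomes a binary search.

```go
// Sketch of the disjoint-range approach: a covering prefix with a nested
// sub-prefix from another ASN is flattened into non-overlapping [lo, hi]
// ranges, then a lookup is a binary search. Example data only.
package main

import (
	"fmt"
	"sort"
)

type ipRange struct {
	lo, hi uint32 // inclusive IPv4 range as 32-bit integers
	asn    uint32
}

// ipv4 packs four octets into a uint32.
func ipv4(a, b, c, d uint32) uint32 { return a<<24 | b<<16 | c<<8 | d }

// lookup finds the range containing ip; ranges must be sorted by lo and
// non-overlapping.
func lookup(ranges []ipRange, ip uint32) (uint32, bool) {
	i := sort.Search(len(ranges), func(i int) bool { return ranges[i].hi >= ip })
	if i < len(ranges) && ranges[i].lo <= ip {
		return ranges[i].asn, true
	}
	return 0, false
}

func main() {
	// e.g. 10.0.0.0/8 attributed to AS64500 with 10.1.0.0/16 attributed to
	// AS64501 becomes three ranges:
	ranges := []ipRange{
		{ipv4(10, 0, 0, 0), ipv4(10, 0, 255, 255), 64500},
		{ipv4(10, 1, 0, 0), ipv4(10, 1, 255, 255), 64501},
		{ipv4(10, 2, 0, 0), ipv4(10, 255, 255, 255), 64500},
	}
	asn, ok := lookup(ranges, ipv4(10, 1, 2, 3))
	fmt.Println(asn, ok) // 64501 true
}
```

Because every address in the covering prefix falls into exactly one range, this structure avoids the missing and misattributed annotations described above.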
The uuid-annotator with ASN annotations was first deployed to the production platform on March 10, 2020 as part of the https://github.com/m-lab/k8s-support/releases/tag/v2.9.0 release. So, all annotations collected from 2020-03-10 to today are affected.
I will work out a method to regenerate and correct these annotations and estimate how much effort this will take.
The update to uuid-annotator was deployed as part of https://github.com/m-lab/k8s-support/releases/tag/v2.35.3 on 2023-02-08 and completed rollout on 2023-02-09. So the first date with fully restored annotations is 2023-02-10.
According to our monitoring, the proportion of successful annotations has increased in all datasets (note: Prometheus dates represent data collected two days earlier). In particular, the ndt5/ndt7 datasets are above 99%. The scamper1/tcpinfo datasets are above 98%.
Next, I'll add an entry to github.com/m-lab/data-annotations and send a note to discuss@ with a summary of the outage and expected timeline for fixes.
That's great. But do we know anything about the accuracy of the prior annotations? It's possible the bug only became visible when the result was not finding an ASN, but there may be cases where it caused a wrong mapping.
I can provide an estimate based on the prototype for the historical annotation repair. We knew that ~10% of annotations were missing from ndt7 (less for other datasets) and that in principle misattributed ASNs were possible.
From limited trials with the prototype, we can estimate (with some caveats) that misattribution occurred significantly less often, both for all annotations in general and for ndt7 specifically. After running the prototype fixer on ~500k total annotations in both cases, we see ~7-10% missed and ~1-2% misattributed.
Dataset | total | missed | wrongasn
---|---|---|---
All annotations | 500283 | 37168 (~7.4%) | 6361 (~1.3%)
NDT7 annotations | 515202 | 50564 (~9.8%) | 8705 (~1.7%)
The corrected lookup logic in the uuid-annotator has been complete since 2023-02-10.
The reannotation and reprocessing of the `ndt.annotation2` and `ndt.hopannotation2` data is now complete for all ndt data. See the last update to https://github.com/m-lab/data-annotations/issues/34#issuecomment-1538832929.
Recently, we discovered a bug in our annotation monitoring (https://github.com/m-lab/prometheus-support/pull/961) that underestimated the rate of missing annotations.
After deploying this fix, we've discovered that the true network annotation rate hovers around 90% (10% missing). The network annotations are derived from the RouteViews dataset. See: https://github.com/m-lab/dev-tracker/issues/752
What is the root cause of the 10% missing annotations?
I spot checked a few `raw.ClientIP` addresses from the ndt7 table with `client.Network.Missing=true` and looked for the associated network in the raw RV files (those used by the uuid-annotator). Both IPs were marked `Missing=true`. So, there may be multiple causes of failed annotations. The success rate can likely be improved. An estimate of the false-negative rate (i.e. Missing=true when the network is in the RV dataset) would help estimate the value of debugging and resolution.