Annotation failures - Githubissues

stephen-soltesz commented 1 year ago

Recently, we discovered a bug in our annotation monitoring https://github.com/m-lab/prometheus-support/pull/961 that underestimated the rate of missing annotations.

After deploying this fix, we've discovered that the true network annotation rate hovers around 90% (10% missing). The network annotations are derived from RouteViews dataset. See: https://github.com/m-lab/dev-tracker/issues/752

What is the root cause of the 10% missing annotations?

Are there tests coming from networks that are truly not found in the RV dataset?
Are the clients from networks that are in the RV dataset, but some other bug is causing annotation failures?

I spot checked a few raw.ClientIP addresses from the ndt7 table with client.Network.Missing=true and looked for the associated network in the raw RV files (those used by the uuid-annotator). Both IPs were marked Missing=true,

I found one in the raw text RV dataset for 2023-01-0[123] - false-missing (so a failure to annotate for unknown reasons)
For a second IP, I did not find it in the RV dataset - true-missing

So, there may be multiple causes of failed annotations. The success rate can likely be improved. An estimate of the false-negative rate (i.e. missing=true when the network is in RV dataset) would help estimate the value of debugging and resolution.

SaiedKazemi commented 1 year ago

I think if annotation logic had a unique annotation for addresses that are not found in the RV dataset, then we can easily find out if there are "leaks" in the annotation logic. IOW, an unannotated address would imply a bug in the annotation logic. Hope this makes sense.

stephen-soltesz commented 1 year ago

Yes, @SaiedKazemi - this was the intention of the Missing bool - to help differentiate between affirmative searches that found no matches and failures to annotate anything. What we find in the data is that both IP-not-found-in-RV and IP-found-in-RV-but-not-annotated are currently reported as Missing=true. So, something is missing from the logic to provide the distinction you're describing (or some other bug preventing lookups that should be found).

stephen-soltesz commented 1 year ago

Looking more closely at the distribution of these missing annotations:

WITH counts AS (
  SELECT server.Site as site, count(*) as total
  FROM `measurement-lab.ndt.ndt7`
  WHERE client.Network.Missing = true and date = "2023-01-21"
  GROUP BY site
  ORDER BY total DESC
)

SELECT site, total, total / (SUM(total) OVER()) ratio
FROM counts
ORDER BY ratio desc

The top ten sites with missing network annotations:

Row	site	total	ratio
1	yul06	111134	0.16018137765728213
2	yul03	110083	0.15866653406380216
3	yul05	109992	0.15853537253477581
4	yul04	109433	0.15772966599932833
5	sin01	18400	0.026520572902028104
6	yul02	17157	0.0247289928956574
7	yul07	8949	0.012898511244578777
8	del02	5258	0.0075785419738512915
9	dfw02	5179	0.0074646764706306273
10	dfw08	4995	0.007199470

~67% from YUL sites.

stephen-soltesz commented 1 year ago

Checking to see if there is a common network origin (looking only at IPv4 addresses):

WITH counts AS (
  SELECT
    NET.IP_TO_STRING(NET.IP_TRUNC(NET.IP_FROM_STRING(raw.ClientIP), 8)) as site,
    count(*) as total, COUNTIF(client.Network.Missing = true) as missing_total
  FROM `measurement-lab.ndt.ndt7`
  WHERE date = "2023-01-21" AND server.Site LIKE '%yul%' AND raw.ClientIP NOT LIKE '%:%'
  GROUP BY site
  ORDER BY total DESC
)

SELECT site, total, missing_total, missing_total / (SUM(total) OVER()) ratio
FROM counts
ORDER BY ratio desc

~90% of missing annotations come from a network within 12.0.0.0/8 - which is owned by AT&T ,which should be announced, and from a manual spot check of a recent routeviews-rv2-20230101-1200.pfx2as file shows that 12.0.0.0/8 is present (with many 12.* subnetworks) but evidently not found.

This may be a deeper bug in the uuid-annotator.. but strangely scoped to certain networks over others.

phillipa commented 1 year ago

@stephen-soltesz is the /8 the longest matching prefix for those IPs?

stephen-soltesz commented 1 year ago

@phillipa - I did not confirm, but I would predict not. I used /8 as a quick check to see if there was a visible high-level pattern, maybe evenly distributed (e.g. random), or if the addresses were from squatted networks. This was a surprise.

phillipa commented 1 year ago

i think you'd be hard pressed to find a squatted /8.

phillipa commented 1 year ago

FWIW 4.0.0.0/8 and 212.0.0.0/8 have a fair number of missing annotations. Going to see if there's some common /16s or /24s.

Top prefixes /8

SELECT NET.IP_TO_STRING(NET.IP_TRUNC(NET.IP_FROM_STRING(raw.ClientIP), 8)), SAFE_DIVIDE(COUNTIF(client.Network.Missing = True), count()) as ratio_missing, count() as total_tests FROM measurement-lab.ndt.ndt7 WHERE date = "2023-01-21" AND raw.ClientIP NOT LIKE '%:%' GROUP BY 1 ORDER BY 2 DESC LIMIT 1000;

phillipa commented 1 year ago

Top prefixes /16

SELECT NET.IP_TO_STRING(NET.IP_TRUNC(NET.IP_FROM_STRING(raw.ClientIP), 16)), SAFE_DIVIDE(COUNTIF(client.Network.Missing = True), count()) as ratio_missing, count() as total_tests FROM measurement-lab.ndt.ndt7 WHERE date = "2023-01-21" AND raw.ClientIP NOT LIKE '%:%' AND (raw.ClientIP LIKE '12.%' OR raw.ClientIP like '4.%' OR raw.ClientIP like '212.%') GROUP BY 1 ORDER BY 3 DESC ,2 DESC LIMIT 1000;

Looks like that one /16 (12.189.0.0/16) is a lot of the missing annotations. According to RIPEStat it's longest covering prefix is the /9 that AT&T announces. https://stat.ripe.net/ui2013/12.189.0.0#tabId=at-a-glance

the 4.78.0.0/16 address is covered by a /9 announced by Level3.

You might look at the cases where sometimes the prefix is getting resolved to see if there's some dynamics that might explain it?

stephen-soltesz commented 1 year ago

I have confirmed that the bug is in uuid-annotator's lookup. For example, in the case of 12.189.157.x, it correctly finds that there are no /24 prefixes containing this address but fails to find the next longest prefix (/23 to /8) that does contain it. These are false-negative errors. Based on our annotations, this occurs about 10% of rows. When the lookup finds a match, the IP is in fact in the identified network. These are true-positives. There are no false-positives. The set of true-negatives is obscured by the false-negatives.

I'm preparing a fix so the next longest prefixes are searched correctly.

SELECT
   NET.IP_TO_STRING(NET.IP_TRUNC(NET.IP_FROM_STRING(raw.ClientIP), 24)) as network,
   SAFE_DIVIDE(COUNTIF(client.Network.Missing = True), count(*)) as ratio_missing,
   count(*) as total_tests
FROM measurement-lab.ndt.ndt7
WHERE date = "2023-01-21"
AND raw.ClientIP NOT LIKE '%:%'
AND raw.ClientIP LIKE '12.189.%'
GROUP BY network
ORDER BY total_tests DESC

Row	network	ratio_missing	total_tests
1	12.189.157.0	1.0	468007
2	12.189.112.0	1.0	4
3	12.189.234.0	1.0	3
4	12.189.88.0	1.0	2
5	12.189.113.0	1.0	2
6	12.189.47.0	1.0	2
7	12.189.124.0	1.0	2
8	12.189.136.0	1.0	1
9	12.189.247.0	1.0	1
10	12.189.40.0	1.0	1

phillipa commented 1 year ago

any idea why the error is there for this prefix but not others?

stephen-soltesz commented 1 year ago

@phillipa please see the PR description of https://github.com/m-lab/uuid-annotator/pull/49 -- false-positives are possible in principle, so unfortunately this is worse than I thought. I will run a trial next week to estimate the percentage of annotations that might be incorrectly labeled. e.g.

take a days worth of IPs and annotations using the old method
annotate the same IPs using the new method and count the number of mismatches.

Since there are fewer short prefixes, I'm optimistic that misattributions will be proportionately rarer. But, I can't be sure without a test.

phillipa commented 1 year ago

You mean the existing annotations could be wrong, correct? the method in the PR seems like it should work fine.

I'm not sure the details on your implementation but if this does end up slow you might consider stepping down to binary operations for more of the comparisons. Routers do this stuff basically at line rate. There's optimization opportunities if needed.

stephen-soltesz commented 1 year ago

Yes, I believe so, but the extent is unknown (the trial I propose would give an estimate). Additionally, I've realized that the method in the uuid-annotator was a reimplementation of the method in the original annotation service - https://github.com/m-lab/annotation-service/blob/master/asn/asn-annotator.go - so I'll need to take a closer look there to see if this is limited to the uuid-annotator or possibly the complete history of M-Lab's data.

phillipa commented 1 year ago

Right. If the IP to AS mapping is buggy you will likely need to rerun it everywhere it was used.

On Sat, Feb 4, 2023 at 12:18 PM Stephen Soltesz @.***> wrote:

Yes, I believe so, but the extent is unknown (the trial I propose would give an estimate). Additionally, I've realized that the method in the uuid-annotator was a reimplementation of the method in the original annotation service - https://github.com/m-lab/annotation-service/blob/master/asn/asn-annotator.go

so I'll need to take a closer look there to see if this is limited to the uuid-annotator or possibly the complete history of M-Lab's data.

— Reply to this email directly, view it on GitHub https://github.com/m-lab/uuid-annotator/issues/48#issuecomment-1416804359, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABHPNEEWFSHAG2C6IWW2OUTWV2FNJANCNFSM6AAAAAATSCC2NU . You are receiving this because you were mentioned.Message ID: @.***>

--

*If you get an email from me outside of the 9-5 it is not because I am always online or expect an immediate response from you; it is because work flexibility http://www.inc.com/john-boitnott/how-flexible-hours-can-create-a-better-work-life-balance.html is important to me. Random evening and weekend emails are a sign that I prioritized personal obligations.

stephen-soltesz commented 1 year ago

Continuing investigation: annotations generated by the annotation-service are not affected by this bug. The annotation-service parsed the routeview prefixes in order to generate a sequence of network ranges that provided complete coverage and ASN attribution for all sub ranges. For example:

12.0.0.0/8 -> A
12.0.48.0/20 -> B

Becomes a sequence of three ranges:

12.0.0.0 - 12.0.47.255 -> A
12.0.48.0 - 12.0.63.255 -> B
12.0.64.0 - 12.255.255.255 -> A

This set of ranges was searched using a binary search. The uuid-annotator did not preserve the range parsing and instead only uses the original CIDR prefixes, which leads to the possibility of missing annotations or misattributed annotations.

The uuid-annotator with ASN annotations was first deployed to the production platform March 10, 2020 as part of the https://github.com/m-lab/k8s-support/releases/tag/v2.9.0 release. So, all annotations collected from 2022-03-10 to today are affected.

I will imagine a method to regenerate and correct these annotations and estimate how much effort this will take.

stephen-soltesz commented 1 year ago

The update to uuid-annotator was deployed as part of https://github.com/m-lab/k8s-support/releases/tag/v2.35.3 on 2023-02-08 and completed rollout on 2023-02-09. So the first date with fully restored annotations is 2023-02-10.

According to our monitoring, the portion of successful annotations has increased in all datasets (Note: prometheus dates represent data collected date-2days earlier). In particular, the ndt5/ndt7 datasets are above 99%. The scamper1/tcpinfo datasets are above 98%.

Next, I'll add an entry to github.com/m-lab/data-annotations and send a note to discuss@ with a summary of the outage and expected timeline for fixes.

phillipa commented 1 year ago

That's great. But do we know anything about the accuracy of the prior annotations? It's possible the bug only became visible when the result was not finding an ASN but there may be cases where it causes a wrong mapping.

On Mon, Feb 13, 2023 at 10:45 AM Stephen Soltesz @.***> wrote:

The update to uuid-annotator was deployed as part of https://github.com/m-lab/k8s-support/releases/tag/v2.35.3 on 2022-02-08 and completed rollout on 2022-02-09. So the first date with fully restored annotations is 2022-02-10.

According to our monitoring, the portion of successful annotations has increased in all datasets (Note: prometheus dates represent data collected date-2days earlier). In particular, the ndt5/ndt7 datasets are above 99%. The scamper1/tcpinfo datasets are above 98%.

[image: Screen Shot 2023-02-13 at 10 37 42 AM] https://user-images.githubusercontent.com/1085316/218502506-21a8c6c5-511f-41c9-89d6-121717e8bd03.png

Next, I'll add an entry to github.com/m-lab/data-annotations and send a note to discuss@ with a summary of the outage and expected timeline for fixes.

— Reply to this email directly, view it on GitHub https://github.com/m-lab/uuid-annotator/issues/48#issuecomment-1428169618, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABHPNEEXMX2Z6UUW3C6MSYLWXJJJ7ANCNFSM6AAAAAATSCC2NU . You are receiving this because you were mentioned.Message ID: @.***>

stephen-soltesz commented 1 year ago

I can provide an estimate based on the prototype for the historical annotation repair. We knew that ~10% of annotations were missing from ndt7 (less for other datasets) and that in principle misattributed ASNs were possible.

From limited trials with the prototype, we can estimate (with some of caveats) that misattribution occurred significantly less often in general for all annotations and for ndt7 specifically. After running the prototype fixer on ~500k total annotations in both cases, we see ~7-10% missed and ~1-2% misattributed.

All annotations: total: 500283 missed: 37168 wrongasn: 6361 NDT7 annotations: total: 515202 missed: 50564 wrongasn: 8705

stephen-soltesz commented 1 year ago

The corrected lookup logic in the uuid-annotator has been complete since 2023-02-10. The reannotation and reprocessing of the ndt.annotation2 and ndt.hopannotation2 data is now complete for all ndt data. See last update to https://github.com/m-lab/data-annotations/issues/34#issuecomment-1538832929.

m-lab / uuid-annotator

Annotation failures #48

Top prefixes /8

Top prefixes /16