m-lab / ndt-server

docker native ndt5 and ndt7 server with prometheus integration
https://www.measurementlab.net/
Apache License 2.0

Roughly 5% of tests have erroneous MeanThroughputMBPS=0 #282

Open gfr10598 opened 4 years ago

gfr10598 commented 4 years ago

Roughly 5% of the ndt5 results show MeanThroughputMbps = 0. However, the tcpinfo connection records show that these are actually good tests, with bandwidth between 0.5 Mb/sec and 4 Gb/sec, a median of 70 Mb/sec, and a mean of 100 Mb/sec.

This query joins ndt5 download and tcpinfo data, and computes an estimate of throughput from the tcpinfo snapshots.

WITH ndt5 AS (
  SELECT
    partition_date,
    result.S2C.*,
    TIMESTAMP_DIFF(result.S2C.EndTime, result.S2C.StartTime, MILLISECOND)/1000 AS duration
  FROM `measurement-lab.ndt.ndt5`
  WHERE result.S2C IS NOT NULL
),

good_ndt5 AS (
  SELECT *
  FROM ndt5
  WHERE duration BETWEEN 9 AND 12
  --AND MeanThroughputMBPS > 0  -- Why are there a lot of zeros?
),

tcpinfo AS (
  SELECT *
  FROM `measurement-lab.ndt.tcpinfo`
),

both AS (
  SELECT
    good_ndt5.* EXCEPT(UUID, partition_date),
    tcpinfo.UUID,
    tcpinfo.partition_date,
    tcpinfo.Client,
    tcpinfo.Server,
    tcpinfo.FinalSnapshot,
    tcpinfo.Snapshots,
    -- Snapshots at roughly 0%, 10%, 25%, 50%, 75%, 90% and 100% of the connection.
    Snapshots[OFFSET(0)] AS s0,
    Snapshots[OFFSET(DIV(ARRAY_LENGTH(tcpinfo.Snapshots)-1,10))] AS s10,
    Snapshots[OFFSET(DIV(ARRAY_LENGTH(tcpinfo.Snapshots)-1,4))] AS s25,
    Snapshots[OFFSET(DIV(ARRAY_LENGTH(tcpinfo.Snapshots)-1,2))] AS s50,
    Snapshots[OFFSET(3*DIV(ARRAY_LENGTH(tcpinfo.Snapshots)-1,4))] AS s75,
    Snapshots[OFFSET(ARRAY_LENGTH(tcpinfo.Snapshots) - DIV(ARRAY_LENGTH(tcpinfo.Snapshots)-1,10))] AS s90,
    FinalSnapshot AS s100,
  FROM good_ndt5
  JOIN tcpinfo
    ON good_ndt5.UUID = tcpinfo.UUID
    AND good_ndt5.partition_date = tcpinfo.partition_date
  WHERE ARRAY_LENGTH(tcpinfo.Snapshots) > 50
),

intervals AS (
  SELECT
    partition_date,
    ARRAY_LENGTH(Snapshots) AS num_snaps,
    -- EXCEPT(Snapshots, FinalSnapshot, MeanThroughputMBPS, partition_date, duration, s0, s10, s25, s50, s75, s90, s100),
    FinalSnapshot.TCPInfo.BytesAcked,
    TIMESTAMP_DIFF(FinalSnapshot.Timestamp, Snapshots[OFFSET(0)].Timestamp, MICROSECOND)/1000000 AS duration,
    ROUND(MeanThroughputMbps, 3) AS MeanThroughputMBPS,
    -- 8 * bytes / microseconds = bits per microsecond = Mbit/sec.
    ROUND(8*(FinalSnapshot.TCPInfo.BytesAcked - Snapshots[OFFSET(0)].TCPInfo.BytesAcked)/(TIMESTAMP_DIFF(FinalSnapshot.Timestamp, Snapshots[OFFSET(0)].Timestamp, MICROSECOND)), 4) AS fullMBPS,
    ROUND(8*(s100.TCPInfo.BytesAcked - s0.TCPInfo.BytesAcked)/(TIMESTAMP_DIFF(s100.Timestamp, s0.Timestamp, MICROSECOND)), 4) AS MBPSall,
    ROUND(8*(s90.TCPInfo.BytesAcked - s10.TCPInfo.BytesAcked)/(TIMESTAMP_DIFF(s90.Timestamp, s10.Timestamp, MICROSECOND)), 4) AS MBPS9010,
    ROUND(8*(s75.TCPInfo.BytesAcked - s25.TCPInfo.BytesAcked)/(TIMESTAMP_DIFF(s75.Timestamp, s25.Timestamp, MICROSECOND)), 4) AS MBPS7525,
    ROUND(8*(s90.TCPInfo.BytesAcked - s25.TCPInfo.BytesAcked)/(TIMESTAMP_DIFF(s90.Timestamp, s25.Timestamp, MICROSECOND)), 4) AS MBPS9025,
    ROUND(8*(s90.TCPInfo.BytesAcked - s50.TCPInfo.BytesAcked)/(TIMESTAMP_DIFF(s90.Timestamp, s50.Timestamp, MICROSECOND)), 4) AS MBPS9050,
  FROM both
),

medians AS (
  SELECT
    partition_date,
    COUNT(*) AS tests,
    ROUND(AVG(MeanThroughputMBPS),4) AS mean,
    APPROX_QUANTILES(MeanThroughputMBPS, 101)[OFFSET(50)] AS median_mean,
    APPROX_QUANTILES(MBPSall, 101)[OFFSET(50)] AS mbpsALL,
    APPROX_QUANTILES(MBPS9010, 101)[OFFSET(50)] AS mbps9010,
    APPROX_QUANTILES(MBPS7525, 101)[OFFSET(50)] AS mbps7525,
    APPROX_QUANTILES(MBPS9025, 101)[OFFSET(50)] AS mbps9025,
    APPROX_QUANTILES(MBPS9050, 101)[OFFSET(50)] AS mbps9050,
  FROM intervals
  WHERE MeanThroughputMBPS = 0  # This makes a HUGE difference, identifies 5% of tests that look like zero rate.
  GROUP BY partition_date
)

--SELECT * FROM intervals WHERE partition_date = "2020-02-01" AND MeanThroughputMBPS = 0
--ORDER BY fullMBPS DESC
--LIMIT 1000

SELECT * FROM medians WHERE partition_date = "2020-03-01"
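For readers without BigQuery access, the throughput estimate the query takes from the tcpinfo snapshots can be sketched in Python. This is only an illustration of the arithmetic, not the code used to produce the numbers above: the `Snapshot` class, its field names, and the `window_mbps` helper are hypothetical, and the percentile indexing is simplified relative to the query's `OFFSET` expressions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Snapshot:
    """Hypothetical stand-in for one tcpinfo snapshot."""
    timestamp_us: int  # snapshot time, in microseconds
    bytes_acked: int   # cumulative TCPInfo.BytesAcked at that time


def mbps(first: Snapshot, last: Snapshot) -> Optional[float]:
    """Estimate throughput between two snapshots.

    8 * bytes / microseconds = bits per microsecond = Mbit/sec,
    the same unit trick the query relies on.
    """
    elapsed_us = last.timestamp_us - first.timestamp_us
    if elapsed_us <= 0:
        return None  # no usable interval
    return 8 * (last.bytes_acked - first.bytes_acked) / elapsed_us


def window_mbps(snaps: List[Snapshot], lo_pct: int, hi_pct: int) -> Optional[float]:
    """Throughput between the snapshots at lo_pct and hi_pct of the list.

    Integer arithmetic picks the indexes, so (10, 90) roughly
    corresponds to the query's MBPS9010 column.
    """
    n = len(snaps)
    lo = (n - 1) * lo_pct // 100
    hi = (n - 1) * hi_pct // 100
    return mbps(snaps[lo], snaps[hi])


# Synthetic connection: 101 snapshots, 100 ms apart, 1.25 MB acked per interval.
snaps = [Snapshot(timestamp_us=i * 100_000, bytes_acked=i * 1_250_000) for i in range(101)]
full = window_mbps(snaps, 0, 100)
middle = window_mbps(snaps, 10, 90)
print(full, middle)
```

A test whose estimate here is well above zero while its recorded MeanThroughputMbps is 0 matches the pattern described in this issue.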

pboothe commented 4 years ago

We can fix the bad ones in BQ, but we should stop collecting bad data.

Probably a race condition in the S2C code. We should fix the race condition.

pboothe commented 4 years ago

Soltesz found a race condition in S2C.

pboothe commented 4 years ago

These fields have "Error" set, when they should not. The test should be marked as successful and have the data saved.

pboothe commented 4 years ago

Possibly addressed by https://github.com/m-lab/ndt-server/pull/285