m-lab / traceroute-caller

A sidecar service which runs traceroute after a connection closes
Apache License 2.0
18 stars 5 forks source link

Number of tests drop dramatically compared to paris traceroute binary #27

Closed yachang closed 4 years ago

yachang commented 4 years ago

I compare the site traffic in 2019/05 (Paris Tracerout) vs. 2019/08 (scamper):

mlab3-den04: 609300, 63800 mlab3-beg01: 660600, 91800 mlab2-mil02: 465900, 149437 mlab2-sea06: 260300, 21000

One possible reason is that scamper test took longer time to complete compared to PT. So the 60 seconds timeout will make most of scamper test fail in the middle.

Attached BigQuery used:

legacySQL

SELECT

COUNT(TestTime) AS num

FROM (

SELECT TestTime

FROM [mlab-oti.batch.traceroute]

WHERE DATE(_PARTITIONTIME) BETWEEN DATE("2019-05-01") AND DATE("2019-06-01") AND uuid = "" AND Parseinfo.TaskFileName CONTAINS "mlab2-sea06"

)

====================

legacySQL

SELECT

COUNT(uuid) AS num

FROM (

SELECT uuid

FROM [mlab-oti.batch.traceroute]

WHERE   DATE(_PARTITIONTIME) BETWEEN DATE("2019-08-01") AND DATE("2019-09-01")   AND Parseinfo.TaskFileName CONTAINS "mlab2-sea06" )

stephen-soltesz commented 4 years ago

Factors to review:

stephen-soltesz commented 4 years ago
yachang commented 4 years ago

I will do the following things:

  1. sync the traceroute tests from other experiments besides NDT from pusher bucket to archive bucket. And make sure ETL process those tests

  2. compare the NDT tests on new platform, and verify that most of them got corresponding traceroute tests.

pboothe commented 4 years ago

How many new-platform S2C tests in September have a traceroute with the same UUID as the s2c channel's UUID?

SELECT COUNT(*) as total
FROM `measurement-lab.ndt.ndt5` as ndt5 INNER JOIN `mlab-oti.base_tables.traceroute` as troute on ndt5.result.s2c.uuid = troute.uuid
WHERE result.S2C IS NOT NULL
  AND ndt5.partition_date BETWEEN DATE("2019-09-01") AND DATE("2019-10-01")
  AND ndt5.result.S2C.uuid IS NOT NULL
  AND ndt5.result.S2C.uuid != ""
  AND ndt5.result.Control.uuid IS NOT NULL
  AND ndt5.result.Control.uuid != ""
  AND EXTRACT(DATE FROM troute._PARTITIONTIME) BETWEEN DATE("2019-09-01") AND DATE("2019-10-01")
  AND troute.uuid IS NOT NULL
  AND troute.uuid != ""

3,922,466

How many new-platform S2C tests in September have a traceroute with the same UUID as the control channel's UUID?

SELECT COUNT(*) as total
FROM `measurement-lab.ndt.ndt5` as ndt5 INNER JOIN `mlab-oti.base_tables.traceroute` as troute on ndt5.result.control.uuid = troute.uuid
WHERE result.S2C IS NOT NULL
  AND ndt5.partition_date BETWEEN DATE("2019-09-01") AND DATE("2019-10-01")
  AND ndt5.result.S2C.uuid IS NOT NULL
  AND ndt5.result.S2C.uuid != ""
  AND ndt5.result.Control.uuid IS NOT NULL
  AND ndt5.result.Control.uuid != ""
  AND EXTRACT(DATE FROM troute._PARTITIONTIME) BETWEEN DATE("2019-09-01") AND DATE("2019-10-01")
  AND troute.uuid IS NOT NULL
  AND troute.uuid != ""

3,952,256

How many new-platform S2C tests in September were there total?

SELECT COUNT(*) as total
FROM `measurement-lab.ndt.ndt5` as ndt5
WHERE result.S2C IS NOT NULL
  AND ndt5.partition_date BETWEEN DATE("2019-09-01") AND DATE("2019-10-01")
  AND ndt5.result.S2C.uuid IS NOT NULL
  AND ndt5.result.S2C.uuid != ""
  AND ndt5.result.Control.uuid IS NOT NULL
  AND ndt5.result.Control.uuid != ""

16,376,595

We expect that a given IP will only have one traceroute performed to it, and we don't cache and return traceroute results (currently filing a bug about that) so these two sets should be non-overlapping. This indicates that (3,922,466 + 3,952,256) / 16,376,595 = 0.48085 = 48% of tests have a corresponding traceroute.

When I rewrote the query to ask "How many S2C test IPs have a corresponding traceroute in September?", using the following query:

SELECT COUNT(*) as total
FROM `measurement-lab.ndt.ndt5` as ndt5 LEFT JOIN `mlab-oti.base_tables.traceroute` as troute on ndt5.result.s2c.ClientIP = troute.Destination.IP
WHERE result.S2C IS NOT NULL
  AND ndt5.partition_date BETWEEN DATE("2019-09-01") AND DATE("2019-10-01")
  AND ndt5.result.S2C.uuid IS NOT NULL
  AND ndt5.result.S2C.uuid != ""
  AND ndt5.result.ClientIP IS NOT NULL
  AND ndt5.result.ClientIP != ""
  AND ndt5.result.Control.uuid IS NOT NULL
  AND ndt5.result.Control.uuid != ""
  AND EXTRACT(DATE FROM troute._PARTITIONTIME) BETWEEN DATE("2019-09-01") AND DATE("2019-10-01")
  AND troute.uuid IS NOT NULL
  AND troute.uuid != ""
  AND troute.Destination.IP IS NOT NULL
  AND troute.Destination.IP != ""

15,244,749

The answer was 15,244,749, and 15,244,749 / 16,376,595 = .930886 = 93%

So, 93% of NDT test IPs had a corresponding traceroute. The sky is definitely not falling, but traceroute-caller has some bugs that can accidentally make it look like the sky is falling. The new platform is still okay.

pboothe commented 4 years ago

Added #28 and #29 as follow-up issues.

mehtab11 commented 4 years ago

Fixed; closing it.