m-lab / traceroute-caller

A sidecar service which runs traceroute after a connection closes
Apache License 2.0

Traces in traceroute-caller may be timing out a lot #102

Closed. gfr10598 closed this issue 3 years ago.

gfr10598 commented 3 years ago

I made two changes: adding a metric for trace latency, and changing panics to errors.
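For reference, a minimal sketch of what such a latency metric could look like; the variable name is hypothetical and the buckets are simply inferred from the output below, so this is not necessarily the code that was merged:

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// traceTimeHistogram records how long each trace takes, labeled by outcome
// ("success" or "error"). The bucket boundaries mirror the values visible in
// the first histogram dump below.
var traceTimeHistogram = promauto.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "trace_time_seconds",
		Help:    "Duration of individual traceroute runs in seconds.",
		Buckets: []float64{10, 21.5, 46.4, 100, 215, 464, 1000},
	},
	[]string{"outcome"},
)
```

Each trace would then be wrapped with something like `start := time.Now()` before the run and `traceTimeHistogram.WithLabelValues(outcome).Observe(time.Since(start).Seconds())` afterwards.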

In testing on sandbox, I noticed that roughly 90% of traces are timing out:

# TYPE trace_time_seconds histogram
trace_time_seconds_bucket{outcome="error",le="10"} 0
trace_time_seconds_bucket{outcome="error",le="21.5"} 0
trace_time_seconds_bucket{outcome="error",le="46.4"} 0
trace_time_seconds_bucket{outcome="error",le="100"} 0
trace_time_seconds_bucket{outcome="error",le="215"} 0
trace_time_seconds_bucket{outcome="error",le="464"} 90
trace_time_seconds_bucket{outcome="error",le="1000"} 90
trace_time_seconds_count{outcome="error"} 90
trace_time_seconds_bucket{outcome="success",le="10"} 1
trace_time_seconds_bucket{outcome="success",le="21.5"} 2
trace_time_seconds_bucket{outcome="success",le="46.4"} 2
trace_time_seconds_bucket{outcome="success",le="100"} 3
trace_time_seconds_bucket{outcome="success",le="215"} 8
trace_time_seconds_bucket{outcome="success",le="464"} 12
trace_time_seconds_bucket{outcome="success",le="1000"} 12
trace_time_seconds_count{outcome="success"} 12

I'm running again after fixing a bug in error handling, but I suspect we have been overlooking this problem for some time and have been unaware of it because traceroute parsing has also been having a lot of problems.

gfr10598 commented 3 years ago

After changing the timeout to 30 minutes, still fewer than 20% of traces are succeeding, with a good distribution of times.

trace_time_seconds_bucket{outcome="success",le="4.64"} 0
trace_time_seconds_bucket{outcome="success",le="6.81"} 19
trace_time_seconds_bucket{outcome="success",le="10"} 43
trace_time_seconds_bucket{outcome="success",le="14.7"} 49
trace_time_seconds_bucket{outcome="success",le="21.5"} 57
trace_time_seconds_bucket{outcome="success",le="31.6"} 60
trace_time_seconds_bucket{outcome="success",le="46.4"} 63
trace_time_seconds_bucket{outcome="success",le="68.1"} 71
trace_time_seconds_bucket{outcome="success",le="100"} 82
trace_time_seconds_bucket{outcome="success",le="147"} 90
trace_time_seconds_bucket{outcome="success",le="215"} 97
trace_time_seconds_bucket{outcome="success",le="316"} 105
trace_time_seconds_bucket{outcome="success",le="464"} 133
trace_time_seconds_bucket{outcome="success",le="681"} 138
trace_time_seconds_bucket{outcome="success",le="1000"} 147
trace_time_seconds_bucket{outcome="success",le="1470"} 156
trace_time_seconds_bucket{outcome="success",le="2150"} 200

trace_time_seconds_count{outcome="success"} 200
traces_crashed_total{type="scamper"} 1084
traces_in_progress{type="scamper-daemon"} 41
traces_performed_total{type="scamper-daemon"} 1284
traces_scamper_daemon_running 1
traces_skipped_total{type="scamper"} 4498

Looking at the logs, it appears that the majority of the failures are from our OAM IP addresses.
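For context, the timeout in question bounds how long a single trace is allowed to run before it is counted as an error. A minimal sketch of that pattern, assuming an exec-based invocation; the real code path goes through the pipe library mentioned below, and the scamper arguments here are placeholders rather than the exact invocation traceroute-caller uses:

```go
package tracer

import (
	"context"
	"os/exec"
	"time"
)

// traceTimeout is the per-trace deadline discussed above, raised here to 30 minutes.
const traceTimeout = 30 * time.Minute

// runTrace runs a single trace command and kills it if it exceeds the deadline.
func runTrace(ctx context.Context, dest string) ([]byte, error) {
	ctx, cancel := context.WithTimeout(ctx, traceTimeout)
	defer cancel()

	// CommandContext kills the child process when the context expires,
	// so a hung trace cannot run forever.
	cmd := exec.CommandContext(ctx, "scamper", "-O", "json", "-i", dest)
	return cmd.Output()
}
```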

gfr10598 commented 3 years ago

I tried switching to scamper instead of scamper-daemon, and also changed the cache timeout to 30m. It seems to work much better! No failures, and generally much faster completion. All 13 traces completed in under 12 minutes, whereas with the daemon about 25% of successful traces took more than 24 minutes, and 80% of traces failed.

However, it looks like the scamper processes are being left behind as <defunct> entries in ps. It looks like the pipe library isn't reading the exit status, so the processes aren't getting cleaned up.

trace_time_seconds_bucket{outcome="success",le="46.4"} 0
trace_time_seconds_bucket{outcome="success",le="68.1"} 1
trace_time_seconds_bucket{outcome="success",le="100"} 2
trace_time_seconds_bucket{outcome="success",le="147"} 3
trace_time_seconds_bucket{outcome="success",le="215"} 5
trace_time_seconds_bucket{outcome="success",le="316"} 8
trace_time_seconds_bucket{outcome="success",le="464"} 10
trace_time_seconds_bucket{outcome="success",le="681"} 13
trace_time_seconds_bucket{outcome="success",le="1000"} 13

trace_time_seconds_count{outcome="success"} 13
traces_in_progress{type="scamper"} 2

gfr10598 commented 3 years ago

One machine has done 209 traces without the daemon (30 minute cache timeout): no timeouts, and all under 12 minutes. There are also 209 defunct scamper processes. 8-(

Trying pipe.Run instead of pipe.RunTimeout makes no difference.
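The defunct entries are consistent with the child's exit status never being collected: on Linux a terminated child stays in the process table as a zombie until its parent waits on it. Whichever layer starts the scamper process has to call Wait on it; a minimal sketch of that distinction in plain os/exec terms (independent of the pipe library, with "sleep 1" standing in for the scamper child):

```go
package main

import (
	"log"
	"os/exec"
)

func main() {
	// "sleep 1" stands in for a scamper child process here.
	cmd := exec.Command("sleep", "1")

	// Start launches the child but does not collect its exit status. If
	// nothing ever calls Wait, the finished child stays in the process
	// table as <defunct> until the parent exits.
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}

	// Wait reads the exit status so the kernel can reap the child, which
	// is what removes the <defunct> entry from ps.
	if err := cmd.Wait(); err != nil {
		log.Printf("child exited with error: %v", err)
	}
}
```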

gfr10598 commented 3 years ago

I've mistakenly put a lot of comments in #107 that should have gone here instead. I've manually started a few variations of sandbox-saied-trace in sandbox, on the host, neubot, and ndt pods.

The host pods currently look better, but this is likely because I changed the ipcache timeout to 30 minutes, which seems to have vastly reduced the number of traces.
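The reduction makes sense if a destination IP that was traced within the cache window is skipped rather than traced again, so a longer window means fewer traces started. A rough sketch of that kind of time-based cache follows; this is an assumption about the mechanism, not the actual ipcache code:

```go
package ipcache

import (
	"sync"
	"time"
)

// cache remembers when each destination IP was last traced. Entries older
// than the timeout are considered expired, so the IP gets traced again.
type cache struct {
	mu      sync.Mutex
	seen    map[string]time.Time
	timeout time.Duration // e.g. 30 * time.Minute in this experiment
}

func newCache(timeout time.Duration) *cache {
	return &cache{seen: make(map[string]time.Time), timeout: timeout}
}

// ShouldTrace reports whether ip needs a new trace and records it if so.
func (c *cache) ShouldTrace(ip string, now time.Time) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if last, ok := c.seen[ip]; ok && now.Sub(last) < c.timeout {
		return false // traced recently; skip it
	}
	c.seen[ip] = now
	return true
}
```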

In the ndt pods, I have the gfr-params branch running, with a 2h timeout and -w 2, which allows just 20 msec between probes. I expect the latter is having a big impact: no timeouts, much shorter trace times (averaging about 5 minutes), many fewer concurrent traces, but roughly the same number of traces per minute.

gfr10598 commented 3 years ago

I've just pushed to the host pods with some additional code changes: a 90 minute timeout, a 30 minute cache timeout, and the -w 2 param.

SaiedKazemi commented 3 years ago

This issue is most likely related to how scamper handles its request queue in daemon mode: we now see fewer than 0.5% of traces timing out in staging, which is running traceroute-caller v0.8.4 and invokes a separate scamper instance for each request. We will deploy this version to production soon.

I am closing this issue now. In the unlikely case that we see timeouts again, we can either reopen this issue or file a new one.