m-lab / etl

M-Lab ingestion pipeline
Apache License 2.0
22 stars 7 forks source link

traceroute parser panic over traceroute tests #775

Open yachang opened 4 years ago

yachang commented 4 years ago

[FIRING:1] ETL_ParserPanicNonZero An ETL parser panicked aef-etl--traceroute--parser-20191121t163921-8jvx Bugs cause panics. This bug should be fixed. Parsers run in AppEngine. Check logs to see the panic stack trace. Identify the archive that led to the panic (logs or TaskQueue tasks with many retries). Fix the bug or create a new issue describing the failure and linking to the triggering archive.(edit...) alertname=ETL_ParserPanicNonZero, instance=aef-etl--traceroute--parser-20191121t163921-8jvx, job=aeflex-targets, max_total_instances=20, project=mlab-staging, public_protocol=tcp, repo=dev-tracker, service=etl-traceroute-parser, severity=ticket, source=worker, version=20191121t163921, vm_debug_enabled=false,

========================================

Here is the log:

2019-12-16T16:37:26Z 2019/12/16 16:37:26 metrics.go:590: Adding metrics for panic: runtime error: slice bounds out of range A 2019-12-16T16:37:26Z goroutine 748707 [running]: A 2019-12-16T16:37:26Z runtime/debug.Stack(0xe7a360, 0xc0006eda40, 0xc0006eda40) A 2019-12-16T16:37:26Z /home/travis/.gimme/versions/go1.12.linux.amd64/src/runtime/debug/stack.go:24 +0x9d A 2019-12-16T16:37:26Z runtime/debug.PrintStack() A 2019-12-16T16:37:26Z /home/travis/.gimme/versions/go1.12.linux.amd64/src/runtime/debug/stack.go:16 +0x22 A 2019-12-16T16:37:26Z github.com/m-lab/etl/metrics.CountPanics(0xc35260, 0x15c7ec0, 0xd3c173, 0x6) A 2019-12-16T16:37:26Z /home/travis/gopath/src/github.com/m-lab/etl/metrics/metrics.go:592 +0x261 A 2019-12-16T16:37:26Z main.handleRequest.func1() A 2019-12-16T16:37:26Z /home/travis/gopath/src/github.com/m-lab/etl/cmd/etl_worker/etl_worker.go:109 +0x58 A 2019-12-16T16:37:26Z panic(0xc35260, 0x15c7ec0) A 2019-12-16T16:37:26Z /home/travis/.gimme/versions/go1.12.linux.amd64/src/runtime/panic.go:522 +0x1b5 A 2019-12-16T16:37:26Z github.com/m-lab/etl/parser.ParseFirstLine(0xc00001cb00, 0x54, 0x1, 0x0, 0x0, 0xb, 0xc00058e180, 0xc, 0xc, 0x8b6942) A 2019-12-16T16:37:26Z /home/travis/gopath/src/github.com/m-lab/etl/parser/pt.go:405 +0x6ac A 2019-12-16T16:37:26Z github.com/m-lab/etl/parser.Parse(0xc000814720, 0xc000036120, 0x55, 0xc00007c0e0, 0x64, 0xc000218600, 0x2b2, 0x600, 0xd3facd, 0xa, ...) A 2019-12-16T16:37:26Z /home/travis/gopath/src/github.com/m-lab/etl/parser/pt.go:750 +0x11d9 A 2019-12-16T16:37:26Z github.com/m-lab/etl/parser.(PTParser).ParseAndInsert(0xc000656070, 0xc000814720, 0xc000036120, 0x55, 0xc000218600, 0x2b2, 0x600, 0x0, 0x0) A 2019-12-16T16:37:26Z /home/travis/gopath/src/github.com/m-lab/etl/parser/pt.go:537 +0x80a A 2019-12-16T16:37:26Z github.com/m-lab/etl/task.(Task).ProcessAllTests(0xc000814780, 0x0, 0x0, 0x0) A 2019-12-16T16:37:26Z /home/travis/gopath/src/github.com/m-lab/etl/task/task.go:120 +0xf23 A 2019-12-16T16:37:26Z github.com/m-lab/etl/worker.ProcessTask(0xc000162c40, 0x6a, 0x0, 0x0, 0x0) A 2019-12-16T16:37:26Z /home/travis/gopath/src/github.com/m-lab/etl/worker/worker.go:90 +0xe87 A 2019-12-16T16:37:26Z main.subworker(0xc00040cab0, 0x90, 0x0, 0x0, 0xc7b4bd2ba, 0x0, 0xc0010b5900, 0x18) A 2019-12-16T16:37:26Z /home/travis/gopath/src/github.com/m-lab/etl/cmd/etl_worker/etl_worker.go:179 +0x1f4 A 2019-12-16T16:37:26Z main.handleRequest(0xe70dc0, 0xc0010b5900, 0xc000143800) A 2019-12-16T16:37:26Z /home/travis/gopath/src/github.com/m-lab/etl/cmd/etl_worker/etl_worker.go:158 +0x38c A 2019-12-16T16:37:26Z net/http.HandlerFunc.ServeHTTP(...) A 2019-12-16T16:37:26Z /home/travis/.gimme/versions/go1.12.linux.amd64/src/net/http/server.go:1995 A 2019-12-16T16:37:26Z github.com/m-lab/etl/metrics.DurationHandler.func1(0xe71b80, 0xc000264380, 0xc000143800) A 2019-12-16T16:37:26Z /home/travis/gopath/src/github.com/m-lab/etl/metrics/metrics.go:564 +0xdd A 2019-12-16T16:37:26Z net/http.HandlerFunc.ServeHTTP(0xc000416c00, 0xe71b80, 0xc000264380, 0xc000143800) A 2019-12-16T16:37:26Z /home/travis/.gimme/versions/go1.12.linux.amd64/src/net/http/server.go:1995 +0x44 A 2019-12-16T16:37:26Z net/http.(ServeMux).ServeHTTP(0x1781680, 0xe71b80, 0xc000264380, 0xc000143800) A 2019-12-16T16:37:26Z /home/travis/.gimme/versions/go1.12.linux.amd64/src/net/http/server.go:2375 +0x1d6 A 2019-12-16T16:37:26Z net/http.serverHandler.ServeHTTP(0xc00011f520, 0xe71b80, 0xc000264380, 0xc000143800) A 2019-12-16T16:37:26Z /home/travis/.gimme/versions/go1.12.linux.amd64/src/net/http/server.go:2774 +0xab A 2019-12-16T16:37:26Z net/http.(conn).serve(0xc00080b7c0, 0xe738c0, 0xc00188f540) A 2019-12-16T16:37:26Z /home/travis/.gimme/versions/go1.12.linux.amd64/src/net/http/server.go:1878 +0x84c A 2019-12-16T16:37:26Z created by net/http.(Server).Serve A 2019-12-16T16:37:26Z /home/travis/.gimme/versions/go1.12.linux.amd64/src/net/http/server.go:2884 +0x2f4 A 2019-12-16T16:37:26Z 2019/12/16 16:37:26 server.go:3012: http: panic serving 172.17.0.4:7896: runtime error: slice bounds out of range A 2019-12-16T16:37:26Z goroutine 748707 [running]: A 2019-12-16T16:37:26Z net/http.(conn).serve.func1(0xc00080b7c0) A 2019-12-16T16:37:26Z /home/travis/.gimme/versions/go1.12.linux.amd64/src/net/http/server.go:1769 +0xc9 A 2019-12-16T16:37:26Z panic(0xc35260, 0x15c7ec0) A 2019-12-16T16:37:26Z /home/travis/.gimme/versions/go1.12.linux.amd64/src/runtime/panic.go:522 +0x1b5 A 2019-12-16T16:37:26Z github.com/m-lab/etl/metrics.CountPanics(0xc35260, 0x15c7ec0, 0xd3c173, 0x6) A 2019-12-16T16:37:26Z /home/travis/gopath/src/github.com/m-lab/etl/metrics/metrics.go:593 +0x27f A 2019-12-16T16:37:26Z main.handleRequest.func1() A 2019-12-16T16:37:26Z /home/travis/gopath/src/github.com/m-lab/etl/cmd/etl_worker/etl_worker.go:109 +0x58 A 2019-12-16T16:37:26Z panic(0xc35260, 0x15c7ec0) A 2019-12-16T16:37:26Z /home/travis/.gimme/versions/go1.12.linux.amd64/src/runtime/panic.go:522 +0x1b5 A 2019-12-16T16:37:26Z github.com/m-lab/etl/parser.ParseFirstLine(0xc00001cb00, 0x54, 0x1, 0x0, 0x0, 0xb, 0xc00058e180, 0xc, 0xc, 0x8b6942) A 2019-12-16T16:37:26Z /home/travis/gopath/src/github.com/m-lab/etl/parser/pt.go:405 +0x6ac A 2019-12-16T16:37:26Z github.com/m-lab/etl/parser.Parse(0xc000814720, 0xc000036120, 0x55, 0xc00007c0e0, 0x64, 0xc000218600, 0x2b2, 0x600, 0xd3facd, 0xa, ...) A 2019-12-16T16:37:26Z /home/travis/gopath/src/github.com/m-lab/etl/parser/pt.go:750 +0x11d9 A 2019-12-16T16:37:26Z github.com/m-lab/etl/parser.(PTParser).ParseAndInsert(0xc000656070, 0xc000814720, 0xc000036120, 0x55, 0xc000218600, 0x2b2, 0x600, 0x0, 0x0) A 2019-12-16T16:37:26Z /home/travis/gopath/src/github.com/m-lab/etl/parser/pt.go:537 +0x80a A 2019-12-16T16:37:26Z github.com/m-lab/etl/task.(Task).ProcessAllTests(0xc000814780, 0x0, 0x0, 0x0) A 2019-12-16T16:37:26Z /home/travis/gopath/src/github.com/m-lab/etl/task/task.go:120 +0xf23 A 2019-12-16T16:37:26Z github.com/m-lab/etl/worker.ProcessTask(0xc000162c40, 0x6a, 0x0, 0x0, 0x0) A 2019-12-16T16:37:26Z /home/travis/gopath/src/github.com/m-lab/etl/worker/worker.go:90 +0xe87 A 2019-12-16T16:37:26Z main.subworker(0xc00040cab0, 0x90, 0x0, 0x0, 0xc7b4bd2ba, 0x0, 0xc0010b5900, 0x18) A 2019-12-16T16:37:26Z /home/travis/gopath/src/github.com/m-lab/etl/cmd/etl_worker/etl_worker.go:179 +0x1f4 A 2019-12-16T16:37:26Z main.handleRequest(0xe70dc0, 0xc0010b5900, 0xc000143800) A 2019-12-16T16:37:26Z /home/travis/gopath/src/github.com/m-lab/etl/cmd/etl_worker/etl_worker.go:158 +0x38c A 2019-12-16T16:37:26Z net/http.HandlerFunc.ServeHTTP(...) A 2019-12-16T16:37:26Z /home/travis/.gimme/versions/go1.12.linux.amd64/src/net/http/server.go:1995 A 2019-12-16T16:37:26Z github.com/m-lab/etl/metrics.DurationHandler.func1(0xe71b80, 0xc000264380, 0xc000143800) A 2019-12-16T16:37:26Z /home/travis/gopath/src/github.com/m-lab/etl/metrics/metrics.go:564 +0xdd A 2019-12-16T16:37:26Z net/http.HandlerFunc.ServeHTTP(0xc000416c00, 0xe71b80, 0xc000264380, 0xc000143800) A 2019-12-16T16:37:26Z /home/travis/.gimme/versions/go1.12.linux.amd64/src/net/http/server.go:1995 +0x44 A 2019-12-16T16:37:26Z net/http.(ServeMux).ServeHTTP(0x1781680, 0xe71b80, 0xc000264380, 0xc000143800) A 2019-12-16T16:37:26Z /home/travis/.gimme/versions/go1.12.linux.amd64/src/net/http/server.go:2375 +0x1d6 A 2019-12-16T16:37:26Z net/http.serverHandler.ServeHTTP(0xc00011f520, 0xe71b80, 0xc000264380, 0xc000143800) A 2019-12-16T16:37:26Z /home/travis/.gimme/versions/go1.12.linux.amd64/src/net/http/server.go:2774 +0xab A 2019-12-16T16:37:26Z net/http.(conn).serve(0xc00080b7c0, 0xe738c0, 0xc00188f540) A 2019-12-16T16:37:26Z /home/travis/.gimme/versions/go1.12.linux.amd64/src/net/http/server.go:1878 +0x84c A 2019-12-16T16:37:26Z created by net/http.(*Server).Serve A 2019-12-16T16:37:26Z /home/travis/.gimme/versions/go1.12.linux.amd64/src/net/http/server.go:2884 +0x2f4 A

yachang commented 4 years ago

The reason is that in staging, we deploy a latest paris-traceroute binary, which generate test w/ different format compared to legacy paris traceroute output:

sample of old one:

=================

traceroute [(163.7.129.76:33465) -> (193.188.22.20:1004)], protocol icmp, algo exhaustive, duration 23 s 1 P(6, 6) 163.7.129.65 (163.7.129.65) 0.168/0.193/0.242/0.024 ms 2 P(6, 6) UNASSIGNED.core.vf.net.nz (203.109.152.30) 10.756/10.854/11.049/0.108 ms 3 P(6, 6) UNASSIGNED.core.vf.net.nz (203.109.152.29) 11.683/11.812/11.909/0.073 ms 4 P(6, 6) 134.159.174.29 (134.159.174.29) 13.139/14.388/15.946/0.996 ms 5 P(6, 6) i-0-4-0-1.tauc-core02.bi.telstraglobal.net (202.84.223.105) 12.981/15.160/15.876/0.996 ms 6 P(6, 6) 202.84.138.82 (202.84.138.82) 139.602/375.086/653.291/201.867 ms 7 P(6, 6) i-0-6-0-1.eqnx-core02.telstraglobal.net (202.84.143.206) 145.950/146.887/147.722/0.619 ms 8 P(6, 6) i-92.eqnx03.telstraglobal.net (202.84.247.17) 145.654/145.860/146.516/0.297 ms

====================

sample of new one:

traceroute to 35.229.17.149 (35.229.17.149), 30 hops max, 30 bytes packets 1 38.109.21.1 (38.109.21.1) 0.481ms 0.498ms 0.518ms
2 be2369.ccr21.mia03.atlas.cogentco.com (154.54.45.157) 0.558ms 0.554ms 1.108ms
3 tata.mia03.atlas.cogentco.com (154.54.9.46) 0.370ms 0.365ms 0.369ms
4 72.14.215.97 (72.14.215.97) 1.148ms 1.157ms 1.174ms
5 108.170.253.18 (108.170.253.18) 1.049ms 1.019ms 1.022ms
6 108.170.231.84 (108.170.231.84) 1.761ms 1.758ms 1.762ms
7 72.14.236.94 (72.14.236.94) 26.882ms 26.891ms 26.910ms
8 216.239.35.187 (216.239.35.187) 29.932ms 29.911ms 29.907ms
9 10 11 *

===========================

So the parser simply crash during parsing the new data.

pboothe commented 4 years ago

New Paris Traceroute fallback has a different output format.

pboothe commented 4 years ago

Addressed by not using paris-traceroute as a backup for now, but instead using scamper in single-run mode instead of in daemon mode.

https://github.com/m-lab/traceroute-caller/pull/64

This is still open, because the formats don't match, but it is now P3 or P4.

gfr10598 commented 4 years ago

Please ensure that we resolve the code bug that resulted in a panic. Instead, it should result in an error metric and an "ERROR:" log message.