m-lab / traceroute-caller

A sidecar service which runs traceroute after a connection closes
Apache License 2.0

scamper segfault on mlab2.lga0t #36

Closed yachang closed 4 years ago

yachang commented 4 years ago

A good number of scamper segfaults are showing up on mlab2.lga0t. Is this expected? Log messages look like:

scamper[5963]: segfault at 0 ip 0000558ffcd236f0 sp 00007ffed36cd0f8 error 4 in scamper[558ffcd14000+92000]
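For reading a log line like the one above, here is a minimal sketch that pulls out its fields. `decode_segfault` is a hypothetical helper, not part of scamper or traceroute-caller. It assumes the usual x86 Linux layout of the message, where "at" is the faulting address, "ip" is the instruction pointer, the bracketed pair is the mapping base and size, and "error" is the page-fault error code in hex.

```shell
# Hypothetical helper: decode a kernel "segfault at ..." log line (x86 Linux format assumed).
decode_segfault() {
  # Extract: fault address, instruction pointer, error code, object name, mapping base.
  fields=$(printf '%s\n' "$1" | sed -n 's/.*segfault at \([0-9a-f]*\) ip \([0-9a-f]*\) sp [0-9a-f]* error \([0-9a-f]*\) in \([^[]*\)\[\([0-9a-f]*\)+[0-9a-f]*].*/\1 \2 \3 \4 \5/p')
  [ -n "$fields" ] || return 1
  set -- $fields
  addr=$1 ip=$2 err=$3 obj=$4 base=$5

  # Offset of the faulting instruction inside the mapped region.
  printf 'fault_addr=0x%x offset=0x%x object=%s\n' \
    "$((0x$addr))" "$((0x$ip - 0x$base))" "$obj"

  # Page-fault error code bits: 1 = protection violation, 2 = write, 4 = user mode.
  if [ "$((0x$err & 1))" -ne 0 ]; then cause=protection-violation; else cause=page-not-present; fi
  if [ "$((0x$err & 2))" -ne 0 ]; then access=write; else access=read; fi
  if [ "$((0x$err & 4))" -ne 0 ]; then mode=user; else mode=kernel; fi
  printf 'access=%s mode=%s cause=%s\n' "$access" "$mode" "$cause"
}

decode_segfault 'scamper[5963]: segfault at 0 ip 0000558ffcd236f0 sp 00007ffed36cd0f8 error 4 in scamper[558ffcd14000+92000]'
```

For this line the sketch reports a user-mode read of address 0 at offset 0xf6f0 into the scamper mapping, which is the usual signature of a NULL pointer dereference.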

yachang commented 4 years ago

https://github.com/m-lab/traceroute-caller/blob/7ed26a9034ace3e7deab17fef2b8680ef516ae34/vendor/scamper/scamper-cvs-20190916/scamper/tbit/scamper_tbit.h#L69

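Error 4 is "System error" in the scamper source code linked above. Note, though, that the "error" field in a kernel segfault line is the x86 page-fault error code: 4 means a user-mode read of an unmapped page, and "at 0" is the faulting address, so this may be a NULL pointer dereference rather than a scamper error code.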

yachang commented 4 years ago

I will reboot mlab2.lga0t and see whether I can reproduce the segfault.

yachang commented 4 years ago

Reply from Matthew Luckie:

I'll need a core dump in order to debug this further. The tbit code is not in the execution path for how you use scamper. You said some servers; on what fraction of servers is this occurring?

To get a core dump, you'll need to

CFLAGS='-g' ./configure --disable-privsep

and then recompile scamper. Then use "ulimit -c unlimited" to ensure the OS will create a core dump.

=====================
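The steps in the reply above can be sketched as a few shell functions. This is a rough outline under stated assumptions: the function names are hypothetical, `make` as the recompile step and the gdb invocation are not part of the reply, and the binary and core file paths are placeholders. Where the core file lands depends on the host's core pattern setting.

```shell
# Assumes the current directory is the unpacked scamper source tree.
build_debug_scamper() {
  # Debug symbols, with privilege separation disabled as suggested in the reply.
  CFLAGS='-g' ./configure --disable-privsep && make
}

# Lift the core file size limit for this shell and the processes it starts.
enable_core_dumps() {
  ulimit -c unlimited
}

# Print a backtrace from a core file (assumption: gdb is installed).
# $1 = path to the scamper binary, $2 = path to the core file (both placeholders).
show_backtrace() {
  gdb -batch -ex bt "$1" "$2"
}
```

Typical use would be to call `build_debug_scamper` once, then `enable_core_dumps` in the same shell that launches scamper, and `show_backtrace` after a crash.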

yachang commented 4 years ago

code change:

Push to prod:

https://github.com/m-lab/traceroute-caller/releases/tag/v0.3.2
https://github.com/m-lab/k8s-support/pull/322

lga0t has stopped segfaulting with v0.3.1 since yesterday's sandbox deployment.

yachang commented 4 years ago

I tried to capture a core dump of the segfault:

Program terminated with signal SIGSEGV, Segmentation fault.

#0 0x00005558d2636b68 in scamper_addr_tostr (sa=0x0, dst=0x7ffd88dfc0b0 "", size=128) at scamper_addr.c:864

After Matthew Luckie got the core dump, he sent us a new tarball with the fix.

It is now in: https://github.com/m-lab/traceroute-caller/pull/66

yachang commented 4 years ago

The new Docker image has been running on sandbox for almost a day. There were no segfaults, and the data quality (a low percentage of empty-hop traces with file size < 1K) improved dramatically:

Screenshot from 2019-12-19 11-02-00
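As a rough way to reproduce the "file size < 1K" metric on a directory of trace output, a sketch along these lines could be used. `small_file_percent` is a hypothetical helper, and the assumption that every regular file under the directory is one trace is mine, not from the thread.

```shell
# Hypothetical helper: integer percentage of regular files under a directory
# that are smaller than 1024 bytes.
small_file_percent() {
  dir=$1
  total=$(find "$dir" -type f | wc -l | tr -d ' ')
  # -size -1024c matches files of fewer than 1024 bytes.
  small=$(find "$dir" -type f -size -1024c | wc -l | tr -d ' ')
  if [ "$total" -eq 0 ]; then
    echo "no files found in $dir" >&2
    return 1
  fi
  echo "$((100 * $small / $total))"
}

# Usage (the path is a placeholder):
#   small_file_percent /path/to/traceroute/output
```

The byte-based `-size -1024c` test is used on purpose, because `find` rounds sizes up when given the `k` unit and would then match only empty files.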

yachang commented 4 years ago

We can close this issue when this traceroute-caller version is deployed to prod in the new year.

yachang commented 4 years ago

Problem fixed:

https://github.com/m-lab/traceroute-caller/releases/tag/v0.6.0

m-lab/k8s-support#347