Spurious segfault during report generation (?)

ingomueller-net commented 4 years ago

I am running a CI target of a native Python module compiled with -fno-omit-frame-pointer -fsanitize=address,undefined -fno-sanitize=vptr. I am using stock Python and preload ASan as described here.

For the past few days, the following has been happening sporadically:

...
=========== 2196 passed, 14 skipped, 20 warnings in 4249.02 seconds ============
Tracer caught signal 11: addr=0x0 pc=0x7f7860e5ac39 sp=0x7f783a61ed30
==413==LeakSanitizer has encountered a fatal error.
==413==HINT: For debugging, try setting environment variable LSAN_OPTIONS=verbosity=1:log_threads=1
==413==HINT: LeakSanitizer does not work under ptrace (strace, gdb, etc)

The line starting with ==== is usually the last line printed by the program. When I rerun the test target, it usually completes without a problem.

How can I debug this? I can't run the program in gdb (or can I?), I can't make it produce a core dump when it segfaults (or can I?), I can't make it print a stack trace when it segfaults (or can I?), so how can I find out what is happening? The only information I seem to get is pc=0x7f7860e5ac39 sp=0x7f783a61ed30. What do those mean?
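For reference, in a message like the one above, pc is normally the program counter (the address of the instruction that faulted) and sp is the stack pointer at that moment. A raw pc only becomes useful once it is matched against the memory map of the same process run, since load addresses change between runs. The following is a minimal sketch, assuming a copy of /proc/<pid>/maps from the failing run is available; the function name `find_mapping`, the sample mapping, and the library path are made up for illustration and do not come from the actual CI job.

```python
def find_mapping(maps_text, address):
    """Return (path, file_offset) of the mapping containing `address`, or None.

    `maps_text` is the content of /proc/<pid>/maps, whose lines look like:
    start-end perms offset dev inode [pathname]
    """
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # skip malformed or empty lines
        start_s, end_s = fields[0].split("-")
        start, end = int(start_s, 16), int(end_s, 16)
        if start <= address < end:
            # Anonymous mappings have no pathname field.
            path = fields[5].strip() if len(fields) > 5 else ""
            # Offset of the address within the backing file.
            file_offset = address - start + int(fields[2], 16)
            return path, file_offset
    return None


# Hypothetical maps excerpt; a real one must come from the crashing process.
SAMPLE_MAPS = """\
7f7860d00000-7f7860e00000 r--p 00000000 08:01 1234 /hypothetical/libexample.so
7f7860e00000-7f7860f00000 r-xp 00010000 08:01 1234 /hypothetical/libexample.so
7f783a600000-7f783a620000 rw-p 00000000 00:00 0
"""

hit = find_mapping(SAMPLE_MAPS, 0x7F7860E5AC39)
print(hit[0], hex(hit[1]))
```

The resulting path tells which shared object the faulting instruction belongs to; whether the offset can be handed directly to a symbolizer depends on how that object is laid out, so treat it as a starting point only.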

I have tried ASAN_OPTIONS=handle_segv=0 and similar, but none changed the behavior.

Note that our CI runs a single process (Python running pytest) that runs for about 72 minutes. I somehow suspect that it fails because "it runs for too long"; at least, that would explain why this has started to happen over time with no apparent change other than adding more tests...

ingomueller-net commented 4 years ago

I am now running into this problem again. This time, all commits after one specific commit fail with the above error every single time. Interestingly, that commit has (seemingly?) nothing to do with the C++ module I am debugging and just changes some imports of Python (!) modules.

However, the problem only occurs when the tests are run by the GitLab CI runner. I have tried reproducing it by running the same test in the same Docker image, but that works. I have even tried logging into the running Docker container and manually running the same test that CI runs, after copying all environment variables from the original, concurrently running CI job (as described here). My manual invocation works, but the CI job fails with the above error.
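The environment-copying step can be sketched as follows. This assumes the variables are taken from /proc/<pid>/environ of the CI process, which may or may not be what the linked description does; the function name `parse_environ` and the sample bytes are hypothetical. On Linux that file holds the environment the process was started with, as NUL-separated KEY=VALUE entries, and reading it requires sufficient permissions.

```python
def parse_environ(raw):
    """Parse the NUL-separated content of /proc/<pid>/environ into a dict."""
    env = {}
    for entry in raw.split(b"\0"):
        if b"=" not in entry:
            continue  # skip the empty trailing entry and malformed items
        # Split on the first '=' only, because values may contain '='.
        key, _, value = entry.partition(b"=")
        env[key.decode("utf-8", errors="surrogateescape")] = value.decode(
            "utf-8", errors="surrogateescape"
        )
    return env


# Made-up sample; a real value would come from something like
# open(f"/proc/{pid}/environ", "rb").read()
sample = b"ASAN_OPTIONS=handle_segv=0\0HOME=/root\0"
copied_env = parse_environ(sample)
print(copied_env)
```

The resulting dict could then be passed as the `env` argument of `subprocess.run` to start the manual test run with the same variables. Note that this does not reproduce other differences between the two invocations, such as the parent process, open file descriptors, or resource limits.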

Also, I have tried the sanitizers in LLVM 11 with the same result.

I suspect that some random factor like memory layout or similar changes whether or not the problem occurs.

The important question is: how can I debug this further?

ingomueller-net commented 4 years ago

The workaround described in #1322, setting ASAN_OPTIONS=intercept_tls_get_addr=0, seems to be working for me. Thanks, @InverseRE, for linking to my issue!
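For a test run launched from Python, the workaround can be applied without clobbering options that are already set. This is a minimal sketch, assuming the options in ASAN_OPTIONS are separated by colons and that no option value itself contains a colon; the helper name `with_sanitizer_option` is hypothetical.

```python
def with_sanitizer_option(env, var, key, value):
    """Return a copy of `env` with `key=value` set in the option string `var`."""
    current = [opt for opt in env.get(var, "").split(":") if opt]
    # Drop any previous setting of the same key so the new one wins.
    kept = [opt for opt in current if opt.split("=", 1)[0] != key]
    kept.append(f"{key}={value}")
    new_env = dict(env)
    new_env[var] = ":".join(kept)
    return new_env


base_env = {"ASAN_OPTIONS": "detect_leaks=1:intercept_tls_get_addr=1"}
env = with_sanitizer_option(base_env, "ASAN_OPTIONS", "intercept_tls_get_addr", "0")
print(env["ASAN_OPTIONS"])
```

In a real job, `base_env` would typically be `os.environ`, and the returned dict would be passed as `env=` to `subprocess.run` when starting pytest.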

ingomueller-net commented 4 years ago

Another piece of information that may be useful to somebody: all previous attempts (which failed) were carried out with Docker images based on Ubuntu bionic, which uses glibc v2.27. I have just updated to Ubuntu focal, which uses glibc v2.31, and I get the same behaviour there.

google / sanitizers

Spurious segfault during report generation (?) #1267