4 test failures when running the rr test suite in NUC Skull Canyon due to LD_LIBRARY_PATH env

agentzh commented 7 years ago

I built a fresh rr from the latest git master source today (commit 388856644). Running the whole test suite with make test gives 4 failures after more than 1 hour's run:

99% tests passed, 4 tests failed out of 2068

Total Test time (real) = 4129.16 sec

The following tests FAILED:
        1967 - fork_exec_info_thr-32 (Failed)
        1968 - fork_exec_info_thr-32-no-syscallbuf (Failed)
        2053 - switch_processes-32 (Failed)
        2054 - switch_processes-32-no-syscallbuf (Failed)
Errors while running CTest
Makefile:149: recipe for target 'test' failed
make: *** [test] Error 8

Then I ran these test cases individual with full output and they were still failing:

https://gist.github.com/agentzh/2befb55b311067bddb321132426f0958

I'm using the bare metal running Ubuntu 16.04:

agentzh@nuc ~/git/rr/obj $ cat /etc/issue
Ubuntu 16.04.2 LTS \n \l

The CPU info:

agentzh@nuc ~/git/rr/obj $ cat /proc/cpuinfo
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 94
model name  : Intel(R) Core(TM) i7-6770HQ CPU @ 2.60GHz
stepping    : 3
microcode   : 0x8a
cpu MHz     : 3441.242
cache size  : 6144 KB
physical id : 0
siblings    : 8
core id     : 0
cpu cores   : 4
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 22
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bugs        :
bogomips    : 5184.46
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

...

I have 16G DDR4 RAM.

GCC version:

agentzh@nuc ~/git/rr/obj $ gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

GDB version:

agentzh@nuc ~/git/rr/obj $ gdb --version
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".

Are these real problems? Are they known issues? Do you need any more details from my side for diagnosing it?

Using rr to debug my own programs does seem to work though.

rocallahan commented 7 years ago

These probably aren't real problems.

agentzh commented 7 years ago

More info about the kernel:

agentzh@nuc ~/git/rr/obj $ uname -a
Linux nuc 4.4.0-77-generic #98-Ubuntu SMP Wed Apr 26 08:34:02 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

agentzh commented 7 years ago

@rocallahan OK, thanks!

rocallahan commented 7 years ago

This is the weird bit in switch_processes_32:

2053: (rr) c
2053: Continuing.
2053: 
2053: Program stopped.
2053: 0x70000002 in ?? ()

The program should have run to termination but instead it stopped for no apparent reason. Can you try

rr record <whatever-path>/switch_processes_32
RR_LOG=GdbConnection,GdbServer,ReplayTimeline rr replay -g 1000
run 1
y
c

?

agentzh commented 7 years ago

@rocallahan Do you mean the following command instead?

_RR_TRACE_DIR=/tmp/rr-test-switch_processes_32-HmJShPEzf RR_LOG=GdbConnection,GdbServer,ReplayTimeline rr replay -g 1000

agentzh commented 7 years ago

@rocallahan The output is like this:

https://gist.github.com/agentzh/ad4b6a486599c4e7b001af92338c3b5d

The output of the 2nd step is just too much so I only paste the last half.

agentzh commented 7 years ago

@rocallahan The real gdb output for your 3 commands above is like this:

0xf77666e5 in ?? () from /lib/ld-linux.so.2
Starting program: /tmp/rr-test-switch_processes_32-HmJShPEzf/target_process_32-HmJShPEzf-0/mmap_hardlink_30_write_race_32-HmJShPEzf 1

Program stopped.
0xf77b7be2 in __kernel_vsyscall ()
Continuing.

Program stopped.
0xf77b7be2 in __kernel_vsyscall ()
Detaching from program: /tmp/rr-test-switch_processes_32-HmJShPEzf/target_process_32-HmJShPEzf-0/mmap_hardlink_30_write_race_32-HmJShPEzf, process 15832

rocallahan commented 7 years ago

What's at address 0xf77c7b40 in that replay?

rocallahan commented 7 years ago

I wonder why there's no GdbServer output in that log...

rocallahan commented 7 years ago

Or ReplayTimeline ... hmm

agentzh commented 7 years ago

@rocallahan Right after running the command

_RR_TRACE_DIR=/tmp/rr-test-switch_processes_32-HmJShPEzf RR_LOG=GdbConnection,GdbServer,ReplayTimeline rr replay -g 1000

I ran the p *0xf77c7b40 command:

(rr) p *0xf77c7b40
[GdbConnection] raw request mf77c7b40,4
[GdbConnection] gdb requests memory (addr=0xf77c7b40, len=4)
[GdbConnection] write_flush: '$E01#a6'
Cannot access memory at address 0xf77c7b40

Is it what you requested?

Thanks for your time and help!

agentzh commented 7 years ago

@rocallahan Tried reading 1 byte, still the same thing:

(rr) p *(unsigned char*)0xf77c7b40
[GdbConnection] raw request mf77c7b40,1
[GdbConnection] gdb requests memory (addr=0xf77c7b40, len=1)
[GdbConnection] write_flush: '$E01#a6'
Cannot access memory at address 0xf77c7b40

agentzh commented 7 years ago

@rocallahan Right, I can confirm that there is ZERO GdbServer or ReplayTimeline logs when I remove the GdbConnection tag from the RR_LOG environment value.

agentzh commented 7 years ago

@rocallahan Should I build a custom version of gdb from its official source myself?

rocallahan commented 7 years ago

No.

I just committed https://github.com/mozilla/rr/commit/a690b026c131b1bb5cef21f404765ade155f6c49 for some more logging; please retry with that and RR_LOG=GdbConnection,GdbServer.

agentzh commented 7 years ago

@rocallahan Git pulled, built, and installed the latest rr from its master. And rerun the example above like this:

https://gist.github.com/agentzh/6a5d8962703fd59bebdfbca2ba51dda6

Hopefully it's helpful.

agentzh commented 7 years ago

@rocallahan Do you need shell access to that box? I can send you login details to your email if you want to. I gather it would be easier for you. My email address is agentzh at gmail dot com.

rocallahan commented 7 years ago

So I just pulled 3640251c9b3ee0949e69954cf6e7ff549132e5f7 and ran the tests exactly the same way you did and they passed:

roc@nuc:~/rr/obj$ ctest --verbose --tests-regex fork_exec_info_thr-32
UpdateCTestConfiguration  from :/home/roc/rr/obj/DartConfiguration.tcl
UpdateCTestConfiguration  from :/home/roc/rr/obj/DartConfiguration.tcl
Test project /home/roc/rr/obj
Constructing a list of tests
Done constructing a list of tests
Checking test dependency graph...
Checking test dependency graph end
test 1967
    Start 1967: fork_exec_info_thr-32

1967: Test command: /bin/bash "/home/roc/rr/rr/src/test/fork_exec_info_thr.run" "fork_exec_info_thr_32" "-b" "/home/roc/rr/obj"
1967: Test timeout computed to be: 1000
1967: Targeting recorded pid 6425 at event 338 ...
1967: Test 'fork_exec_info_thr_32' PASSED
1967: Test 'fork_exec_info_thr_32' PASSED
1/2 Test #1967: fork_exec_info_thr-32 .................   Passed    1.89 sec
test 1968
    Start 1968: fork_exec_info_thr-32-no-syscallbuf

1968: Test command: /bin/bash "/home/roc/rr/rr/src/test/fork_exec_info_thr.run" "fork_exec_info_thr_32" "-n" "/home/roc/rr/obj"
1968: Test timeout computed to be: 1000
1968: Targeting recorded pid 6504 at event 305 ...
1968: Test 'fork_exec_info_thr_32' PASSED
1968: Test 'fork_exec_info_thr_32' PASSED
2/2 Test #1968: fork_exec_info_thr-32-no-syscallbuf ...   Passed    1.80 sec

The following tests passed:
    fork_exec_info_thr-32
    fork_exec_info_thr-32-no-syscallbuf

100% tests passed, 0 tests failed out of 2

Total Test time (real) =   3.73 sec
roc@nuc:~/rr/obj$ ctest --verbose --tests-regex switch_processes-32
UpdateCTestConfiguration  from :/home/roc/rr/obj/DartConfiguration.tcl
UpdateCTestConfiguration  from :/home/roc/rr/obj/DartConfiguration.tcl
Test project /home/roc/rr/obj
Constructing a list of tests
Done constructing a list of tests
Checking test dependency graph...
Checking test dependency graph end
test 2053
    Start 2053: switch_processes-32

2053: Test command: /bin/bash "/home/roc/rr/rr/src/test/switch_processes.run" "switch_processes_32" "-b" "/home/roc/rr/obj"
2053: Test timeout computed to be: 1000
2053: Test 'switch_processes_32' PASSED
1/2 Test #2053: switch_processes-32 .................   Passed    1.88 sec
test 2054
    Start 2054: switch_processes-32-no-syscallbuf

2054: Test command: /bin/bash "/home/roc/rr/rr/src/test/switch_processes.run" "switch_processes_32" "-n" "/home/roc/rr/obj"
2054: Test timeout computed to be: 1000
2054: Test 'switch_processes_32' PASSED
2/2 Test #2054: switch_processes-32-no-syscallbuf ...   Passed    1.77 sec

The following tests passed:
    switch_processes-32
    switch_processes-32-no-syscallbuf

100% tests passed, 0 tests failed out of 2

Total Test time (real) =   3.69 sec

rocallahan commented 7 years ago

You can su to my account and try them for yourself :-).

agentzh commented 7 years ago

@rocallahan Okay, I've tracked it down to be the following environment set in my own ~/.bashrc file:

export LD_LIBRARY_PATH=/opt/kt/lib:/opt/kc/lib:/opt/dyninst/lib:/opt/dwarf/lib:/usr/lib:/usr/local/lib:/lib

Removing this environment makes the test pass. I wonder why this env plays such a dramatic role here? I've added it to your account's ~/.bashrc (the last line), and now it is also reproducible in your account :)

Thanks for the hint!

agentzh commented 7 years ago

@rocallahan Okay, the minimal env that can make those tests fail is actually like this:

export LD_LIBRARY_PATH=/usr/lib:/usr/local/lib:/lib

And the order of the paths in the env value does not seem to matter at all. But removing any of these paths does make the issue go away. Weird.

Will you please shed some light on this? Thanks!

agentzh commented 7 years ago

@rocallahan BTW, with the minimal env, only the fork_exec_info_thr_32 tests are failing.

rocallahan commented 7 years ago

OK I figured out the switch_process_32 failure at least. Normally:

We run target_process_32 write_race_32
target_process_32 forks and execs write_race_32
write_race_32 forks a few more processes
when we replay with -g 1000 the debugger attaches to one of those "few more"
when we restart with run 1 the debugger ignores everything before the start of that process, which means skipping over all execs

But with that library path set, for some reason the dynamic loader does a lot more work so when we replay with -g 1000 the debugger attaches to the initial write_race_32 process. Then when we restart with run 1 the debugger goes back to the start of that process, which does do an exec, which triggers the mysterious debugger stop.

agentzh commented 7 years ago

@rocallahan Ah, thanks for your detailed analysis! That's interesting. Should we avoid LD_LIBRARY_PATH env altogether in everyday rr debugging practices?

agentzh commented 7 years ago

@rocallahan Or is it something fixable inside rr?

rocallahan commented 7 years ago

I'll just fix the test.

rocallahan commented 7 years ago

The fork_exec_info_thr_32 failure is related. The test script tries to pull at most three digits of an event number, but because of the extra dynamic loader work, the actual event number is four digits, so we target the wrong event and the test falls over.

rocallahan commented 7 years ago

Fixes pushed.

rocallahan commented 7 years ago

FWIW the tests will run faster without the LD_LIBRARY_PATH because of the extra work it creates for every process start under rr.

rocallahan commented 7 years ago

(And no I don't know why setting LD_LIBRARY_PATH makes the dynamic loader do so many extra system calls, but I assume there's nothing rr can do about it.)

agentzh commented 7 years ago

@rocallahan Got it. Fair enough. Thank you so much for your help!

agentzh commented 7 years ago

Just for the record, I've just done a full test suite run at the latest master (commit eaa476fedd) and all tests are passing in my Ubuntu atop NUC:

100% tests passed, 0 tests failed out of 2068

Total Test time (real) = 2673.13 sec

Without the LD_LIBRARY_PATH variable, it indeed runs much faster as well :) Yay!

junglefish8086 commented 7 years ago

How can you solve this problem?

rocallahan commented 7 years ago

What problem?

rr-debugger / rr

4 test failures when running the rr test suite in NUC Skull Canyon due to LD_LIBRARY_PATH env #2028