Closed agentzh closed 7 years ago
These probably aren't real problems.
More info about the kernel:
agentzh@nuc ~/git/rr/obj $ uname -a
Linux nuc 4.4.0-77-generic #98-Ubuntu SMP Wed Apr 26 08:34:02 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
@rocallahan OK, thanks!
This is the weird bit in switch_processes_32:
2053: (rr) c
2053: Continuing.
2053:
2053: Program stopped.
2053: 0x70000002 in ?? ()
The program should have run to termination but instead it stopped for no apparent reason. Can you try the following?
rr record <whatever-path>/switch_processes_32
RR_LOG=GdbConnection,GdbServer,ReplayTimeline rr replay -g 1000
run 1
y
c
@rocallahan Do you mean the following command instead?
_RR_TRACE_DIR=/tmp/rr-test-switch_processes_32-HmJShPEzf RR_LOG=GdbConnection,GdbServer,ReplayTimeline rr replay -g 1000
@rocallahan The output is like this:
https://gist.github.com/agentzh/ad4b6a486599c4e7b001af92338c3b5d
The 2nd step produces too much output, so I only pasted the last half.
@rocallahan The real gdb output for your 3 commands above is like this:
0xf77666e5 in ?? () from /lib/ld-linux.so.2
Starting program: /tmp/rr-test-switch_processes_32-HmJShPEzf/target_process_32-HmJShPEzf-0/mmap_hardlink_30_write_race_32-HmJShPEzf 1
Program stopped.
0xf77b7be2 in __kernel_vsyscall ()
Continuing.
Program stopped.
0xf77b7be2 in __kernel_vsyscall ()
Detaching from program: /tmp/rr-test-switch_processes_32-HmJShPEzf/target_process_32-HmJShPEzf-0/mmap_hardlink_30_write_race_32-HmJShPEzf, process 15832
What's at address 0xf77c7b40 in that replay?
I wonder why there's no GdbServer output in that log... Or ReplayTimeline... hmm
@rocallahan Right after running the command
_RR_TRACE_DIR=/tmp/rr-test-switch_processes_32-HmJShPEzf RR_LOG=GdbConnection,GdbServer,ReplayTimeline rr replay -g 1000
I ran the p *0xf77c7b40 command:
(rr) p *0xf77c7b40
[GdbConnection] raw request mf77c7b40,4
[GdbConnection] gdb requests memory (addr=0xf77c7b40, len=4)
[GdbConnection] write_flush: '$E01#a6'
Cannot access memory at address 0xf77c7b40
Is this what you requested?
Thanks for your time and help!
@rocallahan Tried reading 1 byte, still the same thing:
(rr) p *(unsigned char*)0xf77c7b40
[GdbConnection] raw request mf77c7b40,1
[GdbConnection] gdb requests memory (addr=0xf77c7b40, len=1)
[GdbConnection] write_flush: '$E01#a6'
Cannot access memory at address 0xf77c7b40
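For readers unfamiliar with the GdbConnection lines above: they follow the GDB Remote Serial Protocol, where `m<addr>,<len>` is a memory read request (both fields in hex) and a reply is framed as `$<payload>#<checksum>`, with the checksum being the sum of the payload bytes modulo 256. The sketch below decodes the request and rebuilds the `E01` error reply from the log. The function names are made up for illustration and are not rr's own code.

```python
def rsp_checksum(payload: str) -> str:
    """Two-hex-digit checksum: sum of payload bytes modulo 256."""
    return format(sum(payload.encode("ascii")) % 256, "02x")


def frame(payload: str) -> str:
    """Wrap a payload as '$<payload>#<checksum>'."""
    return f"${payload}#{rsp_checksum(payload)}"


def parse_read_memory(request: str) -> tuple[int, int]:
    """Split an 'm<addr>,<len>' request into (address, length)."""
    if not request.startswith("m"):
        raise ValueError("not a memory read request")
    addr, length = request[1:].split(",")
    return int(addr, 16), int(length, 16)


addr, length = parse_read_memory("mf77c7b40,4")
print(hex(addr), length)   # the address and length gdb asked for
print(frame("E01"))        # the error reply seen in the log
```

An `E` reply with a two-digit code is the protocol's generic error response, which gdb reports here as "Cannot access memory".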
@rocallahan Right, I can confirm that there are ZERO GdbServer or ReplayTimeline logs when I remove the GdbConnection tag from the RR_LOG environment value.
@rocallahan Should I build a custom version of gdb from its official source myself?
No.
I just committed https://github.com/mozilla/rr/commit/a690b026c131b1bb5cef21f404765ade155f6c49 for some more logging; please retry with that and RR_LOG=GdbConnection,GdbServer.
@rocallahan I pulled, built, and installed the latest rr from master, and reran the example above like this:
https://gist.github.com/agentzh/6a5d8962703fd59bebdfbca2ba51dda6
Hopefully it's helpful.
@rocallahan Do you need shell access to that box? I can send you login details to your email if you want to. I gather it would be easier for you. My email address is agentzh at gmail dot com.
So I just pulled 3640251c9b3ee0949e69954cf6e7ff549132e5f7 and ran the tests exactly the same way you did and they passed:
roc@nuc:~/rr/obj$ ctest --verbose --tests-regex fork_exec_info_thr-32
UpdateCTestConfiguration from :/home/roc/rr/obj/DartConfiguration.tcl
UpdateCTestConfiguration from :/home/roc/rr/obj/DartConfiguration.tcl
Test project /home/roc/rr/obj
Constructing a list of tests
Done constructing a list of tests
Checking test dependency graph...
Checking test dependency graph end
test 1967
Start 1967: fork_exec_info_thr-32
1967: Test command: /bin/bash "/home/roc/rr/rr/src/test/fork_exec_info_thr.run" "fork_exec_info_thr_32" "-b" "/home/roc/rr/obj"
1967: Test timeout computed to be: 1000
1967: Targeting recorded pid 6425 at event 338 ...
1967: Test 'fork_exec_info_thr_32' PASSED
1967: Test 'fork_exec_info_thr_32' PASSED
1/2 Test #1967: fork_exec_info_thr-32 ................. Passed 1.89 sec
test 1968
Start 1968: fork_exec_info_thr-32-no-syscallbuf
1968: Test command: /bin/bash "/home/roc/rr/rr/src/test/fork_exec_info_thr.run" "fork_exec_info_thr_32" "-n" "/home/roc/rr/obj"
1968: Test timeout computed to be: 1000
1968: Targeting recorded pid 6504 at event 305 ...
1968: Test 'fork_exec_info_thr_32' PASSED
1968: Test 'fork_exec_info_thr_32' PASSED
2/2 Test #1968: fork_exec_info_thr-32-no-syscallbuf ... Passed 1.80 sec
The following tests passed:
fork_exec_info_thr-32
fork_exec_info_thr-32-no-syscallbuf
100% tests passed, 0 tests failed out of 2
Total Test time (real) = 3.73 sec
roc@nuc:~/rr/obj$ ctest --verbose --tests-regex switch_processes-32
UpdateCTestConfiguration from :/home/roc/rr/obj/DartConfiguration.tcl
UpdateCTestConfiguration from :/home/roc/rr/obj/DartConfiguration.tcl
Test project /home/roc/rr/obj
Constructing a list of tests
Done constructing a list of tests
Checking test dependency graph...
Checking test dependency graph end
test 2053
Start 2053: switch_processes-32
2053: Test command: /bin/bash "/home/roc/rr/rr/src/test/switch_processes.run" "switch_processes_32" "-b" "/home/roc/rr/obj"
2053: Test timeout computed to be: 1000
2053: Test 'switch_processes_32' PASSED
1/2 Test #2053: switch_processes-32 ................. Passed 1.88 sec
test 2054
Start 2054: switch_processes-32-no-syscallbuf
2054: Test command: /bin/bash "/home/roc/rr/rr/src/test/switch_processes.run" "switch_processes_32" "-n" "/home/roc/rr/obj"
2054: Test timeout computed to be: 1000
2054: Test 'switch_processes_32' PASSED
2/2 Test #2054: switch_processes-32-no-syscallbuf ... Passed 1.77 sec
The following tests passed:
switch_processes-32
switch_processes-32-no-syscallbuf
100% tests passed, 0 tests failed out of 2
Total Test time (real) = 3.69 sec
You can su to my account and try them for yourself :-).
@rocallahan Okay, I've tracked it down to the following environment variable set in my own ~/.bashrc file:
export LD_LIBRARY_PATH=/opt/kt/lib:/opt/kc/lib:/opt/dyninst/lib:/opt/dwarf/lib:/usr/lib:/usr/local/lib:/lib
Removing this environment variable makes the tests pass. I wonder why it plays such a dramatic role here? I've added it to your account's ~/.bashrc (the last line), and now the failure is also reproducible in your account :)
Thanks for the hint!
@rocallahan Okay, the minimal env that can make those tests fail is actually like this:
export LD_LIBRARY_PATH=/usr/lib:/usr/local/lib:/lib
And the order of the paths in the env value does not seem to matter at all. But removing any of these paths does make the issue go away. Weird.
Will you please shed some light on this? Thanks!
@rocallahan BTW, with the minimal env, only the fork_exec_info_thr_32 tests are failing.
OK I figured out the switch_processes_32 failure at least. Normally:
target_process_32 write_race_32
target_process_32 forks and execs write_race_32
write_race_32 forks a few more processes
-g 1000: the debugger attaches to one of those "few more"
run 1: the debugger ignores everything before the start of that process, which means skipping over all execs
But with that library path set, for some reason the dynamic loader does a lot more work, so when we replay with -g 1000 the debugger attaches to the initial write_race_32 process. Then when we restart with run 1 the debugger goes back to the start of that process, which does do an exec, which triggers the mysterious debugger stop.
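The effect described above, where a fixed event number lands in a different process once earlier processes generate more events, can be illustrated with a toy model. This is only a sketch: it assumes each process contributes one contiguous run of events (real traces interleave), and the event counts are invented, not taken from any actual trace.

```python
def process_at_event(timeline, target_event):
    """Return the process that owns `target_event` in a simplified trace.

    `timeline` is a list of (process name, number of events) pairs,
    in the order the processes run.
    """
    seen = 0
    for name, count in timeline:
        seen += count
        if target_event <= seen:
            return name
    return None  # the trace ended before reaching the target event


# Hypothetical event counts without LD_LIBRARY_PATH set.
normal = [("target_process_32", 300), ("write_race_32", 400), ("forked child", 600)]

# Hypothetical counts when the dynamic loader does extra work at each exec.
extra_loader_work = [("target_process_32", 600), ("write_race_32", 800), ("forked child", 600)]

print(process_at_event(normal, 1000))             # forked child
print(process_at_event(extra_loader_work, 1000))  # write_race_32
```

In this model the same `-g 1000` target moves from a forked child back to the initial write_race_32 process purely because the earlier processes got longer.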
@rocallahan Ah, thanks for your detailed analysis! That's interesting. Should we avoid setting LD_LIBRARY_PATH altogether in everyday rr debugging?
@rocallahan Or is it something fixable inside rr?
I'll just fix the test.
The fork_exec_info_thr_32 failure is related. The test script tries to pull at most three digits of an event number, but because of the extra dynamic loader work, the actual event number is four digits, so we target the wrong event and the test falls over.
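The three-digit truncation described above is easy to reproduce with a bounded regular expression. The sketch below is an illustration only: the sample line and both patterns are assumptions, not the actual contents of the rr test script.

```python
import re

# Hypothetical output line; the real format the test script parses may differ.
line = "pid 6425 at event 1338"


def event_bounded(text):
    """Extract the event number, but capture at most three digits."""
    m = re.search(r"event (\d{1,3})", text)
    return int(m.group(1)) if m else None


def event_unbounded(text):
    """Extract the event number with no limit on the digit count."""
    m = re.search(r"event (\d+)", text)
    return int(m.group(1)) if m else None


print(event_bounded(line))    # 133: the fourth digit is silently dropped
print(event_unbounded(line))  # 1338
```

With a three-digit event number both functions agree, which is why this kind of bug stays hidden until something makes the trace longer.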
Fixes pushed.
FWIW the tests will run faster without the LD_LIBRARY_PATH because of the extra work it creates for every process start under rr.
(And no I don't know why setting LD_LIBRARY_PATH makes the dynamic loader do so many extra system calls, but I assume there's nothing rr can do about it.)
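One way to run the tests without LD_LIBRARY_PATH, while leaving ~/.bashrc untouched, is to launch them with a copy of the environment that has the variable removed. This is a minimal sketch; `env_without` is a hypothetical helper, and the ctest invocation in the comment reuses the command shown earlier in this thread.

```python
import os


def env_without(var, base=None):
    """Return a copy of the environment with `var` removed.

    `base` defaults to the current process environment; the original
    mapping is never modified.
    """
    env = dict(os.environ if base is None else base)
    env.pop(var, None)
    return env


# Example use (not executed here):
#   import subprocess
#   subprocess.run(["ctest", "--tests-regex", "switch_processes-32"],
#                  env=env_without("LD_LIBRARY_PATH"))
```

Copying the environment rather than deleting the variable in place keeps the change local to the child process.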
@rocallahan Got it. Fair enough. Thank you so much for your help!
Just for the record, I've just done a full test suite run on the latest master (commit eaa476fedd) and all tests pass on my NUC running Ubuntu:
100% tests passed, 0 tests failed out of 2068
Total Test time (real) = 2673.13 sec
Without the LD_LIBRARY_PATH variable, it indeed runs much faster as well :) Yay!
How can you solve this problem?
What problem?
I built a fresh rr from the latest git master source today (commit 388856644). Running the whole test suite with
make test
gives 4 failures after more than 1 hour's run:
Then I ran these test cases individually with full output and they were still failing:
https://gist.github.com/agentzh/2befb55b311067bddb321132426f0958
I'm running Ubuntu 16.04 on bare metal:
The CPU info:
I have 16G DDR4 RAM.
GCC version:
GDB version:
Are these real problems? Are they known issues? Do you need any more details from my side for diagnosing it?
Using rr to debug my own programs does seem to work though.