libMesh / libmesh

libMesh github repository
http://libmesh.github.io
GNU Lesser General Public License v2.1

Debug with gdb: can't debug when run program by mpirun -np x ./xx #2324

Open wgm096350 opened 4 years ago

wgm096350 commented 4 years ago

I used systems_of_equations_ex5.C as the test example. To debug multiple processes, I inserted this additional code snippet right after LibMeshInit init (argc, argv);.
int gdb_break = 1; while(gdb_break) {};
This snippet is a loop whose condition is always true, so every process stops here and we can attach a gdb to each one. This is a common workaround when debugging MPI programs on Linux. After building systems_of_equations_ex5, I ran it with mpirun -np 1 --allow-run-as-root ./systems_of_equations_ex5 -ksp_type cg, and it did get stuck at the while loop. Then I ran gdb in another terminal. In gdb I used !ps -all |grep systems_of to find the process spawned by mpirun, say 10056. Then, still in gdb, I used attach 10056 to attach to the process. But here is the problem: gdb can't reach the while loop; instead it is waiting for something, as shown in this picture. [image: annotation 2019-11-08 213730]

I found this problem when using mpirun -n 4, but it also occurs with -n 1. I would appreciate any suggestions on debugging with gdb and look forward to your replies.
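
The attach step described above can be sketched as follows. This is a minimal sketch under stated assumptions, not a fix for the hang: list_candidates and attach_cmd are hypothetical helper names, and it assumes a ps that accepts -o pid,ppid,args. Printing the parent PID and the full command line makes it easier to tell the mpirun launcher (and any wrapper script) apart from the rank process itself before attaching.

```shell
#!/bin/sh
# Sketch: find the process to attach to and print the gdb attach command.
# list_candidates and attach_cmd are hypothetical helper names.

# List PID, parent PID and full command line of every process whose
# command line contains the given text. The trailing "grep -v grep"
# drops the grep processes of this pipeline from the listing.
list_candidates() {
  ps -eo pid,ppid,args | grep -F -- "$1" | grep -v grep
}

# Print (do not run) the command that attaches gdb to one PID.
# "gdb -p PID" is equivalent to starting gdb and typing "attach PID".
attach_cmd() {
  echo "gdb -p $1"
}

# Example, run by hand:
#   list_candidates systems_of_equations_ex5
#   $(attach_cmd 10056)
```

In the listing, the line whose parent PID equals the PID of mpirun is the rank; the mpirun line itself is the launcher.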

roystgnr commented 4 years ago

I don't know how to fix your problem ... though I am curious about it. Is it possible your MPI implementation spawns some kind of helper process that it gives the same name? But even so, "The first thing GDB does after arranging to debug the specified process is to stop it" and that should be happening even if it's the wrong process somehow. Is it possible the process you attached to is a libtool script rather than its underlying executable? But at least my version of libtool eventually does an "exec", it doesn't fork off a child process... and either way there still should be something getting stopped, right? What happens if you Ctrl-C? Does it stop? Can you get a stack trace if it does?

Anyway, I do have a suggested workaround: you can always run programs through gdb rather than attaching it later. E.g. for small-scale parallel debugging problems I tend to use LIBMESH_RUN='mpirunxterm -np 2 libtool --mode=execute gdb --args' - the "mpirunxterm" bit is a short script to give each process its own window:

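#!/bin/bash
# Usage: mpirunxterm -np N command [args...]

numproc=$2   # $1 is the literal "-np"

shift 2

# One xterm per process; each window waits for Enter after the command ends.
exec mpirun -np "$numproc" xterm -e bash -c "$* && read -p COMPLETED finalinput || read -p FAILED finalinput; sleep 20"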

The libtool command makes sure that gdb gets applied to the underlying binary and not to any scripts if I'm working on pre-installed programs. The LIBMESH_RUN environment variable gets used by internal libMesh "make check" etc. but if you're running something directly you can just prepend that to your command yourself.
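
To illustrate the "prepend that to your command yourself" step, here is a small sketch that only prints the resulting command line and runs nothing. show_launch is a hypothetical helper name; the program and its arguments are taken from the original post.

```shell
#!/bin/sh
# Sketch: how a launcher prefix such as LIBMESH_RUN composes with a command.
# show_launch is a hypothetical helper; it only prints, it runs nothing.

LIBMESH_RUN='mpirunxterm -np 2 libtool --mode=execute gdb --args'

show_launch() {
  # $LIBMESH_RUN is intentionally unquoted so that it splits into words,
  # exactly as it would when prepended to a command by hand.
  echo $LIBMESH_RUN "$@"
}

show_launch ./systems_of_equations_ex5 -ksp_type cg
# prints: mpirunxterm -np 2 libtool --mode=execute gdb --args ./systems_of_equations_ex5 -ksp_type cg
```

Running that printed line for real requires the mpirunxterm script to be on PATH.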

wgm096350 commented 4 years ago

Alas, Ctrl+C is not working.

wgm096350 commented 4 years ago

Using Ctrl+C in gdb is futile, but in the terminal that ran the command mpirun -np 1 --allow-run-as-root ./systems_of_equations_ex5 -ksp_type cg, Ctrl+C stops the program. [image: annotation 2019-11-08 233104] [image: annotation 2019-11-08 233642]

wgm096350 commented 4 years ago

I have tried out a workaround. Run mpiexec -np 2 xterm -e gdb ./systems_of_equations_ex5. This opens 2 xterm windows, each running gdb. In each window, execute b 86 to set a breakpoint at while(gdb_break) {};. Then execute r -ksp_type cg in each xterm window. This way every process stops at while(gdb_break) {}; and other gdb debugging commands can be used. [image: annotation 2019-11-09 081050] [image: annotation 2019-11-09 081201]
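
The typing in each window could probably be avoided by passing the breakpoint and the run command on the gdb command line. The following is a sketch, not something tested in this thread: gdb_per_rank_cmd is a hypothetical helper that only prints the launch command, using gdb's -ex and --args options; line 86 and the program name come from the steps above.

```shell
#!/bin/sh
# Sketch: build the launch line for "one xterm + gdb per process", with the
# breakpoint set and the program started automatically.
# gdb_per_rank_cmd is a hypothetical helper; it prints the command only.
#   $1 = number of processes, $2 = breakpoint location, rest = program + args
gdb_per_rank_cmd() {
  nproc=$1
  bp=$2
  shift 2
  # -ex runs one gdb command at startup; --args passes the remaining
  # words to the debugged program instead of to gdb.
  echo "mpiexec -np $nproc xterm -e gdb -ex 'break $bp' -ex run --args $*"
}

gdb_per_rank_cmd 2 86 ./systems_of_equations_ex5 -ksp_type cg
# prints: mpiexec -np 2 xterm -e gdb -ex 'break 86' -ex run --args ./systems_of_equations_ex5 -ksp_type cg
```

Pasting the printed line into a shell opens the windows; each gdb then stops at the breakpoint without any interactive input.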

permcody commented 4 years ago

We have a little write-up on parallel debugging here. We've built support right into MOOSE for launching the debugger on multiple processes, which is a little more straightforward than what you are trying here. Take a look and see if this gets you past your current issue: https://www.mooseframework.org/application_development/debugging.html

bzjan commented 3 years ago

If you want to reduce the amount of interactive typing necessary and avoid catching the debugger manually, you could run the following command:

(xterm -e 'gdb --batch --command=gdb_script.gdb --args ./segfaulting_program > gdb_output_$OMPI_COMM_WORLD_RANK.log 2>&1' &); for id in $(xdotool search --sync --class xterm); do xdotool windowminimize $id; done

together with a suitable gdb_script.gdb, e.g. here is a minimal version:

set width 0
set height 0
set verbose off
set breakpoint pending on

# run to main (breakpoint 1)
start

# breakpoints

break silly_function
commands
  printf "<gdb> no segfault yet\n"
  continue
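end

# resume from main; in batch mode gdb exits as soon as the script ends
continue

# if a signal stopped the program, record where
backtrace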

This will create a log file for each processor that you can use to readily track down your problem without staring at n parallel gdb terminals :)
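
Once the logs exist, a quick summary per rank can be produced along these lines. This is a sketch with assumptions: summarize_gdb_logs is a hypothetical helper, the file names follow the gdb_output_<rank>.log pattern from the command above, and a crash is assumed to show up in the log as gdb's usual "Program received signal SIGSEGV" line.

```shell
#!/bin/sh
# Sketch: report, for each per-rank gdb log, whether a SIGSEGV was recorded.
# summarize_gdb_logs is a hypothetical helper name.
#   $1 = directory holding gdb_output_<rank>.log files (default: current dir)
summarize_gdb_logs() {
  dir=${1:-.}
  for log in "$dir"/gdb_output_*.log; do
    [ -e "$log" ] || continue        # no logs: the glob stayed literal
    rank=${log##*gdb_output_}        # strip everything up to the rank
    rank=${rank%.log}                # strip the extension
    if grep -q 'SIGSEGV' "$log"; then
      echo "rank $rank: SIGSEGV"
    else
      echo "rank $rank: no segfault recorded"
    fi
  done
}

# Example, run by hand in the directory with the logs:
#   summarize_gdb_logs .
```

Only the logs flagged with SIGSEGV then need to be read in full.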