uwsampa / grappa

Grappa: scaling irregular applications on commodity clusters
grappa.io
BSD 3-Clause "New" or "Revised" License

Signal 11 error on multiple machines #276

Open TimWin opened 8 years ago

TimWin commented 8 years ago

Hello, I get this error when trying to run any grappa program on multiple machines:

mpirun -hostfile my_hosts applications/demos/hello_world.exe
. . .
I0328 12:19:22.515194 101851 Grappa.cpp:647]
Shared memory breakdown:
  node total:                   125.524 GB
  locale shared heap total:     62.7622 GB
  locale shared heap per core:  62.7622 GB
  communicator per core:        0.125 GB
  tasks per core:               0.0156631 GB
  global heap per core:         15.6905 GB
  aggregator per core:          0.0650177 GB
  shared_pool current per core: 4.76837e-07 GB
  shared_pool max per core:     15.6905 GB
  free per locale:              46.8659 GB
  free per core:                46.8659 GB

Exiting due to signal 11 with siginfo 0x4003f5326870 and payload 0x4003f5326740
I0328 12:19:22.534696 101851 hello_world.cpp:45] Hello world from locale 0 core 0

Primary job terminated normally, but 1 process returned a non-zero exit code.. Per user-direction, the job has been aborted.

mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[29000,1],1]
Exit code: 1

I can successfully execute the programs, e.g. hello_world, on a single machine, but they always crash with that signal 11 error when I try to run them on multiple machines.

What can I do to solve that problem? Please let me know if you need any further information.

Thanks in advance

bmyerz commented 8 years ago

I think more information is needed, starting with where the signal is raised. Try building in Debug mode and running with freeze on error enabled (see https://github.com/uwsampa/grappa/blob/master/doc/debugging.md#debugging).

If the process freezes on the signal, ssh into the node that had the signal and attach gdb to the process, e.g. gdb -p <pid>. You can find the pid of the running grappa process with something like ps aux | grep grappa. From there you can do a backtrace.
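
The attach step could be wrapped in a small shell helper along these lines. This is only a sketch: the function name grappa_bt is made up for illustration and is not part of Grappa, and hello_world.exe is just the demo binary from the report above. It assumes pgrep and gdb are installed on the node where the process is frozen.

```shell
# Hypothetical helper (not part of Grappa): print a backtrace of every thread
# in the newest process whose command line matches the given pattern.
grappa_bt() {
    pattern=${1:-hello_world.exe}
    # -n: newest matching process, -f: match against the full command line
    pid=$(pgrep -n -f "$pattern") || {
        echo "no process matching '$pattern'" >&2
        return 1
    }
    # -p attaches to the PID; -batch runs the -ex commands, then detaches and exits
    gdb -p "$pid" -batch -ex "thread apply all bt"
}
```

Run it as grappa_bt hello_world.exe on the node in question. If gdb answers with "ptrace: Operation not permitted", the attach was refused by the system; running gdb as the user that owns the process (or as root) is the usual remedy.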

If the process doesn't freeze on the signal then you can have mpirun launch the processes through gdb. (see the #2 answer to question 6 on https://www.open-mpi.org/faq/?category=debugging)
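
A minimal sketch of that launch style, assuming an X display that every node can reach and xterm installed on the nodes. The wrapper name mpirun_xterm_gdb is hypothetical; the underlying pattern, mpirun starting one xterm with gdb per process, is the one the Open MPI FAQ describes.

```shell
# Hypothetical wrapper (illustrative name): start every MPI process inside
# its own xterm running gdb, so each rank can be debugged interactively.
mpirun_xterm_gdb() {
    # Usage: mpirun_xterm_gdb <hostfile> <program> [program args...]
    hostfile=$1
    shift
    # --args tells gdb that everything after it is the program and its arguments
    mpirun -hostfile "$hostfile" xterm -e gdb --args "$@"
}

# Example with the names used in this issue:
# mpirun_xterm_gdb my_hosts applications/demos/hello_world.exe
```

In each xterm, type run to start that rank; when the signal arrives, bt prints the call chain for the rank that crashed.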

jeffhammond commented 8 years ago

Here is a stack trace:

[jrhammon@esgmonster prk-repo]$ mpirun -n 1 gdb GRAPPA/Transpose/transpose 10 3600 32
Excess command line arguments ignored. (3600 ...)
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-90.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/jrhammon/Work/INTEL/PCL/ESG/PRK/github-official/GRAPPA/Transpose/transpose...done.
Attaching to program: /home/jrhammon/Work/INTEL/PCL/ESG/PRK/github-official/GRAPPA/Transpose/transpose, process 10
ptrace: Operation not permitted.
/home/jrhammon/Work/INTEL/PCL/ESG/PRK/github-official/10: No such file or directory.
(gdb) run 10 1000 32
Starting program: /home/jrhammon/Work/INTEL/PCL/ESG/PRK/github-official/GRAPPA/Transpose/transpose 10 1000 32
[Thread debugging using libthread_db enabled]
warning: File "/opt/gcc/5.3.0/lib64/libstdc++.so.6.0.21-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "/usr/share/gdb/auto-load:/usr/lib/debug:/usr/bin/mono-gdb.py".
To enable execution of this file add
    add-auto-load-safe-path /opt/gcc/5.3.0/lib64/libstdc++.so.6.0.21-gdb.py
line to your configuration file "/home/jrhammon/.gdbinit".
To completely disable this security protection add
    set auto-load safe-path /
line to your configuration file "/home/jrhammon/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
    info "(gdb)Auto-loading safe path"
I0704 15:54:00.408108 110487 Allocator.hpp:185] Allocator is responsible for addresses from 0 to 0x1f6787000
I0704 15:54:00.408323 110487 GlobalMemory.cpp:67] Initialized GlobalMemory with 8430055424 bytes of shared heap.
I0704 15:54:00.412102 110487 Grappa.cpp:647] 
-------------------------
Shared memory breakdown:
  node total:                   62.8088 GB
  locale shared heap total:     31.4044 GB
  locale shared heap per core:  31.4044 GB
  communicator per core:        0.125 GB
  tasks per core:               0.0156631 GB
  global heap per core:         7.8511 GB
  aggregator per core:          0.00247955 GB
  shared_pool current per core: 4.76837e-07 GB
  shared_pool max per core:     7.8511 GB
  free per locale:              23.4102 GB
  free per core:                23.4102 GB
-------------------------
Parallel Research Kernels version 2.16
Grappa matrix transpose: B = A^T
Parallel Research Kernels version 2.16
Grappa matrix transpose: B = A^T
Number of cores         = 1
Matrix order            = 1000
Number of iterations    = 10
Tile size               = 32
Solution validates
Rate (MB/s): 6500.35 Avg time (s): 0.00246141
Summed errors: 0

Program received signal SIGSEGV, Segmentation fault.
std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::c_str() const () at /tmp/gcc-5.3.0/x86_64-unknown-linux-gnu/libstdc++-v3/include/bits/basic_string.h:1889
    in /tmp/gcc-5.3.0/x86_64-unknown-linux-gnu/libstdc++-v3/include/bits/basic_string.h

The code GDB is trying to point to is:

      // String operations:
      /**
       *  @brief  Return const pointer to null-terminated contents.
       *
       *  This is a handle to internal data.  Do not modify or dire things may
       *  happen.
      */
      const _CharT*
      c_str() const _GLIBCXX_NOEXCEPT
      { return _M_data(); }