openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

SEGV in ucs_debug_backtrace_next(), upon previous SEGV handling, due to ENOMEM situation #8716

Open bfaccini opened 1 year ago

bfaccini commented 1 year ago

Describe the bug

The application crashed with a SEGV (apparently due to missing or incorrect handling of ENOMEM, i.e. a failed allocation). The UCX signal handler/stack unwinder then crashed with a SEGV as well (again apparently due to missing or incorrect handling of ENOMEM), with the following stack:

Program terminated with signal 11, Segmentation fault.
#0  0x00002b6a9e82c42f in ucs_debug_backtrace_next (bckt=0x0, line=line@entry=0x2b950edea8c0) at debug/debug.c:495
495     {

#0  0x00002b6a9e82c42f in ucs_debug_backtrace_next (bckt=0x0, line=line@entry=0x2b950edea8c0) at debug/debug.c:495
#1  0x00002b6a9e82c6a7 in ucs_debug_print_backtrace (stream=0x2b6a98b3a1c0 <_IO_2_1_stderr_>, strip=strip@entry=2) at debug/debug.c:656
#2  0x00002b6a9e82e945 in ucs_handle_error (message=message@entry=0x2b6a9e8f0d27 "tkill(2) or tgkill(2)") at debug/debug.c:1081
#3  0x00002b6a9e82ed0c in ucs_debug_handle_error_signal (signo=signo@entry=11, cause=0x2b6a9e8f0d27 "tkill(2) or tgkill(2)", fmt=fmt@entry=0x2b6a9e8f0f8b " at address %p") at debug/debug.c:1033
#4  0x00002b6a9e82ef7b in ucs_error_signal_handler (signo=11, info=0x2b950edead30, context=<optimized out>) at debug/debug.c:1055
#5  <signal handler called>
#6  0x00002b6a9810266c in palloc_defer_free_create () from /work2/08126/dbohninx/frontera/BUILDS/daos-11940_pr10837/20221116/daos/install/bin/../prereq/release/pmdk/lib/libpmemobj.so.1
#7  0x00002b6a981090a1 in pmemobj_tx_xfree () from /work2/08126/dbohninx/frontera/BUILDS/daos-11940_pr10837/20221116/daos/install/bin/../prereq/release/pmdk/lib/libpmemobj.so.1
#8  0x00002b6a9698fbae in pmem_tx_free (umm=0x2b6d155ae2b8, umoff=72350928) at src/common/mem.c:92
#9  0x00002b6a971960e2 in dtx_rec_release (cont=cont@entry=0x2b6d156efe20, dae=0x2b6d174b4d60, abort=abort@entry=true) at src/vos/vos_dtx.c:665
#10 0x00002b6a9719d9ea in vos_dtx_abort (coh=..., dti=0x2b6d17f87cf0, epoch=epoch@entry=946528706794618887) at src/vos/vos_dtx.c:2204
#11 0x00002b6aa3654aea in dtx_abort (cont=cont@entry=0x2b6d156efa50, dte=dte@entry=0x2b6d17f87cf0, epoch=946528706794618887) at src/dtx/dtx_rpc.c:896
#12 0x00002b6aa3664382 in dtx_leader_end (dlh=0x2b6d17f87cf0, coh=<optimized out>, result=<optimized out>) at src/dtx/dtx_common.c:1290
#13 0x00002b6aa3d9d114 in ds_obj_rw_handler (rpc=0x2b6d17db1db0) at src/object/srv_obj.c:2732
#14 0x00002b6a96ed1f68 in crt_handle_rpc (arg=0x2b6d17db1db0) at src/cart/crt_rpc.c:1654
#15 0x00002b6a97c8f4aa in ABTD_ythread_func_wrapper () from /work2/08126/dbohninx/frontera/BUILDS/daos-11940_pr10837/20221116/daos/install/bin/../prereq/release/argobots/lib/libabt.so.1
#16 0x00002b6a97c8f651 in make_fcontext () from /work2/08126/dbohninx/frontera/BUILDS/daos-11940_pr10837/20221116/daos/install/bin/../prereq/release/argobots/lib/libabt.so.1 

It looks like ucs_debug_print_backtrace() needs the same error handling for the return value of ucs_debug_backtrace_create() that ucs_log_print_backtrace() already has, for example with the following change:

diff --git a/src/ucs/debug/debug.c b/src/ucs/debug/debug.c
index c51c99ba2..4345eb91c 100644
--- a/src/ucs/debug/debug.c
+++ b/src/ucs/debug/debug.c
@@ -650,8 +650,13 @@ void ucs_debug_print_backtrace(FILE *stream, int strip)
     backtrace_h bckt;
     backtrace_line_h bckt_line;
     int i;
+    ucs_status_t status;
+
+    status = ucs_debug_backtrace_create(&bckt, strip);
+    if (status != UCS_OK) {
+        return;
+    }

-    ucs_debug_backtrace_create(&bckt, strip);
     fprintf(stream, "==== backtrace (tid:%7d) ====\n", ucs_get_tid());
     for (i = 0; ucs_debug_backtrace_next(bckt, &bckt_line); ++i) {
          fprintf(stream, UCS_DEBUG_BACKTRACE_LINE_FMT,
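
To illustrate the failure mode outside of UCX, here is a minimal sketch of the same pattern. All `demo_*` names are hypothetical stand-ins and not UCX APIs, and the assumption that a failed create leaves the handle NULL is only inferred from `bckt=0x0` in frame #0 above. The point is that the iterator dereferences the handle unconditionally, so the caller has to check the status returned by the create call first.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for a backtrace handle; not the UCX type. */
typedef struct {
    int depth; /* number of frames available */
    int pos;   /* next frame to return */
} demo_backtrace_t;

/* Returns 0 on success, -1 on allocation failure (simulated ENOMEM).
 * On failure *bckt is NULL (assumption made for this sketch). */
static int demo_backtrace_create(demo_backtrace_t **bckt, int simulate_enomem)
{
    *bckt = simulate_enomem ? NULL : malloc(sizeof(**bckt));
    if (*bckt == NULL) {
        return -1;
    }
    (*bckt)->depth = 3;
    (*bckt)->pos   = 0;
    return 0;
}

/* Dereferences bckt unconditionally, so a NULL handle would crash here. */
static int demo_backtrace_next(demo_backtrace_t *bckt)
{
    return bckt->pos++ < bckt->depth;
}

/* Returns the number of frames printed, or -1 if no backtrace could be
 * created. The early return mirrors the check added in the diff above. */
static int demo_print_backtrace(FILE *stream, int simulate_enomem)
{
    demo_backtrace_t *bckt;
    int i;

    if (demo_backtrace_create(&bckt, simulate_enomem) != 0) {
        return -1; /* nothing to iterate over, bail out */
    }

    for (i = 0; demo_backtrace_next(bckt); ++i) {
        fprintf(stream, "frame %d\n", i);
    }
    free(bckt);
    return i;
}
```

Calling `demo_print_backtrace(stderr, 1)` takes the early-return path instead of reaching the iterator with a NULL handle, while `demo_print_backtrace(stderr, 0)` prints the three dummy frames.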

bfaccini commented 1 year ago

Let me know if you agree with my analysis and whether you want me to push a PR (I will need some guidance on the usual procedure to follow).

yosefe commented 1 year ago

@bfaccini the proposed solution is good enough, though a better way would probably be to use backtrace_fd. Please see https://github.com/openucx/ucx/wiki/Guidance-for-contributors

bfaccini commented 1 year ago

> though a better way would probably be to use backtrace_fd

backtrace_fd()? Do you mean backtrace_symbols_fd()? If so, I am not sure it would help, because I believe the suspected ENOMEM occurred in ucs_debug_backtrace_create().
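
For reference, a minimal sketch of what a backtrace_symbols_fd()-based dump looks like with the glibc `<execinfo.h>` API. The helper name `demo_dump_backtrace_fd` and the frame limit are hypothetical, and this is not how UCX implements its unwinder. According to the glibc documentation, backtrace_symbols_fd() writes directly to the file descriptor without calling malloc(), although backtrace() itself may still allocate the first time it is called, so this only narrows the window for an ENOMEM failure.

```c
#include <execinfo.h>
#include <unistd.h>

#define DEMO_MAX_FRAMES 64 /* arbitrary limit chosen for this sketch */

/* Writes the current call stack to fd, one frame per line, and returns
 * the number of frames captured. The frame buffer lives on the stack, so
 * this function itself performs no heap allocation. */
static int demo_dump_backtrace_fd(int fd)
{
    void *frames[DEMO_MAX_FRAMES];
    int n;

    n = backtrace(frames, DEMO_MAX_FRAMES);
    backtrace_symbols_fd(frames, n, fd);
    return n;
}
```

A caller would use it as `demo_dump_backtrace_fd(STDERR_FILENO)`. The output contains raw addresses and, where available, symbol names, but no file and line information.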

yosefe commented 1 year ago

OK, so the proposed solution seems good to me.

bfaccini commented 1 year ago

Sorry, it took me some time to get the contributor agreement granted by admin@ucfconsortium.org. I have tried to follow the guidance and have been able to push PR-8741.