Open bedroge opened 11 months ago
Forgot to mention it, but the same version of waLBerla works fine on this system (regardless of UCX_LOG_LEVEL) when using even older versions of the compiler toolchain:
Currently Loaded Modules:
1) GCCcore/10.3.0
2) zlib/1.2.11-GCCcore-10.3.0
3) binutils/2.36.1-GCCcore-10.3.0
4) GCC/10.3.0
5) numactl/2.0.14-GCCcore-10.3.0
6) XZ/5.2.5-GCCcore-10.3.0
7) libxml2/2.9.10-GCCcore-10.3.0
8) libpciaccess/0.16-GCCcore-10.3.0
9) hwloc/2.4.1-GCCcore-10.3.0
10) OpenSSL/1.1
11) libevent/2.1.12-GCCcore-10.3.0
12) UCX/1.10.0-GCCcore-10.3.0
13) libfabric/1.12.1-GCCcore-10.3.0
14) PMIx/3.2.3-GCCcore-10.3.0
15) OpenMPI/4.1.1-GCC-10.3.0
16) OpenBLAS/0.3.15-GCC-10.3.0
17) FlexiBLAS/3.0.4-GCC-10.3.0
18) gompi/2021a
19) FFTW/3.3.9-gompi-2021a
20) ScaLAPACK/2.1.0-gompi-2021a-fb
21) foss/2021a
22) bzip2/1.0.8-GCCcore-10.3.0
23) ncurses/6.2-GCCcore-10.3.0
24) libreadline/8.1-GCCcore-10.3.0
25) Tcl/8.6.11-GCCcore-10.3.0
26) SQLite/3.35.4-GCCcore-10.3.0
27) GMP/6.2.1-GCCcore-10.3.0
28) libffi/3.3-GCCcore-10.3.0
29) Python/3.9.5-GCCcore-10.3.0
30) pybind11/2.6.2-GCCcore-10.3.0
31) gzip/1.10-GCCcore-10.3.0
32) lz4/1.9.3-GCCcore-10.3.0
33) zstd/1.4.9-GCCcore-10.3.0
34) ICU/69.1-GCCcore-10.3.0
35) Boost.MPI/1.76.0-gompi-2021a
36) SciPy-bundle/2021.05-foss-2021a
@bedroge can you pls attach to the hanging process with gdb and post the backtrace of the hang (gdb command is "thread apply all backtrace")
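The request above can also be scripted so that the backtrace is captured without an interactive gdb session. The following is a minimal sketch, assuming gdb is installed and the user is allowed to ptrace the target process; the helper names `gdb_backtrace_command` and `dump_backtrace` are made up for this illustration.

```python
import subprocess


def gdb_backtrace_command(pid: int) -> list:
    """Build a gdb command line that dumps the backtraces of all threads of `pid`."""
    return [
        "gdb",
        "-p", str(pid),               # attach to the already running (hanging) process
        "-batch",                     # execute the -ex commands, then exit
        "-ex", "set pagination off",  # never stop to ask for <RET> in long traces
        "-ex", "thread apply all backtrace",
    ]


def dump_backtrace(pid: int, timeout: float = 60.0) -> str:
    """Run gdb against `pid` and return whatever it printed on stdout."""
    result = subprocess.run(
        gdb_backtrace_command(pid),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,
    )
    return result.stdout
```

Calling `dump_backtrace(pid)` once for the hanging `python` process and once for `mpirun` would give one trace per process, ready to paste into the issue.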
Sure! Here it is (for the python process):
(gdb) thread apply all backtrace
Thread 4 (Thread 0x7f5467989700 (LWP 903569)):
#0 0x00007f547e24775d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007f547e240b44 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2 0x00007f547efcd19f in tls_get_addr_tail.isra () from /lib64/ld-linux-x86-64.so.2
#3 0x00007f547efd3cdc in __tls_get_addr () from /lib64/ld-linux-x86-64.so.2
#4 0x00007f546c27fe8a in ucs_log_set_thread_name (format=format@entry=0x7f546c2930a8 "a") at /tmp/boegelbot/UCX/1.13.1/GCCcore-12.2.0/ucx-1.13.1/src/ucs/log.c:576
#5 0x00007f546c26f85e in ucs_async_thread_func (arg=0xf64c90) at /tmp/boegelbot/UCX/1.13.1/GCCcore-12.2.0/ucx-1.13.1/src/ucs/thread.c:108
#6 0x00007f547e23e17a in start_thread () from /lib64/libpthread.so.0
#7 0x00007f547dd69dc3 in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7f546ea21700 (LWP 903566)):
#0 0x00007f547dd6a0f7 in epoll_wait () from /lib64/libc.so.6
#1 0x00007f546f4a54b3 in epoll_dispatch () from /project/boegelbot/Rocky8/haswell/software/libevent/2.1.12-GCCcore-12.2.0/lib/libevent_core-2.1.so.7
#2 0x00007f546f49bc95 in event_base_loop () from /project/boegelbot/Rocky8/haswell/software/libevent/2.1.12-GCCcore-12.2.0/lib/libevent_core-2.1.so.7
#3 0x00007f546ead67e1 in progress_engine () from /project/boegelbot/Rocky8/haswell/software/PMIx/4.2.2-GCCcore-12.2.0/lib/libpmix.so.2
#4 0x00007f547e23e17a in start_thread () from /lib64/libpthread.so.0
#5 0x00007f547dd69dc3 in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f546f44b700 (LWP 903565)):
#0 0x00007f547dd5ea41 in poll () from /lib64/libc.so.6
#1 0x00007f546f4a4825 in poll_dispatch () from /project/boegelbot/Rocky8/haswell/software/libevent/2.1.12-GCCcore-12.2.0/lib/libevent_core-2.1.so.7
#2 0x00007f546f49bc95 in event_base_loop () from /project/boegelbot/Rocky8/haswell/software/libevent/2.1.12-GCCcore-12.2.0/lib/libevent_core-2.1.so.7
#3 0x00007f546f8f854e in progress_engine () from /project/boegelbot/Rocky8/haswell/software/OpenMPI/4.1.4-GCC-12.2.0/lib/libopen-pal.so.40
#4 0x00007f547e23e17a in start_thread () from /lib64/libpthread.so.0
#5 0x00007f547dd69dc3 in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f547f1d7740 (LWP 903564)):
#0 0x00007f547e23f66d in __pthread_timedjoin_ex () from /lib64/libpthread.so.0
#1 0x00007f546c26f579 in ucs_async_thread_stop () at /tmp/boegelbot/UCX/1.13.1/GCCcore-12.2.0/ucx-1.13.1/src/ucs/thread.c:257
#2 0x00007f546c26f7de in ucs_async_thread_remove_event_fd (async=<optimized out>, event_fd=<optimized out>) at /tmp/boegelbot/UCX/1.13.1/GCCcore-12.2.0/ucx-1.13.1/src/ucs/thread.c:353
#3 0x00007f546c26d595 in ucs_async_remove_handler (id=<optimized out>, is_sync=is_sync@entry=1) at /tmp/boegelbot/UCX/1.13.1/GCCcore-12.2.0/ucx-1.13.1/src/ucs/async.c:567
#4 0x00007f546c28239a in ucs_rcache_global_list_remove (rcache=0xbe3a70) at /tmp/boegelbot/UCX/1.13.1/GCCcore-12.2.0/ucx-1.13.1/src/ucs/rcache.c:1193
#5 0x00007f546c2830eb in ucs_rcache_t_cleanup (self=0xbe3a70) at /tmp/boegelbot/UCX/1.13.1/GCCcore-12.2.0/ucx-1.13.1/src/ucs/rcache.c:1331
#6 0x00007f546c28dcae in ucs_class_call_cleanup_chain (cls=cls@entry=0x7f546c2a8620 <ucs_rcache_t_class>, obj=obj@entry=0xbe3a70, limit=limit@entry=-1)
at /tmp/boegelbot/UCX/1.13.1/GCCcore-12.2.0/ucx-1.13.1/src/ucs/class.c:56
#7 0x00007f546c2841f8 in ucs_rcache_destroy (self=0xbe3a70) at /tmp/boegelbot/UCX/1.13.1/GCCcore-12.2.0/ucx-1.13.1/src/ucs/rcache.c:1358
#8 0x00007f546c31add1 in ucp_mem_rcache_cleanup (context=<optimized out>) at /tmp/boegelbot/UCX/1.13.1/GCCcore-12.2.0/ucx-1.13.1/src/ucp/ucp_mm.c:1048
#9 0x00007f546c307afb in ucp_cleanup (context=0xf5e630) at /tmp/boegelbot/UCX/1.13.1/GCCcore-12.2.0/ucx-1.13.1/src/ucp/ucp_context.c:1938
#10 0x00007f546c3a3265 in mca_pml_ucx_close () from /project/boegelbot/Rocky8/haswell/software/OpenMPI/4.1.4-GCC-12.2.0/lib/openmpi/mca_pml_ucx.so
#11 0x00007f546c3a5719 in mca_pml_ucx_component_close () from /project/boegelbot/Rocky8/haswell/software/OpenMPI/4.1.4-GCC-12.2.0/lib/openmpi/mca_pml_ucx.so
#12 0x00007f546f9148d9 in mca_base_component_close () from /project/boegelbot/Rocky8/haswell/software/OpenMPI/4.1.4-GCC-12.2.0/lib/libopen-pal.so.40
#13 0x00007f546f914965 in mca_base_components_close () from /project/boegelbot/Rocky8/haswell/software/OpenMPI/4.1.4-GCC-12.2.0/lib/libopen-pal.so.40
#14 0x00007f546fc64de4 in mca_pml_base_select () from /project/boegelbot/Rocky8/haswell/software/OpenMPI/4.1.4-GCC-12.2.0/lib/libmpi.so.40
#15 0x00007f546fc70da8 in ompi_mpi_init () from /project/boegelbot/Rocky8/haswell/software/OpenMPI/4.1.4-GCC-12.2.0/lib/libmpi.so.40
#16 0x00007f546fc14c04 in PMPI_Init () from /project/boegelbot/Rocky8/haswell/software/OpenMPI/4.1.4-GCC-12.2.0/lib/libmpi.so.40
#17 0x00007f5470584fa3 in walberla::mpi::MPIManager::initializeMPI(int*, char***, bool) ()
from /home/bedroge/easybuildinstall/software/waLBerla/6.1-foss-2022b/pythonmodule/waLBerla/walberla_cpp.cpython-310-x86_64-linux-gnu.so
#18 0x00007f5470834bdb in walberla::python_coupling::initWalberlaForPythonModule() ()
from /home/bedroge/easybuildinstall/software/waLBerla/6.1-foss-2022b/pythonmodule/waLBerla/walberla_cpp.cpython-310-x86_64-linux-gnu.so
#19 0x00007f54703de732 in InitObject::InitObject() ()
from /home/bedroge/easybuildinstall/software/waLBerla/6.1-foss-2022b/pythonmodule/waLBerla/walberla_cpp.cpython-310-x86_64-linux-gnu.so
#20 0x00007f547038663a in _GLOBAL__sub_I_PythonModule.cpp ()
from /home/bedroge/easybuildinstall/software/waLBerla/6.1-foss-2022b/pythonmodule/waLBerla/walberla_cpp.cpython-310-x86_64-linux-gnu.so
#21 0x00007f547efca8ba in call_init.part () from /lib64/ld-linux-x86-64.so.2
#22 0x00007f547efca9ba in _dl_init () from /lib64/ld-linux-x86-64.so.2
#23 0x00007f547dda530c in _dl_catch_exception () from /lib64/libc.so.6
#24 0x00007f547efcee8e in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#25 0x00007f547dda52b4 in _dl_catch_exception () from /lib64/libc.so.6
#26 0x00007f547efce6b1 in _dl_open () from /lib64/ld-linux-x86-64.so.2
#27 0x00007f547e7d91ea in dlopen_doit () from /lib64/libdl.so.2
#28 0x00007f547dda52b4 in _dl_catch_exception () from /lib64/libc.so.6
#29 0x00007f547dda5373 in _dl_catch_error () from /lib64/libc.so.6
#30 0x00007f547e7d9969 in _dlerror_run () from /lib64/libdl.so.2
#31 0x00007f547e7d928a in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#32 0x00007f547edd9aae in _PyImport_FindSharedFuncptr (prefix=0x7f547ee6b42f "PyInit", shortname=0x7f5470c4b2f0 "walberla_cpp",
pathname=0x7f5470c7f540 "/home/bedroge/easybuildinstall/software/waLBerla/6.1-foss-2022b/pythonmodule/waLBerla/walberla_cpp.cpython-310-x86_64-linux-gnu.so", fp=0x0)
at Modules/transmogrify.h:100
#33 0x00007f547edd89d3 in _PyImport_LoadDynamicModuleWithSpec (fp=<optimized out>, spec=0x7f5470cabbb0) at ./Python/pycore_hashtable.h:137
#34 _imp_create_dynamic_impl (module=<optimized out>, file=<optimized out>, spec=0x7f5470cabbb0) at Objects/pylifecycle.c:2049
#35 _imp_create_dynamic (module=<optimized out>, args=<optimized out>, nargs=<optimized out>) at Modules/fastsearch.h:330
#36 0x00007f547ed4414a in cfunction_vectorcall_FASTCALL (func=0x7f547f1858a0, args=0x7f5470c4bd78, nargsf=<optimized out>, kwnames=<optimized out>) at ./Python/pycore_bitutils.h:430
#37 0x00007f547ed3bcf0 in _PyEval_EvalFrameDefault (tstate=<optimized out>, f=<optimized out>, throwflag=<optimized out>) at Objects/ceval_gil.h:4277
#38 0x00007f547ed379cb in _PyEval_EvalFrame (throwflag=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>,
f=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>,
tstate=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>) at Objects/marshal.c:46
#39 _PyEval_Vector (tstate=<optimized out>, con=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=2, kwnames=<optimized out>) at Objects/ceval_gil.h:5065
#40 0x00007f547ed398b9 in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7f547f1005c8, callable=0x7f547f139510, tstate=0x9823c0)
at /tmp/boegelbot/Python/3.10.8/GCCcore-12.2.0/Python-3.10.8/abstract.c:123
#41 PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7f547f1005c8, callable=0x7f547f139510) at /tmp/boegelbot/Python/3.10.8/GCCcore-12.2.0/Python-3.10.8/abstract.c:123
#42 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, trace_info=0x7fffcecdb640, tstate=<optimized out>) at Objects/ceval_gil.h:5891
#43 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=<optimized out>, throwflag=<optimized out>) at Objects/ceval_gil.h:4181
#44 0x00007f547ed379cb in _PyEval_EvalFrame (throwflag=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>,
f=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>,
tstate=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>) at Objects/marshal.c:46
#45 _PyEval_Vector (tstate=<optimized out>, con=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=2, kwnames=<optimized out>) at Objects/ceval_gil.h:5065
#46 0x00007f547ed38dff in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7f547f190838, callable=0x7f547f1ba950, tstate=0x9823c0)
at /tmp/boegelbot/Python/3.10.8/GCCcore-12.2.0/Python-3.10.8/abstract.c:123
#47 PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7f547f190838, callable=0x7f547f1ba950) at /tmp/boegelbot/Python/3.10.8/GCCcore-12.2.0/Python-3.10.8/abstract.c:123
#48 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, trace_info=0x7fffcecdb8a0, tstate=<optimized out>) at Objects/ceval_gil.h:5891
#49 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=<optimized out>, throwflag=<optimized out>) at Objects/ceval_gil.h:4198
#50 0x00007f547ed379cb in _PyEval_EvalFrame (throwflag=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>,
f=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>,
tstate=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>) at Objects/marshal.c:46
#51 _PyEval_Vector (tstate=<optimized out>, con=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=1, kwnames=<optimized out>) at Objects/ceval_gil.h:5065
#52 0x00007f547ed38ab4 in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=<optimized out>, callable=0x7f547f139ea0, tstate=0x9823c0)
at /tmp/boegelbot/Python/3.10.8/GCCcore-12.2.0/Python-3.10.8/abstract.c:123
#53 PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=<optimized out>, callable=0x7f547f139ea0) at /tmp/boegelbot/Python/3.10.8/GCCcore-12.2.0/Python-3.10.8/abstract.c:123
#54 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, trace_info=0x7fffcecdbb00, tstate=<optimized out>) at Objects/ceval_gil.h:5891
#55 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=<optimized out>, throwflag=<optimized out>) at Objects/ceval_gil.h:4213
#56 0x00007f547ed379cb in _PyEval_EvalFrame (throwflag=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>,
f=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>,
tstate=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>) at Objects/marshal.c:46
#57 _PyEval_Vector (tstate=<optimized out>, con=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=1, kwnames=<optimized out>) at Objects/ceval_gil.h:5065
#58 0x00007f547ed38ab4 in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=<optimized out>, callable=0x7f547f13a0e0, tstate=0x9823c0)
at /tmp/boegelbot/Python/3.10.8/GCCcore-12.2.0/Python-3.10.8/abstract.c:123
#59 PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=<optimized out>, callable=0x7f547f13a0e0) at /tmp/boegelbot/Python/3.10.8/GCCcore-12.2.0/Python-3.10.8/abstract.c:123
#60 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, trace_info=0x7fffcecdbd60, tstate=<optimized out>) at Objects/ceval_gil.h:5891
#61 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=<optimized out>, throwflag=<optimized out>) at Objects/ceval_gil.h:4213
#62 0x00007f547ed379cb in _PyEval_EvalFrame (throwflag=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>,
f=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>,
tstate=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>) at Objects/marshal.c:46
#63 _PyEval_Vector (tstate=<optimized out>, con=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=2, kwnames=<optimized out>) at Objects/ceval_gil.h:5065
#64 0x00007f547ed38ab4 in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=<optimized out>, callable=0x7f547f13b2e0, tstate=0x9823c0)
at /tmp/boegelbot/Python/3.10.8/GCCcore-12.2.0/Python-3.10.8/abstract.c:123
#65 PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=<optimized out>, callable=0x7f547f13b2e0) at /tmp/boegelbot/Python/3.10.8/GCCcore-12.2.0/Python-3.10.8/abstract.c:123
#66 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, trace_info=0x7fffcecdbfc0, tstate=<optimized out>) at Objects/ceval_gil.h:5891
#67 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=<optimized out>, throwflag=<optimized out>) at Objects/ceval_gil.h:4213
#68 0x00007f547ed379cb in _PyEval_EvalFrame (throwflag=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>,
f=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>,
tstate=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>) at Objects/marshal.c:46
#69 _PyEval_Vector (tstate=<optimized out>, con=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=2, kwnames=<optimized out>) at Objects/ceval_gil.h:5065
#70 0x00007f547ed437cb in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=2, args=0x7fffcecdc150, callable=0x7f547f13b370, tstate=0x9823c0)
at /tmp/boegelbot/Python/3.10.8/GCCcore-12.2.0/Python-3.10.8/abstract.c:99
#71 object_vacall (tstate=0x9823c0, base=<optimized out>, callable=0x7f547f13b370, vargs=0x7fffcecdc1e0) at ./Modules/abstract.h:734
#72 0x00007f547ed4ef08 in _PyObject_CallMethodIdObjArgs (obj=0x0, name=<optimized out>) at ./Modules/abstract.h:825
#73 0x00007f547ed4e7da in import_find_and_load (abs_name=0x7f5470ac9d40, abs_name@entry=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>,
tstate=0x9823c0, tstate@entry=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>) at Objects/pylifecycle.c:1521
#74 PyImport_ImportModuleLevelObject (name=0x7f5470ad4370, globals=<optimized out>, locals=<optimized out>, fromlist=0x7f5470c4b7f0, level=1) at Objects/pylifecycle.c:1622
#75 0x00007f547ed3c068 in import_name (level=0x7f547f0d80f0, fromlist=0x7f5470c4b7f0, name=0x7f5470ad4370, f=<optimized out>, tstate=<optimized out>) at Objects/ceval_gil.h:6016
#76 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=<optimized out>, throwflag=<optimized out>) at Objects/ceval_gil.h:3695
#77 0x00007f547ed379cb in _PyEval_EvalFrame (throwflag=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>,
f=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>,
tstate=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>) at Objects/marshal.c:46
#78 _PyEval_Vector (tstate=<optimized out>, con=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=0, kwnames=<optimized out>) at Objects/ceval_gil.h:5065
#79 0x00007f547edad249 in PyEval_EvalCode (co=0x7f5470c7ddc0, globals=0x7f5470c9d940, locals=0x7f5470c9d940) at Objects/ceval_gil.h:1134
#80 0x00007f547edb4497 in builtin_exec_impl (module=<optimized out>, locals=0x7f5470c9d940, globals=0x7f5470c9d940, source=0x7f5470c7ddc0) at Python/getplatform.c:1003
#81 builtin_exec (module=<optimized out>, args=<optimized out>, nargs=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>)
at /tmp/boegelbot/Python/3.10.8/GCCcore-12.2.0/Python-3.10.8/cellobject.c:371
#82 0x00007f547ed4414a in cfunction_vectorcall_FASTCALL (func=0x7f547f170e00, args=0x7f5470cb7458, nargsf=<optimized out>, kwnames=<optimized out>) at ./Python/pycore_bitutils.h:430
#83 0x00007f547ed3bcf0 in _PyEval_EvalFrameDefault (tstate=<optimized out>, f=<optimized out>, throwflag=<optimized out>) at Objects/ceval_gil.h:4277
#84 0x00007f547ed379cb in _PyEval_EvalFrame (throwflag=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>,
f=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>,
tstate=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>) at Objects/marshal.c:46
#85 _PyEval_Vector (tstate=<optimized out>, con=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=3, kwnames=<optimized out>) at Objects/ceval_gil.h:5065
#86 0x00007f547ed398b9 in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7f547f19f648, callable=0x7f547f139510, tstate=0x9823c0)
at /tmp/boegelbot/Python/3.10.8/GCCcore-12.2.0/Python-3.10.8/abstract.c:123
#87 PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7f547f19f648, callable=0x7f547f139510) at /tmp/boegelbot/Python/3.10.8/GCCcore-12.2.0/Python-3.10.8/abstract.c:123
#88 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, trace_info=0x7fffcecdca20, tstate=<optimized out>) at Objects/ceval_gil.h:5891
#89 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=<optimized out>, throwflag=<optimized out>) at Objects/ceval_gil.h:4181
#90 0x00007f547ed379cb in _PyEval_EvalFrame (throwflag=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>,
f=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>,
tstate=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>) at Objects/marshal.c:46
#91 _PyEval_Vector (tstate=<optimized out>, con=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=2, kwnames=<optimized out>) at Objects/ceval_gil.h:5065
#92 0x00007f547ed38dff in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7f547f1960c0, callable=0x7f547f1b9a20, tstate=0x9823c0)
at /tmp/boegelbot/Python/3.10.8/GCCcore-12.2.0/Python-3.10.8/abstract.c:123
#93 PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7f547f1960c0, callable=0x7f547f1b9a20) at /tmp/boegelbot/Python/3.10.8/GCCcore-12.2.0/Python-3.10.8/abstract.c:123
#94 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, trace_info=0x7fffcecdcc80, tstate=<optimized out>) at Objects/ceval_gil.h:5891
#95 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=<optimized out>, throwflag=<optimized out>) at Objects/ceval_gil.h:4198
#96 0x00007f547ed379cb in _PyEval_EvalFrame (throwflag=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>,
f=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>,
tstate=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>) at Objects/marshal.c:46
#97 _PyEval_Vector (tstate=<optimized out>, con=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=1, kwnames=<optimized out>) at Objects/ceval_gil.h:5065
#98 0x00007f547ed38ab4 in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=<optimized out>, callable=0x7f547f13a0e0, tstate=0x9823c0)
at /tmp/boegelbot/Python/3.10.8/GCCcore-12.2.0/Python-3.10.8/abstract.c:123
#99 PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=<optimized out>, callable=0x7f547f13a0e0) at /tmp/boegelbot/Python/3.10.8/GCCcore-12.2.0/Python-3.10.8/abstract.c:123
#100 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, trace_info=0x7fffcecdcee0, tstate=<optimized out>) at Objects/ceval_gil.h:5891
#101 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=<optimized out>, throwflag=<optimized out>) at Objects/ceval_gil.h:4213
#102 0x00007f547ed379cb in _PyEval_EvalFrame (throwflag=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>,
f=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>,
tstate=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>) at Objects/marshal.c:46
#103 _PyEval_Vector (tstate=<optimized out>, con=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=2, kwnames=<optimized out>) at Objects/ceval_gil.h:5065
#104 0x00007f547ed38ab4 in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=<optimized out>, callable=0x7f547f13b2e0, tstate=0x9823c0)
at /tmp/boegelbot/Python/3.10.8/GCCcore-12.2.0/Python-3.10.8/abstract.c:123
#105 PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=<optimized out>, callable=0x7f547f13b2e0) at /tmp/boegelbot/Python/3.10.8/GCCcore-12.2.0/Python-3.10.8/abstract.c:123
#106 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, trace_info=0x7fffcecdd140, tstate=<optimized out>) at Objects/ceval_gil.h:5891
#107 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=<optimized out>, throwflag=<optimized out>) at Objects/ceval_gil.h:4213
#108 0x00007f547ed379cb in _PyEval_EvalFrame (throwflag=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>,
f=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>,
tstate=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>) at Objects/marshal.c:46
#109 _PyEval_Vector (tstate=<optimized out>, con=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=2, kwnames=<optimized out>) at Objects/ceval_gil.h:5065
#110 0x00007f547ed437cb in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=2, args=0x7fffcecdd2d0, callable=0x7f547f13b370, tstate=0x9823c0)
at /tmp/boegelbot/Python/3.10.8/GCCcore-12.2.0/Python-3.10.8/abstract.c:99
#111 object_vacall (tstate=0x9823c0, base=<optimized out>, callable=0x7f547f13b370, vargs=0x7fffcecdd360) at ./Modules/abstract.h:734
#112 0x00007f547ed4ef08 in _PyObject_CallMethodIdObjArgs (obj=0x0, name=<optimized out>) at ./Modules/abstract.h:825
#113 0x00007f547ed4e7da in import_find_and_load (abs_name=0x7f5470c46370, abs_name@entry=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>,
tstate=0x9823c0, tstate@entry=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>) at Objects/pylifecycle.c:1521
#114 PyImport_ImportModuleLevelObject (name=0x7f5470c46370, globals=<optimized out>, locals=<optimized out>, fromlist=0x7f547efa5ae0 <_Py_NoneStruct>, level=0)
at Objects/pylifecycle.c:1622
#115 0x00007f547ed3c068 in import_name (level=0x7f547f0d80d0, fromlist=0x7f547efa5ae0 <_Py_NoneStruct>, name=0x7f5470c46370, f=<optimized out>, tstate=<optimized out>)
at Objects/ceval_gil.h:6016
#116 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=<optimized out>, throwflag=<optimized out>) at Objects/ceval_gil.h:3695
#117 0x00007f547ed379cb in _PyEval_EvalFrame (throwflag=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>,
f=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>,
tstate=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>) at Objects/marshal.c:46
#118 _PyEval_Vector (tstate=<optimized out>, con=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=0, kwnames=<optimized out>) at Objects/ceval_gil.h:5065
#119 0x00007f547edad249 in PyEval_EvalCode (co=0x7f5470befe10, globals=0x7f5470c00100, locals=0x7f5470c00100) at Objects/ceval_gil.h:1134
#120 0x00007f547edbd9e3 in run_eval_code_obj (tstate=0x9823c0, co=0x7f5470befe10, globals=0x7f5470c00100, locals=0x7f5470c00100) at Modules/find.h:1291
#121 0x00007f547edb96ea in run_mod (mod=<optimized out>, filename=<optimized out>, globals=0x7f5470c00100, locals=0x7f5470c00100, flags=<optimized out>, arena=<optimized out>)
at Modules/find.h:1312
#122 0x00007f547edb17cd in PyRun_StringFlags (str=<optimized out>, start=257, globals=0x7f5470c00100, locals=0x7f5470c00100, flags=0x7fffcecdd8a0) at Modules/find.h:1183
#123 0x00007f547edb172c in PyRun_SimpleStringFlags (command=0x7f5470c12a10 "import waLBerla\n", flags=0x7fffcecdd8a0) at Modules/find.h:503
#124 0x00007f547edc9c7b in pymain_run_command (command=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>) at Objects/fileutils.c:248
#125 pymain_run_python (exitcode=0x7fffcecdd894) at Objects/fileutils.c:578
#126 Py_RunMain () at Objects/fileutils.c:666
#127 0x00007f547ed9ff67 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at Objects/fileutils.c:720
#128 0x00007f547dc90493 in __libc_start_main () from /lib64/libc.so.6
#129 0x000000000040106e in _start ()
And for the mpirun process itself:
(gdb) thread apply all backtrace
Thread 4 (Thread 0x7f91a7de2700 (LWP 903563)):
#0 0x00007f91a919c29f in select () from /lib64/libc.so.6
#1 0x00007f91a7deba80 in listen_thread () from /project/boegelbot/Rocky8/haswell/software/OpenMPI/4.1.4-GCC-12.2.0/lib/openmpi/mca_oob_tcp.so
#2 0x00007f91a947517a in start_thread () from /lib64/libpthread.so.0
#3 0x00007f91a91a4dc3 in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7f91a8608700 (LWP 903562)):
#0 0x00007f91a919c29f in select () from /lib64/libc.so.6
#1 0x00007f91a8f9e087 in listen_thread () from /project/boegelbot/Rocky8/haswell/software/PMIx/4.2.2-GCCcore-12.2.0/lib/libpmix.so.2
#2 0x00007f91a947517a in start_thread () from /lib64/libpthread.so.0
#3 0x00007f91a91a4dc3 in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f91a8e13700 (LWP 903561)):
#0 0x00007f91a91a50f7 in epoll_wait () from /lib64/libc.so.6
#1 0x00007f91a9a354b3 in epoll_dispatch () from /project/boegelbot/Rocky8/haswell/software/libevent/2.1.12-GCCcore-12.2.0/lib/libevent_core-2.1.so.7
#2 0x00007f91a9a2bc95 in event_base_loop () from /project/boegelbot/Rocky8/haswell/software/libevent/2.1.12-GCCcore-12.2.0/lib/libevent_core-2.1.so.7
#3 0x00007f91a8ec17e1 in progress_engine () from /project/boegelbot/Rocky8/haswell/software/PMIx/4.2.2-GCCcore-12.2.0/lib/libpmix.so.2
#4 0x00007f91a947517a in start_thread () from /lib64/libpthread.so.0
#5 0x00007f91a91a4dc3 in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f91a90a3740 (LWP 903560)):
#0 0x00007f91a9199a41 in poll () from /lib64/libc.so.6
#1 0x00007f91a9a34825 in poll_dispatch () from /project/boegelbot/Rocky8/haswell/software/libevent/2.1.12-GCCcore-12.2.0/lib/libevent_core-2.1.so.7
#2 0x00007f91a9a2bc95 in event_base_loop () from /project/boegelbot/Rocky8/haswell/software/libevent/2.1.12-GCCcore-12.2.0/lib/libevent_core-2.1.so.7
#3 0x0000000000401399 in orterun ()
#4 0x00007f91a90cb493 in __libc_start_main () from /lib64/libc.so.6
#5 0x000000000040113e in _start ()
A quick search shows it could be this issue: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=903514
Thanks, I looked into this a bit, and I'm not sure if I completely understood that issue. But I tried recompiling OpenBLAS with USE_TLS=0, but that didn't make a difference. Or did you mean that it could be a similar issue, but between glibc and UCX? Is there any way I can test this somehow?
By the way, I also tried to use the same compiler toolchain but with an older UCX version (1.10.0), and that did work fine.
It seems like an issue between glibc and UCX: a deadlock between one thread reading a TLS value and another thread calling dlclose(). dlclose() takes the TLS lock and then calls the UCX destructor, which tries to stop a thread that is reading TLS and is stuck waiting for that same lock. One workaround I can think of is that, when spawning a new thread, the main thread would wait for the async thread to get past the point of reading the TLS value before returning to the main thread flow and allowing dlclose() to happen. UCX version 1.10.0 did not use TLS.
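The proposed workaround is essentially a start-up handshake between the spawning thread and the async thread. The sketch below illustrates that pattern with Python's `threading` module; it is only an analogy, not UCX code, and the class `AsyncWorker` and its attributes are hypothetical names chosen for this example. `threading.local()` stands in for the TLS access seen in the backtrace.

```python
import threading


class AsyncWorker:
    """Toy stand-in for an async helper thread (an illustration, not UCX code)."""

    def __init__(self) -> None:
        self._started = threading.Event()  # set once the per-thread setup is done
        self._stop = threading.Event()     # set when the worker should exit
        self._local = threading.local()    # stand-in for thread-local storage
        self.name_seen = None
        self._thread = threading.Thread(target=self._run)

    def _run(self) -> None:
        # Per-thread setup that touches thread-local state.
        self._local.name = "async"
        self.name_seen = self._local.name
        # Tell the spawning thread that the setup is finished.
        self._started.set()
        # Idle until asked to stop.
        self._stop.wait()

    def start(self) -> None:
        self._thread.start()
        # Handshake: block until the worker has completed its thread-local setup,
        # so a later stop() can never overlap with that setup.
        self._started.wait()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()
```

With this ordering, `stop()` can only ever run after the worker has finished its thread-local setup, which is the property the workaround relies on.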
Describe the bug
I'm running into a weird issue on one particular system where importing the Python interface of waLBerla, which I compiled from source using EasyBuild, hangs:
Then I found that disabling the UCX PML solved the issue:
So I tried again with UCX and some more debugging output, but then suddenly it works:
I've tried it with both the following set of dependencies:
and with some slightly newer versions:
And also with UCX/1.15.0 I'm still seeing this same issue.
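The exact commands for the runs described above are not preserved here, so the following is only a sketch of how the three variants could be compared. It assumes Open MPI's convention that an MCA parameter can be set through an `OMPI_MCA_<name>` environment variable (with `^ucx` excluding the UCX component), and it assumes `debug` as the value for `UCX_LOG_LEVEL`; the helper names `variant` and `finishes_within` are hypothetical.

```python
import os
import subprocess

# Reproducer command from this issue.
BASE_CMD = ["mpirun", "-np", "1", "python", "-c", "import waLBerla"]


def variant(name: str):
    """Return (argv, env) for one of the three runs compared in this issue."""
    env = dict(os.environ)
    if name == "default":
        pass                                # plain run, reported to hang
    elif name == "no_ucx_pml":
        env["OMPI_MCA_pml"] = "^ucx"        # assumption: exclude the UCX PML
    elif name == "ucx_debug":
        env["UCX_LOG_LEVEL"] = "debug"      # assumption: the level used is 'debug'
    else:
        raise ValueError(f"unknown variant: {name}")
    return list(BASE_CMD), env


def finishes_within(argv, env, timeout: float) -> bool:
    """Run argv and report whether it exited before `timeout` seconds elapsed."""
    try:
        subprocess.run(
            argv,
            env=env,
            timeout=timeout,
            check=False,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return False                        # treated as a hang
    return True
```

Looping over the three variant names and calling `finishes_within(*variant(name), timeout=60)` would give a quick table of which configurations hang on a given system.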
These are the last lines of strace output for a run that hangs:
I'm not sure how to get more information, as increasing the verbosity solves the issue. I've included the output for a (successful) run with UCX_LOG_LEVEL at the bottom of this issue.
Steps to Reproduce
mpirun -np 1 python -c "import waLBerla"
ucx_info -v
)UCX_LOG_LEVEL
Setup and versions
OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
cat /etc/issue or cat /etc/redhat-release + uname -a:
Linux login1 4.18.0-348.12.2.el8_5.x86_64 #1 SMP Wed Jan 19 17:53:40 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
For RDMA/IB/RoCE related issues:
rpm -q rdma-core: rdma-core-35.0-1.el8.x86_64
rpm -q libibverbs: libibverbs-35.0-1.el8.x86_64
ibstat or ibv_devinfo -vv command: ibstat is available, but there's no InfiniBand, hence no output from the command
Additional information (depending on the issue)
ucx_info -d to show transports and devices recognized by UCX:
[1701777539.029420] [login1:643305:0] debug.c:1146 UCX DEBUG using signal stack 0x7fe1dcc9b000 size 141824
[1701777539.029493] [login1:643305:0] cpu.c:233 UCX DEBUG CPU does not support invariant TSC, using fallback timer
[1701777539.029518] [login1:643305:0] init.c:118 UCX DEBUG /project/boegelbot/Rocky8/haswell/software/UCX/1.13.1-GCCcore-12.2.0/lib/libucs.so.0 loaded at 0x7fe1dce15000
[1701777539.029541] [login1:643305:0] init.c:120 UCX DEBUG cmd line: python -c import waLBerla
[1701777539.029555] [login1:643305:0] module.c:72 UCX DEBUG ucs library path: /project/boegelbot/Rocky8/haswell/software/UCX/1.13.1-GCCcore-12.2.0/lib/libucs.so.0
[1701777539.029560] [login1:643305:0] module.c:282 UCX DEBUG loading modules for ucs
[1701777539.030590] [login1:643305:0] time.c:22 UCX DEBUG arch clock frequency: 1000000.00 Hz
[1701777539.030660] [login1:643305:0] ucp_context.c:1849 UCX INFO Version 1.13.1 (loaded from /project/boegelbot/Rocky8/haswell/software/UCX/1.13.1-GCCcore-12.2.0/lib/libucp.so.0)
[1701777539.030672] [login1:643305:0] ucp_context.c:1624 UCX DEBUG estimated number of endpoints is 1
[1701777539.030675] [login1:643305:0] ucp_context.c:1631 UCX DEBUG estimated number of endpoints per node is 1
[1701777539.030682] [login1:643305:0] ucp_context.c:1638 UCX DEBUG estimated bcopy bandwidth is 6081740800.000000
[1701777539.030702] [login1:643305:0] ucp_context.c:1705 UCX DEBUG allocation method[0] is md 'sysv'
[1701777539.030709] [login1:643305:0] ucp_context.c:1705 UCX DEBUG allocation method[1] is md 'posix'
[1701777539.030716] [login1:643305:0] ucp_context.c:1717 UCX DEBUG allocation method[2] is 'huge'
[1701777539.030722] [login1:643305:0] ucp_context.c:1717 UCX DEBUG allocation method[3] is 'thp'
[1701777539.030725] [login1:643305:0] ucp_context.c:1705 UCX DEBUG allocation method[4] is md '*'
[1701777539.030732] [login1:643305:0] ucp_context.c:1717 UCX DEBUG allocation method[5] is 'mmap'
[1701777539.030734] [login1:643305:0] ucp_context.c:1717 UCX DEBUG allocation method[6] is 'heap'
[1701777539.030751] [login1:643305:0] module.c:282 UCX DEBUG loading modules for uct
[1701777539.033561] [login1:643305:0] module.c:282 UCX DEBUG loading modules for uct_ib
[1701777539.033640] [login1:643305:0] ib_md.c:1195 UCX DEBUG Failed to get IB device list, assuming no devices are present
[1701777539.034009] [login1:643305:0] ib_md.c:1195 UCX DEBUG Failed to get IB device list, assuming no devices are present
[1701777539.034080] [login1:643305:0] mpool.c:98 UCX DEBUG mpool rcache_mp: align 8, maxelems 4294967295, elemsize 144
[1701777539.037397] [login1:643305:0] async.c:230 UCX DEBUG added async handler 0x954bf0 [id=23 ref 1] ucs_rcache_invalidate_handler() to hash
[1701777539.037532] [login1:643305:0] async.c:508 UCX DEBUG listening to async event fd 23 events 0x1 mode thread_spinlock
[1701777539.037600] [login1:643305:0] module.c:282 UCX DEBUG loading modules for ucm
[1701777539.037632] [login1:643305:0] ucp_context.c:1913 UCX DEBUG created ucp context 0x95dcb0 0x95dcb0 [5 mds 6 tls] features 0x1 tl bitmap 0x3f 0x0
[1701777539.044183] [login1:643305:0] async.c:155 UCX DEBUG removed async handler 0x954bf0 [id=23 ref 1] ucs_rcache_invalidate_handler() from hash
[1701777539.044199] [login1:643305:0] async.c:561 UCX DEBUG removing async handler 0x954bf0 [id=23 ref 1] ucs_rcache_invalidate_handler()
[1701777539.044270] [login1:643305:0] async.c:170 UCX DEBUG release async handler 0x954bf0 [id=23 ref 0] ucs_rcache_invalidate_handler()
[1701777539.044283] [login1:643305:0] pgtable.c:618 UCX DEBUG purge empty page table
[1701777539.044293] [login1:643305:0] mpool.c:154 UCX DEBUG mpool rcache_mp destroyed