openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

memory hook test failure on PPC #4681

Open dmitrygx opened 4 years ago

dmitrygx commented 4 years ago
==== Running memory hook (malloc_hooks) on MPI with LD_PRELOAD ====
+ ucm_lib=/scrap/jenkins/workspace/hpc-ucx-pr/label/r-vmb-ppc-jenkins/worker/0/build-test/src/ucm/.libs/libucm.so
+ ls -l /scrap/jenkins/workspace/hpc-ucx-pr/label/r-vmb-ppc-jenkins/worker/0/build-test/src/ucm/.libs/libucm.so
lrwxrwxrwx 1 swx-jenkins swx-jenkins 15 Jan 15 16:00 /scrap/jenkins/workspace/hpc-ucx-pr/label/r-vmb-ppc-jenkins/worker/0/build-test/src/ucm/.libs/libucm.so -> libucm.so.0.0.0
+ mpirun -x UCX_ERROR_SIGNALS -x UCX_HANDLE_ERRORS -mca pml ob1 -mca btl tcp,self -mca btl_tcp_if_include lo -mca coll '^hcoll,ml' -np 1 -x LD_PRELOAD=/scrap/jenkins/workspace/hpc-ucx-pr/label/r-vmb-ppc-jenkins/worker/0/build-test/src/ucm/.libs/libucm.so taskset -c 6,7 ./test/mpi/test_memhooks -t malloc_hooks
malloc_hooks: initialized
Allocating memory
After shmat: reported mapped=1048576
After shmdt: reported unmapped=1048576
After shmat(REMAP): reported mapped=1048576 unmapped=1048576
After shmdt: reported unmapped=1048576
After core malloc: reported mapped=1179648
After mmap malloc: reported mapped=2162688
After mmap: reported mapped=1048576
After mmap(FIXED): reported unmapped=1048576
After munmap: reported unmapped=1048576
After mmap free + trim: reported unmapped=2293760
After another mmap from dynamic lib: reported mapped=1048576
After core malloc free: reported unmapped=917504
malloc_hooks: PASS
[r-vmb-ppc-jenkins:7613 :0:7614] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3fffb51bbb70)
==== backtrace (tid:   7614) ====
 0 0x0000000000050af0 ucs_debug_print_backtrace()  /scrap/jenkins/workspace/hpc-ucx-pr/label/r-vmb-ppc-jenkins/worker/0/contrib/../src/ucs/debug/debug.c:625
 1 0x00000000000a58f0 event_process_active_single_queue()  /build-result/src/hpcx-gcc-redhat7.4/ompi-v4.0.x/opal/mca/event/libevent2022/libevent/event.c:1370
 2 0x000000000004598c progress_engine()  /build-result/src/hpcx-gcc-redhat7.4/ompi-v4.0.x/opal/runtime/opal_progress_threads.c:105
 3 0x0000000000008af4 start_thread()  pthread_create.c:0
 4 0x0000000000124ef4 __clone()  ???:0
=================================

Gist: https://gist.githubusercontent.com/mellanox-github/97347b6cb34c7babc9cefc1b0e2593ce/raw/4732a3049c45d2f440d31f30a38ac5f98f9bad64/r-vmb-ppc-jenkins_W0
Jenkins job: http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/14092/label=r-vmb-ppc-jenkins,worker=0/console

evgeny-leksikov commented 4 years ago

Similar failure on x86: http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/label=hpc-test-node-gpu,worker=0/14208/consoleFull#-853622487b816df86-9afc-46eb-984f-1a9a125d73bb

alinask commented 4 years ago

Same failure on hpc-test-node-gpu02 (this run is without LD_PRELOAD):

13:04:47 + echo '==== Running memory hook (malloc_hooks) on MPI ===='
13:04:47 ==== Running memory hook (malloc_hooks) on MPI ====
13:04:47 + mpirun -x UCX_ERROR_SIGNALS -x UCX_HANDLE_ERRORS -mca pml ob1 -mca btl tcp,self -mca btl_tcp_if_include lo -mca coll '^hcoll,ml' -np 1 taskset -c 16,17 ./test/mpi/test_memhooks -t malloc_hooks
13:04:48 malloc_hooks: initialized
13:04:48 Allocating memory
13:04:48 After shmat: reported mapped=1048576
13:04:48 After shmdt: reported unmapped=1048576
13:04:48 After shmat(REMAP): reported mapped=1048576 unmapped=1048576
13:04:48 After shmdt: reported unmapped=1048576
13:04:48 After core malloc: reported mapped=1150976
13:04:48 After mmap malloc: reported mapped=2101248
13:04:48 After mmap: reported mapped=1048576
13:04:48 After mmap(FIXED): reported unmapped=1048576
13:04:48 After munmap: reported unmapped=1048576
13:04:48 After mmap free + trim: reported unmapped=2232320
13:04:48 After another mmap from dynamic lib: reported mapped=1048576
13:04:48 After core malloc free: reported unmapped=917504
13:04:48 malloc_hooks: PASS
13:04:48 [hpc-test-node-gpu02:29105:0:29112] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xa0828b4968)
13:04:48 --------------------------------------------------------------------------
13:04:48 Primary job  terminated normally, but 1 process returned
13:04:48 a non-zero exit code. Per user-direction, the job has been aborted.
13:04:48 --------------------------------------------------------------------------
13:04:50 --------------------------------------------------------------------------
13:04:50 mpirun noticed that process rank 0 with PID 0 on node hpc-test-node-gpu02 exited on signal 11 (Segmentation fault).
13:04:50 --------------------------------------------------------------------------