peterwittek / somoclu

Massively parallel self-organizing maps: accelerate training on multicore CPUs, GPUs, and clusters
https://peterwittek.github.io/somoclu/
MIT License

(core dumped) #154

Open bedassa opened 4 years ago

bedassa commented 4 years ago

I received a "core dumped" error. My data size is 382776x174688. I submitted the job on a high-performance computing cluster using the script line: mpirun -np 8 somoclu -g hexagonal -m toroid --rows 22 --columns 17 psl_n.txt psl_DJF

Error in `somoclu': munmap_chunk(): invalid pointer: 0x0000000001807310
======= Backtrace: =========
/lib64/libc.so.6(+0x7ada4)[0x2b1d72d3cda4]
somoclu[0x437528]
somoclu[0x4070b3]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b1d72ce3b35]
somoclu[0x40751c]
======= Memory map: ========
00400000-005e8000 r-xp 00000000 00:2a 113822987310 /home/bcheneka/Build_WRF/LIBRARIES/somoclu/src/somoclu
007e8000-007e9000 r--p 001e8000 00:2a 113822987310 /home/bcheneka/Build_WRF/LIBRARIES/somoclu/src/somoclu
007e9000-007ea000 rw-p 001e9000 00:2a 113822987310 /home/bcheneka/Build_WRF/LIBRARIES/somoclu/src/somoclu
007ea000-007eb000 rw-p 00000000 00:00 0
01783000-01b25000 rw-p 00000000 00:00 0 [heap]
2b1d6ef41000-2b1d6ef61000 r-xp 00000000 00:24 201029694 /usr/lib64/ld-2.17.so
2b1d6ef61000-2b1d6ef63000 rw-p 00000000 00:00 0
2b1d6ef9b000-2b1d6efa3000 rw-p 00000000 00:00 0
2b1d6f160000-2b1d6f161000 r--p 0001f000 00:24 201029694 /usr/lib64/ld-2.17.so
2b1d6f161000-2b1d6f162000 rw-p 00020000 00:24 201029694 /usr/lib64/ld-2.17.so
2b1d6f162000-2b1d6f163000 rw-p 00000000 00:00 0
2b1d6f163000-2b1d6f165000 r-xp 00000000 00:24 201299306 /usr/lib64/libdl-2.17.so
2b1d6f165000-2b1d6f365000 ---p 00002000 00:24 201299306 /usr/lib64/libdl-2.17.so
2b1d6f365000-2b1d6f366000 r--p 00002000 00:24 201299306 /usr/lib64/libdl-2.17.so
2b1d6f366000-2b1d6f367000 rw-p 00003000 00:24 201299306 /usr/lib64/libdl-2.17.so
2b1d6f367000-2b1d6f3c9000 r-xp 00000000 00:2b 340587319 /opt/ud/cuda-8.0/lib64/libcudart.so.8.0.44
2b1d6f3c9000-2b1d6f5c9000 ---p 00062000 00:2b 340587319 /opt/ud/cuda-8.0/lib64/libcudart.so.8.0.44
2b1d6f5c9000-2b1d6f5cc000 rw-p 00062000 00:2b 340587319 /opt/ud/cuda-8.0/lib64/libcudart.so.8.0.44
2b1d6f5cc000-2b1d6f5cd000 rw-p 00000000 00:00 0
2b1d6f5cd000-2b1d71d52000 r-xp 00000000 00:2b 340587313 /opt/ud/cuda-8.0/lib64/libcublas.so.8.0.45
2b1d71d52000-2b1d71f51000 ---p 02785000 00:2b 340587313 /opt/ud/cuda-8.0/lib64/libcublas.so.8.0.45
2b1d71f51000-2b1d71f6f000 rw-p 02784000 00:2b 340587313 /opt/ud/cuda-8.0/lib64/libcublas.so.8.0.45
2b1d71f6f000-2b1d71f7d000 rw-p 00000000 00:00 0
2b1d71f7d000-2b1d72146000 r-xp 00000000 00:2a 49401166130 /home/bcheneka/gcc-9.2.0/lib64/libstdc++.so.6.0.27
2b1d72146000-2b1d72345000 ---p 001c9000 00:2a 49401166130 /home/bcheneka/gcc-9.2.0/lib64/libstdc++.so.6.0.27
2b1d72345000-2b1d72350000 r--p 001c8000 00:2a 49401166130 /home/bcheneka/gcc-9.2.0/lib64/libstdc++.so.6.0.27
2b1d72350000-2b1d72353000 rw-p 001d3000 00:2a 49401166130 /home/bcheneka/gcc-9.2.0/lib64/libstdc++.so.6.0.27
2b1d72353000-2b1d72356000 rw-p 00000000 00:00 0
2b1d72356000-2b1d72456000 r-xp 00000000 00:24 201413623 /usr/lib64/libm-2.17.so
2b1d72456000-2b1d72656000 ---p 00100000 00:24 201413623 /usr/lib64/libm-2.17.so
2b1d72656000-2b1d72657000 r--p 00100000 00:24 201413623 /usr/lib64/libm-2.17.so
2b1d72657000-2b1d72658000 rw-p 00101000 00:24 201413623 /usr/lib64/libm-2.17.so
2b1d72658000-2b1d7268c000 r-xp 00000000 00:2a 49393205273 /home/bcheneka/gcc-9.2.0/lib64/libgomp.so.1.0.0
2b1d7268c000-2b1d7288c000 ---p 00034000 00:2a 49393205273 /home/bcheneka/gcc-9.2.0/lib64/libgomp.so.1.0.0
2b1d7288c000-2b1d7288d000 r--p 00034000 00:2a 49393205273 /home/bcheneka/gcc-9.2.0/lib64/libgomp.so.1.0.0
2b1d7288d000-2b1d7288e000 rw-p 00035000 00:2a 49393205273 /home/bcheneka/gcc-9.2.0/lib64/libgomp.so.1.0.0
2b1d7288e000-2b1d728a5000 r-xp 00000000 00:2a 49401166125 /home/bcheneka/gcc-9.2.0/lib64/libgcc_s.so.1
2b1d728a5000-2b1d72aa4000 ---p 00017000 00:2a 49401166125 /home/bcheneka/gcc-9.2.0/lib64/libgcc_s.so.1
2b1d72aa4000-2b1d72aa5000 r--p 00016000 00:2a 49401166125 /home/bcheneka/gcc-9.2.0/lib64/libgcc_s.so.1
2b1d72aa5000-2b1d72aa6000 rw-p 00017000 00:2a 49401166125 /home/bcheneka/gcc-9.2.0/lib64/libgcc_s.so.1
2b1d72aa6000-2b1d72abd000 r-xp 00000000 00:24 201413908 /usr/lib64/libpthread-2.17.so
2b1d72abd000-2b1d72cbc000 ---p 00017000 00:24 201413908 /usr/lib64/libpthread-2.17.so
2b1d72cbc000-2b1d72cbd000 r--p 00016000 00:24 201413908 /usr/lib64/libpthread-2.17.so
2b1d72cbd000-2b1d72cbe000 rw-p 00017000 00:24 201413908 /usr/lib64/libpthread-2.17.so
2b1d72cbe000-2b1d72cc2000 rw-p 00000000 00:00 0
2b1d72cc2000-2b1d72e78000 r-xp 00000000 00:24 201299203 /usr/lib64/libc-2.17.so
2b1d72e78000-2b1d73078000 ---p 001b6000 00:24 201299203 /usr/lib64/libc-2.17.so
2b1d73078000-2b1d7307c000 r--p 001b6000 00:24 201299203 /usr/lib64/libc-2.17.so
2b1d7307c000-2b1d7307e000 rw-p 001ba000 00:24 201299203 /usr/lib64/libc-2.17.so
2b1d7307e000-2b1d73083000 rw-p 00000000 00:00 0
2b1d73083000-2b1d7308a000 r-xp 00000000 00:24 201427121 /usr/lib64/librt-2.17.so
2b1d7308a000-2b1d73289000 ---p 00007000 00:24 201427121 /usr/lib64/librt-2.17.so
2b1d73289000-2b1d7328a000 r--p 00006000 00:24 201427121 /usr/lib64/librt-2.17.so
2b1d7328a000-2b1d7328b000 rw-p 00007000 00:24 201427121 /usr/lib64/librt-2.17.so
7fff63ab4000-7fff63ad6000 rw-p 00000000 00:00 0 [stack]
7fff63bd2000-7fff63bd4000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
/var/opt/ud/torque-4.2.10/mom_priv/jobs/285503.hpc12.hpc.SC: line 28: 331675 Aborted (core dumped) MP_NUM_THREADS=8 somoclu -g hexagonal -m toroid --rows 22 --columns 17 psl_n.txt psl_DJF

xgdgsc commented 4 years ago

Is there a ulimit on memory usage on the server?
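To put the memory question in perspective, here is a rough back-of-the-envelope sketch of how much RAM a dense 382776x174688 input would need. It assumes the data is held as a dense matrix of 32-bit floats and that the rows are split evenly across the 8 MPI ranks; both are assumptions for illustration, not something confirmed in this thread. The function names are hypothetical helpers.

```python
def dense_matrix_bytes(n_rows, n_cols, bytes_per_value=4):
    """Bytes needed for a dense n_rows x n_cols matrix.

    bytes_per_value=4 assumes 32-bit floats (an assumption, see above).
    """
    return n_rows * n_cols * bytes_per_value


def to_gib(n_bytes):
    """Convert a byte count to GiB (2**30 bytes)."""
    return n_bytes / 2**30


# Figures taken from the report: data size and map size (--rows 22 --columns 17).
n_vectors, n_dims = 382776, 174688
n_ranks = 8  # from "mpirun -np 8"

data_bytes = dense_matrix_bytes(n_vectors, n_dims)
codebook_bytes = dense_matrix_bytes(22 * 17, n_dims)

print(f"full data set : {to_gib(data_bytes):8.1f} GiB")
# Assumes an even split of rows across ranks.
print(f"per MPI rank  : {to_gib(data_bytes / n_ranks):8.1f} GiB")
print(f"codebook      : {to_gib(codebook_bytes):8.2f} GiB")
```

Under these assumptions the full data set comes to roughly 249 GiB, or about 31 GiB per rank, so it is worth checking how much memory each node in the job actually has.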

bedassa commented 4 years ago

Yes,
$ ulimit
unlimited
$ ulimit -u unlimited
-bash: ulimit: max user processes: cannot modify limit: Operation not permitted
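Note that `ulimit -u` is the limit on user processes, as the error message says, not on memory. The memory-related limits are the address-space and data-segment limits. The following is a minimal sketch for printing them from inside the job script, since limits in a batch job may differ from those in a login shell. It uses Python's standard `resource` module (Unix only); the `LIMITS` table and the helper names are illustrative choices.

```python
import resource

# Limits relevant to the question above. getrlimit reports byte counts
# (or a process count for NPROC), whereas "ulimit -v" prints KiB.
LIMITS = {
    "address space (ulimit -v)": resource.RLIMIT_AS,
    "data segment (ulimit -d)": resource.RLIMIT_DATA,
    "max user processes (ulimit -u)": resource.RLIMIT_NPROC,
}


def describe(value):
    """Render a raw limit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def read_limits():
    """Return {label: (soft, hard)} for each limit in LIMITS."""
    return {
        label: tuple(describe(v) for v in resource.getrlimit(res))
        for label, res in LIMITS.items()
    }


for label, (soft, hard) in read_limits().items():
    print(f"{label:32s} soft={soft:>12s} hard={hard:>12s}")
```

Running this as the first step of the submitted job shows the limits that the somoclu processes would actually inherit.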

xgdgsc commented 4 years ago

I don't have experience debugging MPI programs. Can you run gdb and see where it crashes? https://stackoverflow.com/questions/329259/how-do-i-debug-an-mpi-program