Janekdererste opened this issue 2 years ago
Forgive me; I'm fairly ignorant of Java. I tried to replicate the problem, but I notice that when I run mpi.sh
from the https://github.com/mboysan/ping-pong-mpi-tcp project, the Maven build downloads an Open MPI jarfile for v4.0.1. I assume it then uses that jarfile to run the application.
How do I get it to use my local Open MPI Java install?
Thank you very much for trying it out. This is a little tedious, unfortunately.
The jar has to be in some kind of maven repository. You could create a local maven repository like this:
$ mvn deploy:deploy-file \
-Durl=file:///path/to/where/you/want/the/local/maven/repo/to/be \
-Dfile=/path/to/your/mpi.jar \
-DgroupId=org.openmpi \
-DartifactId=mpi \
-Dpackaging=jar \
-Dversion=4.1.2
The groupId, artifactId, and version can be chosen however you like.
Then in the pom.xml, the local repository needs to be added to the <repositories> section, like this:
<repository>
<id>Local-mpi</id>
<url>file:///path/to/where/you/want/the/local/maven/repo/to/be</url>
</repository>
Then, in the <dependencies> section of the pom.xml file, you can replace the mpi dependency with the following:
<dependency>
<groupId>org.openmpi</groupId>
<artifactId>mpi</artifactId>
<version>4.1.2</version>
</dependency>
The groupId, artifactId, and version must match what you specified in the mvn deploy:deploy-file command.
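For reference, deploy:deploy-file lays the artifact out under directories derived from the groupId (dots become path separators). A minimal sketch of the resulting layout, using an illustrative /tmp path and an empty placeholder jar rather than a real deploy:

```shell
# Simulate the directory layout that mvn deploy:deploy-file produces for
# groupId=org.openmpi, artifactId=mpi, version=4.1.2.
# The /tmp path and the empty placeholder jar are illustrative only.
REPO=/tmp/local-mvn-repo-demo
mkdir -p "$REPO/org/openmpi/mpi/4.1.2"
touch "$REPO/org/openmpi/mpi/4.1.2/mpi-4.1.2.jar"
find "$REPO" -name '*.jar'
```

If the jar does not show up at a path like this after deploying, Maven will not be able to resolve the org.openmpi:mpi:4.1.2 dependency.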
Thanks! Let me give this a whirl.
@Janekdererste Thanks for the instructions -- with that, I got the ping pong to compile and run with my local Open MPI Java build.
Unfortunately for debugging purposes, it runs successfully for me. However, I see that you're using UCX. I wonder if there's some kind of conflict here with registered memory for InfiniBand...?
@open-mpi/ucx Can you please have a look at this? NOTE: While the same issue undoubtedly exists in main/v5.0.x, be aware of #10245 in terms of installing PMIx/PRTE in the same prefix as OMPI (at least until the issue is resolved).
@Janekdererste can you please try without UCX (replace "--mca pml ucx" with "--mca btl tcp")? @jsquyres how do the Java Open MPI bindings work in the sense of making sure the buffer's virtual address is not moved before a non-blocking MPI request is completed?
@yosefe IIRC, the Java bindings use some sort of special type of Java buffer that is effectively pinned.
@Janekdererste in order to force btl/tcp, you will need to run:
mpirun --mca pml ob1 --mca btl tcp,self
@yosefe Check out https://docs.open-mpi.org/en/v5.0.x/features/java.html#how-to-specify-buffers. These are the upcoming 5.0 docs, but the content basically hasn't changed since Open MPI v4.x.
@Janekdererste While researching this issue, I also updated the Java Open MPI docs. I came across this statement in the docs:
All non-blocking methods must use direct buffers and only blocking methods can choose between arrays and direct buffers.
Does your app (and/or this sample ping-pong app) use any non-blocking MPI methods? If so, are direct buffers used?
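To illustrate the distinction the docs draw: an array-backed (heap) buffer may be relocated by the garbage collector, while a direct buffer lives outside the GC-managed heap and keeps a stable address for as long as a non-blocking request is in flight. A minimal, MPI-free sketch of how the two kinds of buffer are created in plain Java (class and variable names are mine):

```java
import java.nio.ByteBuffer;

public class BufferDemo {
    public static void main(String[] args) {
        // Array-backed (heap) buffer: the JVM may move the backing array
        // during garbage collection, which is unsafe while a non-blocking
        // MPI request still references its address.
        ByteBuffer heap = ByteBuffer.wrap(new byte[16]);

        // Direct buffer: allocated outside the GC-managed heap, so its
        // address stays stable while a request like iSend is in flight.
        ByteBuffer direct = ByteBuffer.allocateDirect(16);

        System.out.println("heap isDirect: " + heap.isDirect());
        System.out.println("direct isDirect: " + direct.isDirect());
    }
}
```

Per the docs statement quoted above, the non-blocking methods of the Java bindings expect the second kind; passing a heap buffer to them is exactly the sort of mistake that can surface later as memory corruption.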
Thanks all for helping on this issue. I tried to run the program with
mpirun --mca pml ob1 --mca btl tcp,self -np 2 java -cp ./target/*jar-with-dependencies.jar MPIMain 2 false
as suggested by @ggouaillardet. The program seems to crash in the same place. The JVM dumps of the two processes are attached (hs_err_pid22088.log, hs_err_pid22089.log), and the log looks like this:
2022-04-11 14:48:58 [main] MPIMain.main()
INFO: Args received: [2, false]
2022-04-11 14:48:58 [main] MPIMain.main()
INFO: Args received: [2, false]
[INFO] 14:48:59:484 config.GlobalConfig.initMPI(): Thread support level: 0
[INFO] 14:48:59:484 config.GlobalConfig.initMPI(): Thread support level: 0
[INFO] 14:48:59:490 config.GlobalConfig.init(): Init [MPI_CONNECTION, isSingleJVM:false]
[INFO] 14:48:59:490 config.GlobalConfig.init(): Init [MPI_CONNECTION, isSingleJVM:false]
[node301:22089:0:22090] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x149ea6032008)
==== backtrace (tid: 22090) ====
0 /usr/lib64/libucs.so.0(ucs_handle_error+0x2a4) [0x149e3a28b2a4]
1 /usr/lib64/libucs.so.0(+0x2347c) [0x149e3a28b47c]
2 /usr/lib64/libucs.so.0(+0x2364a) [0x149e3a28b64a]
3 [0x149e852414e0]
=================================
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x0000149e852414e0, pid=22089, tid=22090
#
# JRE version: Java(TM) SE Runtime Environment (11.0.2+9) (build 11.0.2+9-LTS)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (11.0.2+9-LTS, mixed mode, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# J 292 c1 java.util.Objects.requireNonNull(Ljava/lang/Object;Ljava/lang/String;)Ljava/lang/Object; java.base@11.0.2 (15 bytes) @ 0x0000149e852414e0 [0x0000149e85241460+0x0000000000000080]
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e" (or dumping to /net/ils/laudan/2-mpi-test/ping-pong-mpi-tcp/core.22089)
#
# An error report file with more information is saved as:
# /net/ils/laudan/2-mpi-test/ping-pong-mpi-tcp/hs_err_pid22089.log
Compiled method (c1) 1140 292 3 java.util.Objects::requireNonNull (15 bytes)
total in heap [0x0000149e85241290,0x0000149e85241748] = 1208
relocation [0x0000149e85241408,0x0000149e85241448] = 64
main code [0x0000149e85241460,0x0000149e85241600] = 416
stub code [0x0000149e85241600,0x0000149e852416a8] = 168
metadata [0x0000149e852416a8,0x0000149e852416b0] = 8
scopes data [0x0000149e852416b0,0x0000149e852416e0] = 48
scopes pcs [0x0000149e852416e0,0x0000149e85241740] = 96
dependencies [0x0000149e85241740,0x0000149e85241748] = 8
Could not load hsdis-amd64.so; library not loadable; PrintAssembly is disabled
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
#
[node301:22089] *** Process received signal ***
[node301:22089] Signal: Aborted (6)
[node301:22089] Signal code: (-6)
[node301:22089] [ 0] /usr/lib64/libpthread.so.0(+0x12c20)[0x149ea5aebc20]
[node301:22089] [ 1] /usr/lib64/libc.so.6(gsignal+0x10f)[0x149ea533637f]
[node301:22089] [ 2] /usr/lib64/libc.so.6(abort+0x127)[0x149ea5320db5]
[node301:22089] [ 3] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xc00be9)[0x149ea4a72be9]
[node301:22089] [ 4] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xe29619)[0x149ea4c9b619]
[node301:22089] [ 5] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xe29e9b)[0x149ea4c9be9b]
[node301:22089] [ 6] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xe29ece)[0x149ea4c9bece]
[node301:22089] [ 7] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xe2aedd)[0x149ea4c9cedd]
[node301:22089] [ 8] /usr/lib64/libpthread.so.0(+0x12c20)[0x149ea5aebc20]
[node301:22089] [ 9] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0x75f834)[0x149ea45d1834]
[node301:22089] [10] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0x75ee65)[0x149ea45d0e65]
[node301:22089] [11] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xe25376)[0x149ea4c97376]
[node301:22089] [12] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xe27b22)[0x149ea4c99b22]
[node301:22089] [13] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xe293b7)[0x149ea4c9b3b7]
[node301:22089] [14] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xe29e9b)[0x149ea4c9be9b]
[node301:22089] [15] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xe29ece)[0x149ea4c9bece]
[node301:22089] [16] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(JVM_handle_linux_signal+0x1c0)[0x149ea4a7da00]
[node301:22089] [17] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xbff5e8)[0x149ea4a715e8]
[node301:22089] [18] /usr/lib64/libpthread.so.0(+0x12c20)[0x149ea5aebc20]
[node301:22089] [19] [0x149e852414e0]
[node301:22089] *** End of error message ***
[INFO] 14:48:59:769 config.GlobalConfig.registerRole(): Registering role: Role{roleId='p0g2', myAddress=MPIAddress{rank=0, groupId=2}, isLeader=false}
[INFO] 14:48:59:783 config.GlobalConfig.registerAddress(): Address [MPIAddress{rank=0, groupId=2}] registered on role [Role{roleId='p0g2', myAddress=MPIAddress{rank=0, groupId=2}, isLeader=true}]
[INFO] 14:48:59:784 config.GlobalConfig.registerAddress(): Address [MPIAddress{rank=1, groupId=2}] registered on role [Role{roleId='p0g2', myAddress=MPIAddress{rank=0, groupId=2}, isLeader=true}]
[INFO] 14:48:59:784 role.Node.<init>(): Node created: Role{roleId='p0g2', myAddress=MPIAddress{rank=0, groupId=2}, isLeader=true}
[INFO] 14:49:00:790 testframework.TestFramework._doPingTests(): Starting ping-pong tests...
[node301:22088:0:22091] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xc)
==== backtrace (tid: 22091) ====
0 /usr/lib64/libucs.so.0(ucs_handle_error+0x2a4) [0x149785e4d2a4]
1 /usr/lib64/libucs.so.0(+0x2347c) [0x149785e4d47c]
2 /usr/lib64/libucs.so.0(+0x2364a) [0x149785e4d64a]
3 [0x1497d8725ef4]
=================================
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00001497d8725ef4, pid=22088, tid=22091
#
# JRE version: Java(TM) SE Runtime Environment (11.0.2+9) (build 11.0.2+9-LTS)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (11.0.2+9-LTS, mixed mode, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# J 440 c2 java.lang.StringBuilder.append(Ljava/lang/String;)Ljava/lang/StringBuilder; java.base@11.0.2 (8 bytes) @ 0x00001497d8725ef4 [0x00001497d8725ec0+0x0000000000000034]
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e" (or dumping to /net/ils/laudan/2-mpi-test/ping-pong-mpi-tcp/core.22088)
#
# An error report file with more information is saved as:
# /net/ils/laudan/2-mpi-test/ping-pong-mpi-tcp/hs_err_pid22088.log
Compiled method (c2) 2222 440 4 java.lang.StringBuilder::append (8 bytes)
total in heap [0x00001497d8725d10,0x00001497d87267d8] = 2760
relocation [0x00001497d8725e88,0x00001497d8725eb8] = 48
main code [0x00001497d8725ec0,0x00001497d87264a0] = 1504
stub code [0x00001497d87264a0,0x00001497d87264b8] = 24
metadata [0x00001497d87264b8,0x00001497d8726500] = 72
scopes data [0x00001497d8726500,0x00001497d87266a8] = 424
scopes pcs [0x00001497d87266a8,0x00001497d8726788] = 224
dependencies [0x00001497d8726788,0x00001497d8726790] = 8
handler table [0x00001497d8726790,0x00001497d87267a8] = 24
nul chk table [0x00001497d87267a8,0x00001497d87267d8] = 48
Could not load hsdis-amd64.so; library not loadable; PrintAssembly is disabled
[node301:22088] *** Process received signal ***
[node301:22088] Signal: Aborted (6)
[node301:22088] Signal code: (-6)
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
#
[node301:22088] [ 0] /usr/lib64/libpthread.so.0(+0x12c20)[0x1497f1729c20]
[node301:22088] [ 1] /usr/lib64/libc.so.6(gsignal+0x10f)[0x1497f0f7437f]
[node301:22088] [ 2] /usr/lib64/libc.so.6(abort+0x127)[0x1497f0f5edb5]
[node301:22088] [ 3] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xc00be9)[0x1497f06b0be9]
[node301:22088] [ 4] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xe29619)[0x1497f08d9619]
[node301:22088] [ 5] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xe29e9b)[0x1497f08d9e9b]
[node301:22088] [ 6] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xe29ece)[0x1497f08d9ece]
[node301:22088] [ 7] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(JVM_handle_linux_signal+0x1c0)[0x1497f06bba00]
[node301:22088] [ 8] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xbff5e8)[0x1497f06af5e8]
[node301:22088] [ 9] /usr/lib64/libpthread.so.0(+0x12c20)[0x1497f1729c20]
[node301:22088] [10] [0x1497d8725ef4]
[node301:22088] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node node301 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
I hope this information helps:
@jsquyres said:
I see that you're using UCX.
I also tried to run this on a cluster with an Omnipath network. This yields the same errors.
@jsquyres said:
run with my local Open MPI Java
When I try to run the example with my local installation (WSL - Ubuntu 20.04) everything works fine.
Also, I tried to call the OMPI C API directly via Java's Foreign Function & Memory API. (This API is still in incubator state, but I thought I'd give it a try; it is similar to JNI, but calls C binaries directly, without the intermediate binding code that JNI requires.)
While experimenting with this API, I ran into the same errors as with the official OMPI Java bindings. This led me to speculate that the issue doesn't lie within the OMPI Java bindings, but that Open MPI and the JVM somehow run into a conflict.
Does your app (and/or this sample ping-pong app) use any non-blocking MPI methods? If so, are direct buffers used?
The ping-pong example uses iSend and iRecv in two places. Both seem to be using direct buffers, as suggested by the documentation.
The example from #10158 has no send or receive at all, but crashes on a string operation as well. Because of this behaviour, I would think that this might not be related to sending messages, and maybe not even related to the OMPI Java bindings.
Hi, this issue was stale for a while. @jsquyres or @ggouaillardet, do you have any suggestion on how I could narrow down the reason for the error I am seeing?
@Janekdererste @jsquyres I'm having the same error with OpenMPI 4.0.1, 4.1.0 and also with IntelMPI.
The error is not happening when OpenMPI is compiled without UCX support.
I managed to mitigate it (I'm not saying solve it) by switching off the JIT compiler in the JVM:
java -Djava.compiler=NONE <MyClass>
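For completeness, -Djava.compiler=NONE is the legacy way to disable the JIT; on modern HotSpot JVMs, -Xint (interpreted-only mode) is the documented equivalent. Both lines below are illustrative command-line fragments (MyClass is a placeholder), not something tested against this project:

```shell
# Disable the JIT via the legacy system property, as suggested above:
java -Djava.compiler=NONE MyClass

# Equivalent on modern HotSpot: force interpreted-only execution:
java -Xint MyClass
```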
Hope this helps.
Thanks @stefanovaleri . I pivoted to re-implementing my model in Rust 🙃
Background information
This is related to #10158 . I am opening a separate issue hoping to provide a better example of the error I am seeing.
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
4.1.2
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
source
Please describe the system on which you are running
Details of the problem
As recommended by Howard from the mailing list, I tried to run the ping-pong-mpi-tcp project. When I run the program in a similar fashion to
mpi.sh
in the repository, I receive a segmentation fault. The error happens in StringBuilder.append(String)
. The error report of the JVM is attached here: hs_err_pid31983.log. The log of the application looks like the following. The program was started with the following command:
Any help would be much appreciated.