open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Java Segmentation Fault in ping-pong-mpi-tcp #10223

Janekdererste opened this issue 2 years ago

Janekdererste commented 2 years ago

Background information

This is related to #10158. I am opening a separate issue in the hope of providing a better example of the error I am seeing.

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

4.1.2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

source

Please describe the system on which you are running


Details of the problem

As recommended by Howard from the mailing list, I tried to run the ping-pong-mpi-tcp project. When I run the program in a similar fashion to mpi.sh in the repository, I receive a segmentation fault. The error happens in StringBuilder.append(String). The JVM's error report is attached here: hs_err_pid31983.log. The application log looks like the following:

2022-04-05 16:50:33 [main] MPIMain.main()
INFO: Args received: [2, false]
2022-04-05 16:50:33 [main] MPIMain.main()
INFO: Args received: [2, false]
[INFO] 16:50:34:294 config.GlobalConfig.initMPI(): Thread support level: 0
[INFO] 16:50:34:294 config.GlobalConfig.initMPI(): Thread support level: 0
[INFO] 16:50:34:298 config.GlobalConfig.init(): Init [MPI_CONNECTION, isSingleJVM:false]
[INFO] 16:50:34:298 config.GlobalConfig.init(): Init [MPI_CONNECTION, isSingleJVM:false]
[INFO] 16:50:35:494 config.GlobalConfig.registerRole(): Registering role: Role{roleId='p0g2', myAddress=MPIAddress{rank=0, groupId=2}, isLeader=false}
[INFO] 16:50:35:503 config.GlobalConfig.registerRole(): Registering role: Role{roleId='p1g2', myAddress=MPIAddress{rank=1, groupId=2}, isLeader=false}
[INFO] 16:50:35:540 config.GlobalConfig.registerAddress(): Address [MPIAddress{rank=0, groupId=2}] registered on role [Role{roleId='p0g2', myAddress=MPIAddress{rank=0, groupId=2}, isLeader=true}]
[INFO] 16:50:35:540 config.GlobalConfig.registerAddress(): Address [MPIAddress{rank=1, groupId=2}] registered on role [Role{roleId='p0g2', myAddress=MPIAddress{rank=0, groupId=2}, isLeader=true}]
[INFO] 16:50:35:541 role.Node.<init>(): Node created: Role{roleId='p0g2', myAddress=MPIAddress{rank=0, groupId=2}, isLeader=true}
[INFO] 16:50:35:541 config.GlobalConfig.registerAddress(): Address [MPIAddress{rank=1, groupId=2}] registered on role [Role{roleId='p1g2', myAddress=MPIAddress{rank=1, groupId=2}, isLeader=true}]
[INFO] 16:50:35:541 config.GlobalConfig.registerAddress(): Address [MPIAddress{rank=0, groupId=2}] registered on role [Role{roleId='p1g2', myAddress=MPIAddress{rank=1, groupId=2}, isLeader=false}]
[INFO] 16:50:35:542 role.Node.<init>(): Node created: Role{roleId='p1g2', myAddress=MPIAddress{rank=1, groupId=2}, isLeader=false}
[node500:31983:0:31990] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x14)
[INFO] 16:50:36:566 testframework.TestFramework._doPingTests(): Starting ping-pong tests...
==== backtrace (tid:  31990) ====
 0  /usr/lib64/libucs.so.0(ucs_handle_error+0x2a4) [0x1512416f32a4]
 1  /usr/lib64/libucs.so.0(+0x2347c) [0x1512416f347c]
 2  /usr/lib64/libucs.so.0(+0x2364a) [0x1512416f364a]
 3  [0x1512949256d4]
=================================
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00001512949256d4 (sent by kill), pid=31983, tid=31990
#
# JRE version: OpenJDK Runtime Environment (17.0.2+8) (build 17.0.2+8-86)
# Java VM: OpenJDK 64-Bit Server VM (17.0.2+8-86, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, serial gc, linux-amd64)
# Problematic frame:
# J 448 c2 java.lang.StringBuilder.append(Ljava/lang/String;)Ljava/lang/StringBuilder; java.base@17.0.2 (8 bytes) @ 0x00001512949256d4 [0x00001512949256a0+0x0000000000000034]
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e" (or dumping to /net/ils/laudan/2-mpi-test/core.31983)
#
# An error report file with more information is saved as:
# /net/ils/laudan/2-mpi-test/hs_err_pid31983.log
Compiled method (c2)    3781  448       4       java.lang.StringBuilder::append (8 bytes)
 total in heap  [0x0000151294925510,0x0000151294925fb0] = 2720
 relocation     [0x0000151294925670,0x00001512949256a0] = 48
 main code      [0x00001512949256a0,0x0000151294925be0] = 1344
 stub code      [0x0000151294925be0,0x0000151294925bf8] = 24
 metadata       [0x0000151294925bf8,0x0000151294925c50] = 88
 scopes data    [0x0000151294925c50,0x0000151294925e88] = 568
 scopes pcs     [0x0000151294925e88,0x0000151294925f68] = 224
 dependencies   [0x0000151294925f68,0x0000151294925f70] = 8
 handler table  [0x0000151294925f70,0x0000151294925f88] = 24
 nul chk table  [0x0000151294925f88,0x0000151294925fb0] = 40
[node500:31983] *** Process received signal ***
[node500:31983] Signal: Aborted (6)
[node500:31983] Signal code:  (-6)
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
#
[node500:31983] [ 0] /usr/lib64/libpthread.so.0(+0x12c20)[0x1512aa492c20]
[node500:31983] [ 1] /usr/lib64/libc.so.6(gsignal+0x10f)[0x1512a9eee37f]
[node500:31983] [ 2] /usr/lib64/libc.so.6(abort+0x127)[0x1512a9ed8db5]
[node500:31983] [ 3] /net/homes/ils/laudan/jdk-17.0.2/lib/server/libjvm.so(+0x246cc9)[0x1512a8e90cc9]
[node500:31983] [ 4] /net/homes/ils/laudan/jdk-17.0.2/lib/server/libjvm.so(+0xe0e70c)[0x1512a9a5870c]
[node500:31983] [ 5] /net/homes/ils/laudan/jdk-17.0.2/lib/server/libjvm.so(+0xe0f12b)[0x1512a9a5912b]
[node500:31983] [ 6] /net/homes/ils/laudan/jdk-17.0.2/lib/server/libjvm.so(+0xe0f15e)[0x1512a9a5915e]
[node500:31983] [ 7] /net/homes/ils/laudan/jdk-17.0.2/lib/server/libjvm.so(JVM_handle_linux_signal+0x198)[0x1512a9906148]
[node500:31983] [ 8] /usr/lib64/libpthread.so.0(+0x12c20)[0x1512aa492c20]
[node500:31983] [ 9] [0x1512949256d4]
[node500:31983] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node node500 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

The program was started with the following command:

OMPI="/homes2/ils/laudan/ompi-java17/bin/mpirun"
JAVA="/homes2/ils/laudan/jdk-17.0.2/bin/java"
$OMPI --mca pml ucx -np 2\
 $JAVA -cp ping-pong-mpi-tcp-1.0-SNAPSHOT-jar-with-dependencies.jar MPIMain 2 false

Any help would be much appreciated.

jsquyres commented 2 years ago

Forgive me; I'm fairly ignorant of Java. I tried to replicate the problem, but I noticed that when I run mpi.sh from the https://github.com/mboysan/ping-pong-mpi-tcp project, the Maven build downloads an Open MPI jarfile for v4.0.1. I assume it then uses that jarfile to run the application.

How do I get it to use my local Open MPI Java install?

Janekdererste commented 2 years ago

Thank you very much for trying it out. This is a little tedious, unfortunately.

The jar has to be in some kind of Maven repository. You could create a local Maven repository like this:

$ mvn deploy:deploy-file \
-Durl=file:///path/to/where/you/want/the/local/maven/repo/to/be \
-Dfile=/path/to/your/mpi.jar \
-DgroupId=org.openmpi \
-DartifactId=mpi \
-Dpackaging=jar \
-Dversion=4.1.2

The artifactId, groupId, and version can be chosen however you like.

Then, in pom.xml, the local repository needs to be added to the <repositories> section, like this:

<repository>
    <id>Local-mpi</id>
    <url>file:///path/to/where/you/want/the/local/maven/repo/to/be</url>
</repository>

Then, in the <dependencies> section of pom.xml, you can replace the mpi dependency with the following:

<dependency>
    <groupId>org.openmpi</groupId>
    <artifactId>mpi</artifactId>
    <version>4.1.2</version>
</dependency>

The groupId, artifactId, and version must match what you specified in the mvn deploy:deploy-file command.

jsquyres commented 2 years ago

Thanks! Let me give this a whirl.

jsquyres commented 2 years ago

@Janekdererste Thanks for the instructions -- with that, I got the ping pong to compile and run with my local Open MPI Java build.

Unfortunately, it runs successfully for me. However, I see that you're using UCX. I wonder if there's some kind of conflict here with registered memory for InfiniBand...?

@open-mpi/ucx Can you please have a look at this? NOTE: While the same issue undoubtedly exists in main/v5.0.x, be aware of #10245 in terms of installing PMIx/PRTE in the same prefix as OMPI (at least until the issue is resolved).

yosefe commented 2 years ago

@Janekdererste can you please try without ucx (replace "--mca pml ucx" by "-mca btl tcp")? @jsquyres how do the Java Open MPI bindings ensure that the buffer's virtual address is not moved before a non-blocking MPI request completes?

jsquyres commented 2 years ago

@yosefe IIRC, the Java bindings use some sort of special type of Java buffer that is effectively pinned.
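If memory serves, it's the MPI.new<Type>Buffer() family of factory methods. A minimal sketch of the idea, assuming the standard mpi.jar bindings (the class name here is mine):

import java.nio.ByteBuffer;
import mpi.MPI;
import mpi.MPIException;

public class BufferKinds {
    public static void main(String[] args) throws MPIException {
        MPI.Init(args);
        // Direct buffer: lives in native memory outside the Java heap,
        // so the GC will not move it while MPI holds its address.
        ByteBuffer direct = MPI.newByteBuffer(1024);
        // A plain Java array, by contrast, lives on the heap and may be
        // relocated by the GC at any time.
        byte[] heap = new byte[1024];
        System.out.println("direct? " + direct.isDirect() + ", heap len " + heap.length);
        MPI.Finalize();
    }
}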

ggouaillardet commented 2 years ago

@Janekdererste in order to force btl/tcp you will also need to force the ob1 pml (otherwise the ucx pml can still be selected):

mpirun --mca pml ob1 --mca btl tcp,self

jsquyres commented 2 years ago

@yosefe Check out https://docs.open-mpi.org/en/v5.0.x/features/java.html#how-to-specify-buffers. These are the upcoming 5.0 docs, but the content basically hasn't changed since Open MPI v4.x.

jsquyres commented 2 years ago

@Janekdererste While researching this issue, I also updated the Java Open MPI docs. I came across this statement in the docs:

All non-blocking methods must use direct buffers and only blocking methods can choose between arrays and direct buffers.

Does your app (and/or this sample ping-pong app) use any non-blocking MPI methods? If so, are direct buffers used?
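For reference, a minimal sketch (mine, not taken from the ping-pong app) of what the docs are asking for -- a non-blocking exchange where the buffer is direct, allocated via MPI.newIntBuffer():

import java.nio.IntBuffer;
import mpi.MPI;
import mpi.MPIException;
import mpi.Request;

public class NonBlockingExample {
    public static void main(String[] args) throws MPIException {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.getRank();

        // Direct buffer backed by native memory; its address stays
        // valid for the entire lifetime of the pending request.
        IntBuffer buf = MPI.newIntBuffer(1);

        if (rank == 0) {
            buf.put(0, 42);
            Request req = MPI.COMM_WORLD.iSend(buf, 1, MPI.INT, 1, 0);
            req.waitFor(); // do not touch buf before this returns
        } else if (rank == 1) {
            Request req = MPI.COMM_WORLD.iRecv(buf, 1, MPI.INT, 0, 0);
            req.waitFor();
            System.out.println("rank 1 received " + buf.get(0));
        }
        MPI.Finalize();
    }
}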

Janekdererste commented 2 years ago

Thanks all for helping with this issue. I tried to run the program with mpirun --mca pml ob1 --mca btl tcp,self -np 2 java -cp ./target/*jar-with-dependencies.jar MPIMain 2 false as suggested by @ggouaillardet. The program seems to crash in the same place. The JVM dumps of the two processes are attached (hs_err_pid22088.log, hs_err_pid22089.log), and the log looks like this:

2022-04-11 14:48:58 [main] MPIMain.main()
INFO: Args received: [2, false]
2022-04-11 14:48:58 [main] MPIMain.main()
INFO: Args received: [2, false]
[INFO] 14:48:59:484 config.GlobalConfig.initMPI(): Thread support level: 0
[INFO] 14:48:59:484 config.GlobalConfig.initMPI(): Thread support level: 0
[INFO] 14:48:59:490 config.GlobalConfig.init(): Init [MPI_CONNECTION, isSingleJVM:false]
[INFO] 14:48:59:490 config.GlobalConfig.init(): Init [MPI_CONNECTION, isSingleJVM:false]
[node301:22089:0:22090] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x149ea6032008)
==== backtrace (tid:  22090) ====
 0  /usr/lib64/libucs.so.0(ucs_handle_error+0x2a4) [0x149e3a28b2a4]
 1  /usr/lib64/libucs.so.0(+0x2347c) [0x149e3a28b47c]
 2  /usr/lib64/libucs.so.0(+0x2364a) [0x149e3a28b64a]
 3  [0x149e852414e0]
=================================
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000149e852414e0, pid=22089, tid=22090
#
# JRE version: Java(TM) SE Runtime Environment (11.0.2+9) (build 11.0.2+9-LTS)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (11.0.2+9-LTS, mixed mode, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# J 292 c1 java.util.Objects.requireNonNull(Ljava/lang/Object;Ljava/lang/String;)Ljava/lang/Object; java.base@11.0.2 (15 bytes) @ 0x0000149e852414e0 [0x0000149e85241460+0x0000000000000080]
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e" (or dumping to /net/ils/laudan/2-mpi-test/ping-pong-mpi-tcp/core.22089)
#
# An error report file with more information is saved as:
# /net/ils/laudan/2-mpi-test/ping-pong-mpi-tcp/hs_err_pid22089.log
Compiled method (c1)    1140  292       3       java.util.Objects::requireNonNull (15 bytes)
 total in heap  [0x0000149e85241290,0x0000149e85241748] = 1208
 relocation     [0x0000149e85241408,0x0000149e85241448] = 64
 main code      [0x0000149e85241460,0x0000149e85241600] = 416
 stub code      [0x0000149e85241600,0x0000149e852416a8] = 168
 metadata       [0x0000149e852416a8,0x0000149e852416b0] = 8
 scopes data    [0x0000149e852416b0,0x0000149e852416e0] = 48
 scopes pcs     [0x0000149e852416e0,0x0000149e85241740] = 96
 dependencies   [0x0000149e85241740,0x0000149e85241748] = 8
Could not load hsdis-amd64.so; library not loadable; PrintAssembly is disabled
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
[node301:22089] *** Process received signal ***
[node301:22089] Signal: Aborted (6)
[node301:22089] Signal code:  (-6)
[node301:22089] [ 0] /usr/lib64/libpthread.so.0(+0x12c20)[0x149ea5aebc20]
[node301:22089] [ 1] /usr/lib64/libc.so.6(gsignal+0x10f)[0x149ea533637f]
[node301:22089] [ 2] /usr/lib64/libc.so.6(abort+0x127)[0x149ea5320db5]
[node301:22089] [ 3] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xc00be9)[0x149ea4a72be9]
[node301:22089] [ 4] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xe29619)[0x149ea4c9b619]
[node301:22089] [ 5] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xe29e9b)[0x149ea4c9be9b]
[node301:22089] [ 6] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xe29ece)[0x149ea4c9bece]
[node301:22089] [ 7] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xe2aedd)[0x149ea4c9cedd]
[node301:22089] [ 8] /usr/lib64/libpthread.so.0(+0x12c20)[0x149ea5aebc20]
[node301:22089] [ 9] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0x75f834)[0x149ea45d1834]
[node301:22089] [10] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0x75ee65)[0x149ea45d0e65]
[node301:22089] [11] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xe25376)[0x149ea4c97376]
[node301:22089] [12] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xe27b22)[0x149ea4c99b22]
[node301:22089] [13] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xe293b7)[0x149ea4c9b3b7]
[node301:22089] [14] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xe29e9b)[0x149ea4c9be9b]
[node301:22089] [15] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xe29ece)[0x149ea4c9bece]
[node301:22089] [16] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(JVM_handle_linux_signal+0x1c0)[0x149ea4a7da00]
[node301:22089] [17] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xbff5e8)[0x149ea4a715e8]
[node301:22089] [18] /usr/lib64/libpthread.so.0(+0x12c20)[0x149ea5aebc20]
[node301:22089] [19] [0x149e852414e0]
[node301:22089] *** End of error message ***
[INFO] 14:48:59:769 config.GlobalConfig.registerRole(): Registering role: Role{roleId='p0g2', myAddress=MPIAddress{rank=0, groupId=2}, isLeader=false}
[INFO] 14:48:59:783 config.GlobalConfig.registerAddress(): Address [MPIAddress{rank=0, groupId=2}] registered on role [Role{roleId='p0g2', myAddress=MPIAddress{rank=0, groupId=2}, isLeader=true}]
[INFO] 14:48:59:784 config.GlobalConfig.registerAddress(): Address [MPIAddress{rank=1, groupId=2}] registered on role [Role{roleId='p0g2', myAddress=MPIAddress{rank=0, groupId=2}, isLeader=true}]
[INFO] 14:48:59:784 role.Node.<init>(): Node created: Role{roleId='p0g2', myAddress=MPIAddress{rank=0, groupId=2}, isLeader=true}
[INFO] 14:49:00:790 testframework.TestFramework._doPingTests(): Starting ping-pong tests...
[node301:22088:0:22091] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xc)
==== backtrace (tid:  22091) ====
 0  /usr/lib64/libucs.so.0(ucs_handle_error+0x2a4) [0x149785e4d2a4]
 1  /usr/lib64/libucs.so.0(+0x2347c) [0x149785e4d47c]
 2  /usr/lib64/libucs.so.0(+0x2364a) [0x149785e4d64a]
 3  [0x1497d8725ef4]
=================================
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00001497d8725ef4, pid=22088, tid=22091
#
# JRE version: Java(TM) SE Runtime Environment (11.0.2+9) (build 11.0.2+9-LTS)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (11.0.2+9-LTS, mixed mode, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# J 440 c2 java.lang.StringBuilder.append(Ljava/lang/String;)Ljava/lang/StringBuilder; java.base@11.0.2 (8 bytes) @ 0x00001497d8725ef4 [0x00001497d8725ec0+0x0000000000000034]
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e" (or dumping to /net/ils/laudan/2-mpi-test/ping-pong-mpi-tcp/core.22088)
#
# An error report file with more information is saved as:
# /net/ils/laudan/2-mpi-test/ping-pong-mpi-tcp/hs_err_pid22088.log
Compiled method (c2)    2222  440       4       java.lang.StringBuilder::append (8 bytes)
 total in heap  [0x00001497d8725d10,0x00001497d87267d8] = 2760
 relocation     [0x00001497d8725e88,0x00001497d8725eb8] = 48
 main code      [0x00001497d8725ec0,0x00001497d87264a0] = 1504
 stub code      [0x00001497d87264a0,0x00001497d87264b8] = 24
 metadata       [0x00001497d87264b8,0x00001497d8726500] = 72
 scopes data    [0x00001497d8726500,0x00001497d87266a8] = 424
 scopes pcs     [0x00001497d87266a8,0x00001497d8726788] = 224
 dependencies   [0x00001497d8726788,0x00001497d8726790] = 8
 handler table  [0x00001497d8726790,0x00001497d87267a8] = 24
 nul chk table  [0x00001497d87267a8,0x00001497d87267d8] = 48
Could not load hsdis-amd64.so; library not loadable; PrintAssembly is disabled
[node301:22088] *** Process received signal ***
[node301:22088] Signal: Aborted (6)
[node301:22088] Signal code:  (-6)
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
[node301:22088] [ 0] /usr/lib64/libpthread.so.0(+0x12c20)[0x1497f1729c20]
[node301:22088] [ 1] /usr/lib64/libc.so.6(gsignal+0x10f)[0x1497f0f7437f]
[node301:22088] [ 2] /usr/lib64/libc.so.6(abort+0x127)[0x1497f0f5edb5]
[node301:22088] [ 3] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xc00be9)[0x1497f06b0be9]
[node301:22088] [ 4] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xe29619)[0x1497f08d9619]
[node301:22088] [ 5] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xe29e9b)[0x1497f08d9e9b]
[node301:22088] [ 6] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xe29ece)[0x1497f08d9ece]
[node301:22088] [ 7] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(JVM_handle_linux_signal+0x1c0)[0x1497f06bba00]
[node301:22088] [ 8] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xbff5e8)[0x1497f06af5e8]
[node301:22088] [ 9] /usr/lib64/libpthread.so.0(+0x12c20)[0x1497f1729c20]
[node301:22088] [10] [0x1497d8725ef4]
[node301:22088] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node node301 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

Janekdererste commented 2 years ago

I hope this information helps:

@jsquyres said:

I see that you're using UCX.

I also tried to run this on a cluster with an Omni-Path network. This yields the same errors.

@jsquyres said:

run with my local Open MPI Java

When I try to run the example with my local installation (WSL, Ubuntu 20.04), everything works fine.

Also, I tried to call the Open MPI C API directly via Java's Foreign Function & Memory API. (This API is still in incubator state, but I thought I'd give it a try. It is similar to JNI, but calls C libraries directly, without the intermediate binding code that JNI requires.)

While experimenting with this API I ran into the same errors as with the official Open MPI Java bindings. This led me to speculate that the issue doesn't lie within the Java bindings, but that Open MPI and the JVM somehow run into a conflict.

Janekdererste commented 2 years ago

Does your app (and/or this sample ping-pong app) use any non-blocking MPI methods? If so, are direct buffers used?

The ping-pong example uses iSend and iRecv in two places. Both seem to use direct buffers, as the documentation suggests.

The example from #10158 has no send or receive at all, but crashes on a string operation as well. Given this behaviour, I suspect the problem is not related to sending messages, and maybe not even to the Open MPI Java bindings.

Janekdererste commented 2 years ago

Hi, this issue has been stale for a while. @jsquyres or @ggouaillardet, do you have any suggestions on how I could narrow down the cause of the error I am seeing?

stefanovaleri commented 1 year ago

@Janekdererste @jsquyres I'm having the same error with Open MPI 4.0.1 and 4.1.0, and also with Intel MPI. The error does not happen when Open MPI is compiled without UCX support. I managed to mitigate it (I'm not saying solve it) by switching off the JIT compiler in the JVM: java -Djava.compiler=NONE <MyClass>
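With the mpirun invocation from the original report, that would look something like this (paths and class name are the reporter's; adjust for your setup):

$OMPI --mca pml ucx -np 2 \
 $JAVA -Djava.compiler=NONE -cp ping-pong-mpi-tcp-1.0-SNAPSHOT-jar-with-dependencies.jar MPIMain 2 false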

Hope this helps.

Janekdererste commented 1 year ago

Thanks @stefanovaleri. I pivoted to re-implementing my model in Rust 🙃