open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 1. #12503

Open mikiefromhell opened 4 months ago

mikiefromhell commented 4 months ago

Background information

Hello! I am running CP2K (a molecular dynamics simulation software) on a shell connected remotely to a supercomputer. I tried submitting a job today, and it did not quite work.

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

3.1.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

This line is part of the sbatch file that I use to run jobs:

module load openmpi/3.1.1

The module is loaded from Discovery on Open On Demand (Northeastern U's supercomputer)

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

/

Please describe the system on which you are running


Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

Note: If you include verbatim output (or a code block), please use a GitHub Markdown code block like below:

shell$ mpirun -n 2 ./hello_world

mikiefromhell commented 4 months ago

I found this thread and checked my PATH and LD_LIBRARY_PATH: https://github.com/horovod/horovod/issues/133

LD_LIBRARY_PATH is not set, but I am not an admin on the server, and I do not think that is the whole problem, because I was able to run a different job a few days ago!
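For reference, a minimal sketch of how one might record what the job environment actually sees, assuming a POSIX shell. The name report_mpi_env is a hypothetical helper, not part of Open MPI or Slurm. An unset LD_LIBRARY_PATH is not necessarily an error by itself, so this only reports the values and does not judge them.

```shell
# Hypothetical helper: print where mpirun resolves from and what
# LD_LIBRARY_PATH contains in the current shell.
report_mpi_env() {
    echo "mpirun: $(command -v mpirun || echo 'not found on PATH')"
    echo "LD_LIBRARY_PATH: ${LD_LIBRARY_PATH:-<unset or empty>}"
}

# Example use inside the sbatch script, after "module load openmpi/3.1.1":
# report_mpi_env
```

Calling it once before and once after the module load line would show whether the module changes either variable.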

jsquyres commented 4 months ago

The error message is telling you that your application decided to abort for some reason (i.e., it called the MPI_ABORT API function). I'm unfamiliar with CP2K, so I don't know why it would have done that. You might want to look through the output and see if there's other warning/error messages before the abort message.
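One way to look for earlier messages is to print the lines that come just before the abort banner in the job's output file. The following is a sketch only: show_abort_context is a hypothetical helper name, the log file name is a placeholder, and it assumes a grep that supports -B (GNU and BSD grep do).

```shell
# Hypothetical helper: show the lines of a job log that precede the
# MPI_ABORT banner, where an application's own error text may appear.
# Usage: show_abort_context LOGFILE [LINES_BEFORE]
show_abort_context() {
    grep -n -B "${2:-20}" 'MPI_ABORT was invoked' "$1"
}

# Example (the log file name is a placeholder):
# show_abort_context slurm-12345.out 40
```

If nothing useful precedes the banner, the application may be writing its diagnostics to a separate output file of its own.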

Also, Open MPI v3.1.1 is fairly ancient. At a bare minimum, I would suggest upgrading to the latest 3.1.x version (v3.1.6), because it contains bunches of bug fixes beyond 3.1.1.

That being said, 3.1.6 is from March of 2020, and is still pretty ancient. We are unlikely to ever make any more releases in the v3.1.x series.

The most recent version of Open MPI is v5.0.3 -- I'd suggest upgrading to that.

mikiefromhell commented 4 months ago

Hello @jsquyres Jeff, thank you for your response! That was actually the only message in the output, and no error file was created. I understand that it is an ancient version, but this server is unfortunately not managed by me, and the CP2K package relies on version 3.1.1. This is what comes up when I type module show cp2k:

[image]

Unfortunately, the most recent version of openmpi I have access to is 4.1.4.

I also tried running a different simulation and I got another MPI error, albeit a different one:

[[57845,1],0]: A high-performance Open MPI point-to-point messaging module was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
Host: c0279

Another transport will be used instead, although this may result in lower performance.

NOTE: You can disable this warning by setting the MCA parameter btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them.
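The warning text names the MCA parameter that silences it. Below is a sketch of the two usual ways to set an MCA parameter: an OMPI_MCA_-prefixed environment variable, or mpirun's --mca option. The executable and input file names are placeholders. This only hides the openib warning; it does not change why the application called MPI_ABORT.

```shell
# Environment form: Open MPI reads MCA parameters from variables
# named OMPI_MCA_<parameter>.
export OMPI_MCA_btl_base_warn_component_unused=0

# Command-line form (executable and input names are placeholders):
# mpirun --mca btl_base_warn_component_unused 0 ./my_app input.inp
```

In an sbatch script, the export line would go before the line that launches the application.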

jsquyres commented 4 months ago

With Open MPI v4.1.4, it looks like you got an additional warning but the same underlying error (i.e., the application invoked MPI_ABORT). The CP2K application has chosen to abort; you'll have to look at their docs and/or source code for more information on why the application chose to abort.

I'm afraid we can't help you with whatever environment NEU has set up to run CP2K, nor can we help with CP2K itself -- we're not involved in either of those organizations.

mikiefromhell commented 4 months ago

Hello Jeff,

I was able to run a few CP2K jobs from a tutorial website. The shell still outputs MPI errors, but no aborts. I am assuming, as you suggested, that the problem is with my input files and not the MPI package. Thank you!

samyog111 commented 3 weeks ago

I am also getting this error while running the mesher in Specfem2D:

MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 30.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them.

Can you please tell me what causes this error?

jsquyres commented 3 weeks ago

@samyog111 See https://github.com/open-mpi/ompi/issues/12503#issuecomment-2083590684.