open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

MPI Connect/accept broken except when from within a single mpirun #3458

Closed (rhc54 closed this 5 years ago)

rhc54 commented 7 years ago

Thank you for taking the time to submit an issue!

Background information

Multiple users have reported that MPI connect/accept no longer works when executed between two applications started by separate cmd lines. This includes both passing the "port" on the cmd line and using ompi-server as the go-between.

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

Sadly, this goes back to the 2.x series and continues through 3.x to master.

Details of the problem

When we switched to PMIx for our wireup, the "port" no longer represents a typical TCP URI. It instead contains info PMIx needs for publish/lookup to rendezvous. Fixing the problem requires a little thought, as application procs no longer have access to the OOB, and we'd rather not revert to giving them that access.
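For readers hitting this, the pattern under discussion is roughly the following (a minimal sketch, not taken from any attached reproducer; the file name, argument handling, and the way the port string is relayed are all illustrative):

/* connect_accept_sketch.c - hypothetical example; compile with mpicc.
 * Start one copy as "server" under one mpirun, copy the printed port
 * string, and pass it to a "client" copy started under a second mpirun. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm other;

    MPI_Init(&argc, &argv);

    if (argc > 1 && 0 == strcmp(argv[1], "server")) {
        MPI_Open_port(MPI_INFO_NULL, port);      /* runtime hands back a "port" string */
        printf("PORT: %s\n", port);              /* relay this to the client by hand   */
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &other);
        MPI_Close_port(port);
    } else {
        /* client: expects the port string as its second argument, e.g.
         *   mpirun -n 1 ./a.out client "<port string>"                 */
        MPI_Comm_connect(argv[2], MPI_INFO_NULL, 0, MPI_COMM_SELF, &other);
    }

    MPI_Comm_disconnect(&other);
    MPI_Finalize();
    return 0;
}

Within a single mpirun this sequence works; the bug discussed here is that it fails when the two sides come from separate mpirun invocations.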

rhc54 commented 7 years ago

Fixed in master by checking for ompi-server presence (if launched by mpirun), or availability of publish/lookup support if direct launched, and outputting a friendly show-help message if not. Operation of ompi-server was also repaired for v3.0.

Backports to the 2.x series are not planned.

tjb900 commented 7 years ago

(apologies in advance if I should have opened a new issue instead)

@rhc54 Thanks very much for looking into this - I was one of the ones hoping to use this feature. Unfortunately it still seems to be giving an error (though a different one this time):

[host:20393] [[15787,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file runtime/orte_data_server.c at line 433
[host:20406] [[15789,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file dpm/dpm.c at line 401
[host:20393] [[15787,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file runtime/orte_data_server.c at line 433
[host:20417] [[15800,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file dpm/dpm.c at line 401

I've attached a relatively simple reproducer, which for me gives the above errors on today's master (1f799afa3).

test3.zip

A different test, using connect/accept within a single mpirun instance, still works fine.

rhc54 commented 7 years ago

I've gone back and looked at where this stands, and found that I had fixed ompi-server, but there was still some work left to resolve the cross-mpirun connect issue. I've taken it as far as I have time for right now and will commit those changes. However, it won't fix the problem, and so we won't port it to a release branch.

There remains an issue over how the callbacks are flowing at the end of the connect operation. An object is apparently being released at an incorrect time.

I'm not sure who will be picking this up. Sorry I can't be of more help.

rhc54 commented 6 years ago

@hppritcha Just a reminder - this is still hanging around.

rhc54 commented 6 years ago

FWIW: the connect/disconnect support was never implemented in ORTE for the v2.x series.

derangedhk417 commented 5 years ago

Perhaps this is a stupid question, as I am not very familiar with GitHub. Is this actively being worked on? I am running into this problem as well.

rhc54 commented 5 years ago

Not at the moment. It is considered a low priority, I'm afraid, and we don't have anyone focused on it.

derangedhk417 commented 5 years ago

Thanks for the response. If anyone reading this is interested, I have written a reusable workaround for this problem. If anyone shows interest, I'll clean it up and put it in a public repo. (It's not a source modification; it's a separate .h file.)

Summerdave commented 5 years ago

Yes, I am interested. We had to disable some functionality when running on OpenMPI because of this. How did you get around it?

derangedhk417 commented 5 years ago

I have the code up here (https://github.com/derangedhk417/mpi_controller). It's just a basic wrapper around some POSIX shared memory functions. It makes use of semaphores to handle synchronization between the controller and the child. I haven't exactly made this super user friendly, but it should do the trick. I'll try to add some documentation in the next few hours.
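(Not the actual code from that repo, but the general shape of such a workaround, with made-up names, is something like this: the controller creates a shared-memory region and a named semaphore, writes into the region, and posts the semaphore; the child opens the same names, waits on the semaphore, and reads.)

/* shm_sketch.c - illustrative controller side only; link with -lrt -pthread. */
#include <fcntl.h>
#include <semaphore.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/demo_region"   /* made-up names, not from mpi_controller */
#define SEM_NAME "/demo_ready"
#define SHM_SIZE 4096

int main(void)
{
    /* create and map the shared region */
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    ftruncate(fd, SHM_SIZE);
    char *buf = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* named semaphore used to signal the child that data is ready */
    sem_t *ready = sem_open(SEM_NAME, O_CREAT, 0600, 0);

    strcpy(buf, "hello from controller");
    sem_post(ready);              /* the child does sem_wait(ready) before reading buf */

    munmap(buf, SHM_SIZE);        /* the child opens the same names without O_CREAT */
    close(fd);
    sem_close(ready);
    return 0;
}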


rhc54 commented 5 years ago

Another option was brought to my attention today. If you know that one of the mpirun executions will always be running, then you can point the other mpiruns to it as the "ompi-server" like this:

$ mpirun -n 3 --report-uri myuri.txt myapp &
$ mpirun -n 2 --ompi-server file:myuri.txt myotherapp

This makes the first mpirun act as the global server. I'm not sure it will solve the problem, but it might be worth trying.

nelsonspbr commented 5 years ago

@rhc54 This hasn't worked at least for me, unfortunately :(

Have there been any updates on this?

rhc54 commented 5 years ago

Not really - the developer community judged it not worth fixing and so it has sat idle. Based on current plans, it will be fixed in this year's v5.0 release - but not likely before then.

Note that you can optionally execute your OMPI job against the PMIx Reference RTE (PRRTE). I believe this is working in that environment. See https://pmix.org/support/how-to/running-apps-under-psrvr/ for info.

jrhemstad commented 5 years ago

@rhc54 I wanted to let you know that support for these APIs is important to us in Dask. See https://github.com/dask/dask-mpi/issues/25

Our use case is that we need a way to bring already-existing processes into MPI (without launching new processes) and build up a communicator among them.

maddyscientist commented 5 years ago

This issue is also a blocker for our use of OpenMPI with our MPI job manager (mpi_jm), which we use to increase job utilization on large supercomputers for sub-nuclear physics simulations (https://arxiv.org/pdf/1810.01609.pdf). This has forced us to use MVAPICH which, compared to OpenMPI (or Spectrum MPI), results in reduced performance, but correctness is godliness in comparison.

("We" here being CalLat, a collaboration of physicists centred at LLNL and LBNL, using Summit, Sierra, Titan, etc.)

rhc54 commented 5 years ago

Okay, you've convinced me - I'll free up some time this week and fix it. Not sure when it will be released, however, so please be patient.

rhc54 commented 5 years ago

Okay, you guys - the fix is here: https://github.com/open-mpi/ompi/pull/6439

Once it gets through CI I'll post a PR to backport it to the release branches.

datametrician commented 5 years ago

> Okay, you've convinced me - I'll free up some time this week and fix it. Not sure when it will be released, however, so please be patient.

I can't thank you enough for this! Thank you thank you thank you!

gpaulsen commented 5 years ago

@rhc54 Can this issue be closed?

q2luo commented 4 years ago

MPI_Comm_connect/MPI_Comm_accept in 4.0.2 still do not work except from within a single mpirun. We're stuck at 1.6.5 and cannot upgrade to any later Open MPI release. Please help fix this.

The error message from the slave process using MPI_Comm_connect in 4.0.2:


The user has called an operation involving MPI_Comm_connect and/or MPI_Accept that spans multiple invocations of mpirun. This requires the support of the ompi-server tool, which must be executing somewhere that can be accessed by all participants.

Please ensure the tool is running, and provide each mpirun with the MCA parameter "pmix_server_uri" pointing to it.


Your application has invoked an MPI function that is not supported in this environment.

MPI function: MPI_Comm_connect
Reason: Underlying runtime environment does not support accept/connect functionality

[sjoq49:426944] An error occurred in MPI_Comm_connect
[sjoq49:426944] reported by process [3149791233,47184510713856]
[sjoq49:426944] on communicator MPI_COMM_WORLD
[sjoq49:426944] MPI_ERR_INTERN: internal error
[sjoq49:426944] MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[sjoq49:426944]    and potentially your MPI job)

The error message from the master process using MPI_Comm_accept in 4.0.2:


A request has timed out and will therefore fail:

Operation: LOOKUP: orted/pmix/pmix_server_pub.c:345

Your job may terminate as a result of this problem. You may want to adjust the MCA parameter pmix_server_max_wait and try again. If this occurred during a connect/accept operation, you can adjust that time using the pmix_base_exchange_timeout parameter.

[sjoq64:88026] An error occurred in MPI_Comm_accept
[sjoq64:88026] reported by process [1219035137,0]
[sjoq64:88026] on communicator MPI_COMM_WORLD
[sjoq64:88026] MPI_ERR_UNKNOWN: unknown error
[sjoq64:88026] MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[sjoq64:88026]    and potentially your MPI job)

rhc54 commented 4 years ago

@q2luo I'm not sure how to respond to your request. The error message you show indicates that the mpirun starting the slave process was not given the URI of the ompi-server. Cross-mpirun operations require the support of ompi-server as a rendezvous point.

You might want to try it again, ensuring you follow the required steps. If that doesn't work, please post exactly what you did to encounter the problem.
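For reference, the usual sequence looks something like this (the URI file path and application names are placeholders; the URI file must be readable by every mpirun, e.g. on a shared filesystem):

$ ompi-server --report-uri /shared/ompi-server.uri
$ mpirun -n 1 --ompi-server file:/shared/ompi-server.uri ./master_app
$ mpirun -n 1 --ompi-server file:/shared/ompi-server.uri ./slave_app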

q2luo commented 4 years ago

@rhc54 I have been paying attention to the threads related to this same issue since 2015. I have tried many different OpenMPI releases; the last working release is 1.6.5, and all releases 1.7.1 or higher have the same problem. I also tried pointing each mpirun at an ompi-server via the URI, but with no success.

Your May 5, 2017 description at the beginning of this thread describes the issue very well. In fact, the OpenMPI release "list of changes" file also documents it as a known issue in the 3.0 section:

" -MPI_Connect/accept between applications started by different mpirun commands will fail, even if ompi-server is running."

We use OpenMPI in the following way; the example below assumes 8 hosts from LSF:

1) Issue 8 individual LSF "bsub" commands to acquire 8 hosts with the specified amount of resources; each host runs our program (Linux based) with OpenMPI enabled.

2) The program on each host calls MPI_Comm_connect() and MPI_Comm_accept(), then MPI_Intercomm_merge() and MPI_Comm_rank() once the accept succeeds. The goal is to connect all 8 MPI applications into one MPI world (a sketch of this call sequence appears below).

These two steps are meant to realize the same goal as "mpirun -n 8". "mpirun -n 8" works fine for all OpenMPI releases, but the semiconductor industry doesn't allow this usage due to IT policies.
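(Sketch of the call sequence from step 2 above; this is only an illustration, with the out-of-band exchange of the port string and all error handling omitted. The accepting group's root would obtain the port via MPI_Open_port() beforehand.)

/* Fragment, not a complete program: how one newcomer gets folded into the
 * already-merged group. Function names are made up for illustration. */
#include <mpi.h>

/* Called by the group that is already running: absorb one newcomer.
 * `port` comes from MPI_Open_port() on this group's root. */
MPI_Comm absorb_one(MPI_Comm current, const char *port)
{
    MPI_Comm inter, merged;
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, current, &inter);
    MPI_Intercomm_merge(inter, 0, &merged);   /* existing group ranks first ("low") */
    MPI_Comm_disconnect(&inter);
    return merged;                            /* becomes the new "world" */
}

/* Called by a newly started job: join the existing group via its port. */
MPI_Comm join_group(const char *port)
{
    MPI_Comm inter, merged;
    MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
    MPI_Intercomm_merge(inter, 1, &merged);   /* newcomer ranks last ("high") */
    MPI_Comm_disconnect(&inter);
    return merged;
}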

Thanks and regards.

rhc54 commented 4 years ago

Look, I'm happy to help, but you have to provide enough information so I can do so. I need to know how you actually are starting all these programs. Do you have ompi-server running somewhere that all the hosts can reach over TCP? What was the cmd line to start the programs on each host?

I don't know who you mean by "semiconductor industry", but I know of at least one company in that collective that doesn't have this issue 😄 This appears to be a pretty extreme use case, so it isn't surprising that it might uncover some problems.

q2luo commented 4 years ago

Each application is started with "mpirun -n 1" on a host acquired by LSF. I tried in-house starting an ompi-server and letting each individual mpirun point to it, but connect/accept still fails. On the other hand, even if it worked, it would be impractical to use because it would require the company IT department to start and maintain a central ompi-server.

Yes, all hosts can reach each other over TCP: the SSH-based approach via "mpirun -n 8" works, and a single LSF bsub command with "mpirun -n 8" also works.

rhc54 commented 4 years ago

So let me summarize. You tried doing this with a central ompi-server in a lab setup and it didn't work. Regardless, you cannot use a central ompi-server in your production environment.

I can take a look to ensure that ompi-server is still working on v4.0.2. However, without ompi-server, there is no way this configuration can work on your production system. The very old v1.6 series certainly would work, but it involves a runtime that doesn't scale to today's cluster sizes - so going back to that approach isn't an option.

On the positive side, you might get IBM to add PMIx integration to LSF - in which case, you won't need ompi-server any more. Might be your best bet.

q2luo commented 4 years ago

@rhc54 Thanks for your explanation. Even if adding PMIx to the LSF/MPI hook works, the same problem will still be faced for RTDA, SGE/UGE, etc. grids.

The v1.6 series has some serious issues, such as memory corruption, network interface card recognition, etc. All of those issues are fixed in the latest 3.x and 4.x releases, per my testing. Our application normally needs up to 256 hosts, each with physical memory of at least 512GB and up to 3TB. It's normally impossible to acquire 64 such big-memory machines instantly, so "mpirun -n 64" will almost never succeed (unless IT sets aside 64 hosts dedicated for one job/person to use). Instead, the 64 hosts are normally obtained sequentially by 64 independent grid commands, and the time to acquire all 64 machines can span from minutes to hours.

I wonder how connect/accept works in the default case of "mpirun -n 32"? Is ompi-server not used, or is the public connect/accept API not used, in this default mode?

Thanks and regards.

rhc54 commented 4 years ago

The closest MPI comes to really supporting your use-case of the "rolling start" is the MPI Sessions work proposed for v4 of the standard. In the meantime, what I would do is:

This will allow proper wireup of your connect/accept logic. From your description of your scheduler, it shouldn't cause you any additional delays in getting the desired resources. You might even get your IT folks to set up a "high priority" queue for the secondary submission.