open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Upgrade v2.0.1 to PMIx v1.1.5 #1930

Closed: jsquyres closed this issue 8 years ago

jsquyres commented 8 years ago

Per discussion on the webex this morning, upgrade the PMIx in v2.0.1 to v1.1.5.

IBM stated that there may be some issues with this (i.e., it might be a larger effort than anticipated). IBM: can you fill in here?

gpaulsen commented 8 years ago

We are shipping Spectrum MPI v10.1 (based on Open MPI 2.0.0) using the external PMIx component. We started with PMIx 1.1.2, later applied some patches, then tried upgrading to PMIx 1.1.4, and finally backed off to PMIx 1.1.2 plus some patches because of single-node scalability/performance issues (blamed on PMIx at the time).

When I asked @dsolt for more details about this, he said:

I don't know that we ever did a root cause on why 1.1.2 and 1.1.4 were not compatible, but they were not. I think they both present the same interface to the clients, but the communication between client and server changed somewhere. I just know that every time there is/was a mismatch between MPI and our pmix_server, it would fail with "unpack failure" messages. If Ralph doesn't know, then there isn't much hope.

rhc54 commented 8 years ago

FWIW: I think @dsolt has spent too much time in the sun 😄 I can find no issues with running PMIx 1.1.5rc inside of OMPI v2.0.1. I suspect any problems are due to a stale external integration component in the OPAL pmix framework. I'll have to look at that separately.

jsquyres commented 8 years ago

Also FWIW, Open MPI still requires that the Open MPI version match across all processes in a job. We do not make any guarantees about what happens if you run a job with some processes using Open MPI version X.Y.Z and other processes using Open MPI version A.B.C.

gpaulsen commented 8 years ago

RE: mismatched versions in a job, I'm sure that's not what this was regarding. I want to say that the timing of MPI_Init and/or MPI_Finalize for a single-node job of around 120 ranks per node showed the degradation. Dave, do you remember anything else with the community PMIx code, 1.1.3 or 1.1.4? If not, perhaps just integrate it and move forward.

rhc54 commented 8 years ago

Hmmm... I have no way to test performance at that scale, so any guidance and/or verification would be helpful.

jladd-mlnx commented 8 years ago

@gpaulsen @rhc54 @artpol84 @dsolt On Firestone servers running 176 threads per host, it seems to me that the shared memory dstore component implemented by @elenash would be highly beneficial. That feature isn't available until PMIx 2.x, which will be available in OMPI 2.1.0, if I'm not mistaken.

rhc54 commented 8 years ago

Here are the MTT results for the update to 1.1.5:

+-------------+-----------------+-------------+----------+------+------+----------+------+--------------------------------------------------------------------------+
| Phase       | Section         | MPI Version | Duration | Pass | Fail | Time out | Skip | Detailed report                                                          |
+-------------+-----------------+-------------+----------+------+------+----------+------+--------------------------------------------------------------------------+
| MPI Install | my installation | 2.0.1a1     | 00:01    | 1    |      |          |      | MPI_Install-my_installation-my_installation-2.0.1a1-my_installation.html |
| Test Build  | trivial         | 2.0.1a1     | 00:01    | 1    |      |          |      | Test_Build-trivial-my_installation-2.0.1a1-my_installation.html          |
| Test Build  | ibm             | 2.0.1a1     | 00:46    | 1    |      |          |      | Test_Build-ibm-my_installation-2.0.1a1-my_installation.html              |
| Test Build  | intel           | 2.0.1a1     | 01:21    | 1    |      |          |      | Test_Build-intel-my_installation-2.0.1a1-my_installation.html            |
| Test Build  | java            | 2.0.1a1     | 00:03    | 1    |      |          |      | Test_Build-java-my_installation-2.0.1a1-my_installation.html             |
| Test Build  | orte            | 2.0.1a1     | 00:01    | 1    |      |          |      | Test_Build-orte-my_installation-2.0.1a1-my_installation.html             |
| Test Run    | trivial         | 2.0.1a1     | 00:06    | 6    |      |          |      | Test_Run-trivial-my_installation-2.0.1a1-my_installation.html            |
| Test Run    | ibm             | 2.0.1a1     | 10:48    | 440  |      |          | 3    | Test_Run-ibm-my_installation-2.0.1a1-my_installation.html                |
| Test Run    | spawn           | 2.0.1a1     | 00:08    | 7    |      |          |      | Test_Run-spawn-my_installation-2.0.1a1-my_installation.html              |
| Test Run    | loopspawn       | 2.0.1a1     | 09:44    | 1    |      |          |      | Test_Run-loopspawn-my_installation-2.0.1a1-my_installation.html          |
| Test Run    | intel           | 2.0.1a1     | 24:04    | 474  |      |          | 4    | Test_Run-intel-my_installation-2.0.1a1-my_installation.html              |
| Test Run    | intel_skip      | 2.0.1a1     | 11:47    | 431  |      |          | 47   | Test_Run-intel_skip-my_installation-2.0.1a1-my_installation.html         |
| Test Run    | java            | 2.0.1a1     | 00:01    | 1    |      |          |      | Test_Run-java-my_installation-2.0.1a1-my_installation.html               |
| Test Run    | orte            | 2.0.1a1     | 00:44    | 19   |      |          |      | Test_Run-orte-my_installation-2.0.1a1-my_installation.html               |
+-------------+-----------------+-------------+----------+------+------+----------+------+--------------------------------------------------------------------------+

I agree with @jladd-mlnx. The concern raised here, though, was about degradation when moving from PMIx 1.1.2 to 1.1.5. I'm not sure why that would happen, but I will try to take a look.

gpaulsen commented 8 years ago

My understanding was that somewhere between 1.1.2 and 1.1.4 some functionality was moved either from the client to the server, or vice versa, which exacerbated the problem. PMIx 2.0 with the shared memory dstore should solve this, but I wanted to mention it as a possible downside of going to 1.1.5. The fixes in 1.1.5 might still outweigh this issue in the OMPI 2.0.1 timeframe, though. Thanks for looking into this, Ralph.

rhc54 commented 8 years ago

Aha! I know what it is. In 1.1.2, the fence would bring all the data down to the local procs. Starting in 1.1.3, we stop the data at the PMIx server in order to save memory footprint. So if you want data for a specific proc, you have to query the server to get it. We do bring down all the data for that proc, but that still means one query per peer.

This was done because OMPI doesn't need the data from every proc at startup - we only bring it down on first message, and connectivity is typically sparse. However, PAMI has a little loop that grabs all peer data during init, and that is what causes the problem.

The shared memory dstore will help with that situation, as the data will all be in shared memory and that communication is eliminated. You folks should also address the PAMI issue separately, as that isn't a scalable method going forward.
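
To make the pattern concrete, here is a minimal C sketch of the kind of fetch-every-peer-at-init loop described above, using the standard PMIx_Get client call. This is not the actual PAMI code; the key name "pami.addr" and the helper function are hypothetical, purely for illustration of why one PMIx_Get per peer can turn into one server round trip per peer starting with PMIx 1.1.3:

```c
/* Illustrative sketch only -- not the actual PAMI code.  The key name
 * "pami.addr" is hypothetical; real modex keys differ. */
#include <stdint.h>
#include <string.h>
#include <pmix.h>

static void fetch_all_peers(const char *nspace, uint32_t nprocs)
{
    pmix_proc_t peer;
    pmix_value_t *val = NULL;

    memset(&peer, 0, sizeof(peer));
    (void)strncpy(peer.nspace, nspace, PMIX_MAX_NSLEN);

    for (uint32_t rank = 0; rank < nprocs; rank++) {
        peer.rank = rank;
        /* With PMIx 1.1.2 the preceding fence has already brought this data
         * down to the local proc, so the call is cheap.  From 1.1.3 on, the
         * data stays at the PMIx server, so each call can become a request
         * to the server: one round trip per peer, which adds up at
         * 120-176 ranks per node. */
        if (PMIX_SUCCESS == PMIx_Get(&peer, "pami.addr", NULL, 0, &val)) {
            /* ... consume the value ... */
            PMIX_VALUE_RELEASE(val);
        }
    }
}
```

With the shared memory dstore, the per-peer data is already mapped into each local process, so these lookups no longer require a trip to the server.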

As for the referenced PR: I would suggest we go ahead and accept it, as the scaling concern raised here is an exceptional use case outside OMPI's norm.

gpaulsen commented 8 years ago

Thanks for figuring this out based on my vague recollections. Glad to hear this won't affect the community OMPI.

jsquyres commented 8 years ago

This is addressed in https://github.com/open-mpi/ompi-release/pull/1277.

artpol84 commented 8 years ago

@jsquyres Actually, 1.1.5 doesn't have the dstore, so I'd expect that the mentioned PR won't fix the problem.

jsquyres commented 8 years ago

@artpol84 I think the goal of this PR was just to get the accumulated PMIx bug fixes into the v2.0.x tree. I agree that the shared memory dstore work in PMIx 2.x would be beneficial, but that is probably more appropriate for OMPI v2.1.x.

artpol84 commented 8 years ago

Agree.