dalcinl opened 1 week ago
@hppritcha Maybe this issue comes from #12853?
Highly likely that's the case.
Unsure exactly what was included in that update, but I was immediately suspicious when that PR was filed because it did not include the corresponding update to the PMIx submodule. As previously noted, the two are tightly coupled - so you should never update one without updating the other.
Might not be the cause here, but it is the first thing I would try.
@hppritcha Please let me know once you open a PR. I would like to suggest updating the configuration of oversubscription in this repo so that it is in sync with what I do in mpi4py. I assume that OMPI wants all configuration in `$HOME/.openmpi/mca-params.conf` honored, including that of PRRTE. Am I right?
Hmm, well actually this is not a bug. See https://docs.open-mpi.org/en/v5.0.x/mca.html#mca-parameter-changes-between-open-mpi-4-x-and-5-x. You'll need to add `rmaps_default_mapping_policy = :oversubscribe` to your `$HOME/.prte/mca-params.conf`, at least for v5.0.x and current main.
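For reference, a minimal `$HOME/.prte/mca-params.conf` following the advice above might look like this (a sketch based on the file path and key named in the comment):

```ini
# $HOME/.prte/mca-params.conf
# Default mapping policy with the oversubscribe modifier:
# allow launching more processes than available slots.
rmaps_default_mapping_policy = :oversubscribe
```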
The ompi mpi4py GitHub action is kind of misleading here, so I'll fix it to not use that deprecated MCA param in the `$HOME/.openmpi/mca-params.conf` file.
@hppritcha I suppose this opens the question I noted on your PR. We could register synonyms for the OMPI v4 form of the PRRTE params. Do we want to do so? It would be a little bit of grunge work, but nothing all that bad. It may not be possible for all params, as they may not have a corresponding OMPI v5 version - or the values may conflict so that PRRTE doesn't understand them. But we could pick up a number of them, and certainly the "oversubscribe" one should be doable.
I've added a check to ensure we are actually processing the OMPI default param files - pretty sure we are, but I'll check. Just need to finish something else first.
So the OMPI documentation @hppritcha pointed to tells us to use THREE different locations to configure Open MPI operation. Are you guys sure you want to put that complexity on users?
I must add a clarification: the purpose of the line setting `rmaps_base_oversubscribe = true` is just to be able to test against branch v4.0.x. I know that setting is gone in v5.0.x and main, and that's the reason I also set `rmaps_default_mapping_policy = :oversubscribe`, but all that in the single config file `$HOME/.openmpi/mca-params.conf`.
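In other words, the single config file used by mpi4py CI carries both the v4.x and the v5.x forms, roughly like this (a sketch based on the description above):

```ini
# $HOME/.openmpi/mca-params.conf
# Honored by v4.0.x; deprecated/removed in v5.0.x and main:
rmaps_base_oversubscribe = true
# The v5.0.x/main replacement:
rmaps_default_mapping_policy = :oversubscribe
```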
Until a few days ago, oversubscription seemed to work with all branches, and as of today it is still working in v5.0.x. From what @rhc54 said, I understand that OMPI v5.0.x and main simply ignore the old v4.x setting (is that correct?). Therefore it looks to me like v5.0.x is indeed reading the oversubscription setting from `$HOME/.openmpi/mca-params.conf` (otherwise my tests would fail), while the main branch lost that ability. Is my diagnosis correct?
Guys, I have a somewhat explicit question: are PRRTE and PMIx (when used via/with Open MPI) supposed to read configuration from OMPI's config file `$HOME/.openmpi/mca-params.conf`? If that is the case, then this issue looks like a regression. If that is not the case, then I do not understand why mpi4py CI is not failing with ompi@v5.0.x due to oversubscription not being turned on (as mpi4py IS NOT using `$HOME/.prte/mca-params.conf`).
I have said several times that I will investigate it - just not a high priority at the moment when compared with other things.
I see, @dalcinl. I'll take a further look at what may have caused this change in behavior on our main branch.
So here is the problem. The change came about because of an OMPI reported issue regarding passing of MCA params. We got that fixed. However, what it exposed was that we were overlaying processing of OMPI, PRRTE, and PMIx params on top of each other, thereby causing confusion.
Long story short: the issue here is that `mpirun` doesn't know it is working with an OMPI app until after we have already processed the PRRTE and PMIx params, and after we have already initialized much of PRRTE (opened frameworks, etc). So it is far too late in the process to impact PRRTE settings.
This is why we have three param files - to ensure that the settings are applied before anything happens in their respective projects. I personally don't see a simple way around that one - I suppose you could add param file parsing in OMPI's `mpirun` code before it exec's `prterun`, though that might get a little klunky as you'd have to track the mapping of PRRTE and PMIx param names (e.g., to recognize that `rmaps_base_default_mapping_policy` requires conversion to `PRTE_MCA_rmaps_base_default_mapping_policy`).
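The translation step described above could be sketched roughly as follows. This is hypothetical illustration code, not PRRTE's actual implementation; the set of PRRTE-owned framework names is illustrative and not exhaustive.

```python
# Sketch of the name translation discussed above: params read from an
# OMPI-style file that actually belong to PRRTE must be re-exported with
# the "PRTE_MCA_" prefix so PRRTE's MCA base will see them.
# NOTE: the framework list below is a made-up, partial example.
PRRTE_FRAMEWORKS = ("rmaps", "ras", "plm", "odls")

def translate_param(name: str, value: str) -> tuple[str, str]:
    """Map a bare MCA param name to the env var name a project expects."""
    framework = name.split("_", 1)[0]
    if framework in PRRTE_FRAMEWORKS:
        return ("PRTE_MCA_" + name, value)
    # otherwise treat it as an OMPI/OPAL param
    return ("OMPI_MCA_" + name, value)

def parse_params_file(text: str) -> dict[str, str]:
    """Parse 'key = value' lines, skipping blanks and '#' comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, val = line.partition("=")
        name, value = translate_param(key.strip(), val.strip())
        env[name] = value
    return env
```

For example, a `rmaps_default_mapping_policy` line would come out as a `PRTE_MCA_rmaps_default_mapping_policy` environment variable, while an OMPI-owned param would keep the `OMPI_MCA_` prefix.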
From a user perspective this is mind blowingly complicated and cumbersome. And I'm trying to be polite.
I actually agree - just waiting for someone to propose a realistic solution. Frankly, the whole MCA param thing has been a considerable source of confusion to users over the years - way too many params, far too detailed knowledge of the code base required to really use them. Adding knowledge of which ones to set where does increase that burden.
To recap the problem a bit, in case someone does want to come up with a proposal, imagine the case where someone puts a PRRTE-related value in the OMPI param file, and still has that same param in the PRRTE param file. What precedence should be applied to determining the final value? Which one "wins"?
We could collapse to a single param file, passing its name in the environment to override the project default (e.g., modify OMPI's "mpirun" to pass the name/path to the OMPI param file) - but we'd have to use that name to override looking for any of the other files. Not sure how confusing that will get. If I put something in a PMIx param file, then suddenly that setting disappears because I ran OMPI's "mpirun"?
So I'm open to thoughts - just unsure we really have a "right" answer here.
Afraid that won't work @hppritcha - PRRTE uses the PMIx MCA base. You would have to convert PRRTE to use the OPAL MCA base, which means pulling PRRTE completely inside OMPI ala ORTE...which forces a near-complete rewrite of PRRTE (e.g., every framework would need redo, you'd have to convert to the OPAL object system, etc.). At that point, you might as well go back to ORTE and start from there.
git bisect and code inspection shows that the behavior was changed when we merged the commit from this PR https://github.com/openpmix/prrte/pull/1999 into our fork of prrte.
Not sure whether to consider this a bug or a feature at this point. I guess at a minimum we should document the need to put PRRTE MCA params in the `.prte/mca-params.conf` file when not using environment variables or `mpirun` command line arguments.
Yeah, the problem lies in the required conversion - you need to identify that a param actually belongs to PRRTE so you can correctly prefix it with `PRTE_MCA_` so PRRTE will pick it up. In this case, you need to read the OMPI param file(s) and perform the translation on anything found there.
However, PRRTE doesn't know we are using the OMPI personality, and therefore doesn't know we should read the OMPI param file(s), until after we have populated the MCA system. I can investigate perhaps doing another parse at time of selecting the ompi/schizo component - might be in time for some things, but not early enough for some params to have already taken effect. Have to see how the MCA system would deal with duplicate parsing.
Hmm... well, our prte could be made to know it's doing the ompi personality.
I think you'll find things are more complicated than you realize if/when you try! However, you may find it a valuable learning experience.
I have fixed this issue in the referenced PR. As noted in the commit comment, it isn't a perfect solution - however, we do identify the OMPI personality fairly early in the init process, and so most (maybe the vast majority?) of params of interest to users will be caught by it.
Per the commit message:
This still raises precedence questions. For
now, what this will do is have values in the
OMPI param files give way to corresponding
values specified in PRRTE and PMIx param
files. In other words, values in PMIx and
PRRTE param files are written into the
environment, not overwriting any previously
existing envar. OMPI param files are then
read, and any params that correspond to
PRRTE and PMIx params are written into
the environment - without overwriting
any that already exist.
So if you have a param specified in a
PRRTE param file, and you also specify it
in the OMPI param file, the PRRTE param file
value will "win".
User beware - someone is bound to be very
unhappy.
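The precedence scheme in the commit message can be modeled as a "first writer wins" merge into the environment. A minimal sketch (not the actual PRRTE code; function names are made up for illustration):

```python
# Param files are applied in order, and a later file never overwrites an
# env var already set by an earlier one - mirroring "values in PMIx and
# PRRTE param files are written into the environment, not overwriting any
# previously existing envar", with OMPI files read last.

def setenv_no_overwrite(env: dict, params: dict) -> None:
    for name, value in params.items():
        env.setdefault(name, value)  # keep any existing value

def merge_param_files(prte_params, pmix_params, ompi_params):
    """PRRTE and PMIx files are read first, OMPI files last, so a value
    from a PRRTE/PMIx file 'wins' over a duplicate in an OMPI file."""
    env = {}
    setenv_no_overwrite(env, prte_params)
    setenv_no_overwrite(env, pmix_params)
    setenv_no_overwrite(env, ompi_params)  # only fills in new names
    return env
```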
This really needs to be documented as someone is bound to spend oodles of hours trying to understand why the value they specified somewhere isn't working. You should also note that we still have lost all info as to the source of a param's value - still no idea if/how one might recover that info.
It seems to me that all these issues arise from the fact that we persist in separating the MCA for the different layers. This also forces us to prepend all MCA variables with a repeated text pattern (in @rhc54's PR that's `PRTE_MCA_*`, for an average of 12 bytes per env var), wasting space against the 32k environment limit.
I propose we do the opposite, MCA files are read once (by the different projects in the context of mpirun or batch plugin) and appended, or prepended, to a single environment variable. No wasted space for tagging the project, compact and clean, easy to print and all MCA-based frameworks have the same. The downside is a replication of unnecessary params in processes and projects that don't need them, but as long as there is no name collision (and we know that's not the case) the wasted space will be ridiculously small compared with our current memory footprint for all the MCA variables.
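The single-variable packing proposed above could look roughly like this (a hypothetical encoding; the separator and helper names are made up, and a real design would need an escaping scheme for values containing the separator):

```python
# Sketch of packing all MCA settings into ONE env var instead of one
# "<PROJECT>_MCA_<name>" variable per param, as proposed above.
SEP = ";"  # assumed never to appear in param names or values

def pack_params(params: dict[str, str]) -> str:
    """Serialize params as 'name=value' records joined by SEP."""
    return SEP.join(f"{k}={v}" for k, v in params.items())

def unpack_params(blob: str) -> dict[str, str]:
    """Each project parses the same blob and ignores params it
    doesn't own - the duplication downside mentioned above."""
    out = {}
    for record in blob.split(SEP):
        if record:
            k, _, v = record.partition("=")
            out[k] = v
    return out
```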
Didn't realize I had the power to close an issue over here with a commit in upstream PRRTE, so I'll reopen it.
I have no immediate objection to your proposal. My only initial concerns would be:
<EDITED> Scratch this one - we obviously just give the param the name of "XYZ_MCA" and the value is just the entire string. My bad. <EDITED>
mpi4py's own nightly CI testing on GHA has been failing with ompi@main for the last three days. However, I understand that OMPI's own CI did not fail, otherwise you would not have merged in the regression.
These are the full logs from the last failed run: https://github.com/mpi4py/mpi4py-testing/actions/runs/11310080041/job/31454714552. This is the specific error; note the message `prte-rmaps-base:all-available-resources-used` in the output.

On closer inspection, I noticed a difference between the mpi4py and open-mpi configurations:

mpi4py configures oversubscription via `$HOME/.openmpi/mca-params.conf`: https://github.com/mpi4py/mpi4py-testing/blob/master/.github/workflows/openmpi.yml#L101

Open MPI configures oversubscription in both `$HOME/.openmpi/mca-params.conf` and `$HOME/.prte/mca-params.conf`: https://github.com/open-mpi/ompi/blob/main/.github/workflows/ompi_mpi4py.yaml#L80

Looks like something has changed recently, and oversubscription settings in `$HOME/.openmpi/mca-params.conf` are being ignored. Was this change intentional or is it a regression?