open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

mpi4py: Regression in GitHub Actions #12857

Open dalcinl opened 1 week ago

dalcinl commented 1 week ago

mpi4py's own nightly CI testing on GHA has been failing with ompi@main for the last three days. However, I understand that OMPI's own CI has not failed, otherwise you would not have merged the regression.

These are the full logs from the last failed run: https://github.com/mpi4py/mpi4py-testing/actions/runs/11310080041/job/31454714552

This is the specific error; note the message prte-rmaps-base:all-available-resources-used in the output:

test_async_error_callback (test_util_pool.TestProcessPool.test_async_error_callback) ... 1 more process has sent help message help-prte-rmaps-base.txt / prte-rmaps-base:all-available-resources-used
Exception in thread Thread-7 (_manager_spawn):
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.12.7/x64/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
    self.run()
  File "/opt/hostedtoolcache/Python/3.12.7/x64/lib/python3.12/threading.py", line 1012, in run
    self._target(*self._args, **self._kwargs)
  File "/home/runner/work/mpi4py-testing/mpi4py-testing/build/lib.linux-x86_64-cpython-312/mpi4py/futures/_core.py", line 350, in _manager_spawn
    comm = serialized(client_spawn)(pyexe, pyargs, nprocs, info)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/mpi4py-testing/mpi4py-testing/build/lib.linux-x86_64-cpython-312/mpi4py/futures/_core.py", line 1058, in client_spawn
    comm = MPI.COMM_SELF.Spawn(python_exe, args, max_workers, info)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src/mpi4py/MPI.src/Comm.pyx", line 2544, in mpi4py.MPI.Intracomm.Spawn
    with nogil: CHKERR( MPI_Comm_spawn(
mpi4py.MPI.Exception: MPI_ERR_UNKNOWN: unknown error
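For context, a minimal call that exercises the same spawn path (a hedged sketch, not the actual test from test_util_pool; it assumes a runner with fewer free slots than maxprocs and no oversubscription in effect):

mpiexec -n 1 python -c "from mpi4py import MPI; MPI.COMM_SELF.Spawn('python', ['-c', 'from mpi4py import MPI'], maxprocs=5)"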

On closer inspection, I noticed a difference between mpi4py and open-mpi configuration:

mpi4py configures oversubscription via $HOME/.openmpi/mca-params.conf https://github.com/mpi4py/mpi4py-testing/blob/master/.github/workflows/openmpi.yml#L101

Open MPI configures oversubscription in both $HOME/.openmpi/mca-params.conf and $HOME/.prte/mca-params.conf https://github.com/open-mpi/ompi/blob/main/.github/workflows/ompi_mpi4py.yaml#L80
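For reference, a minimal sketch of the two setups (key names taken from the comments further down; the exact steps are in the linked workflows):

# mpi4py-style: everything in the single OMPI file
mkdir -p "$HOME/.openmpi"
cat >> "$HOME/.openmpi/mca-params.conf" <<'EOF'
rmaps_base_oversubscribe = true
rmaps_default_mapping_policy = :oversubscribe
EOF

# ompi-workflow-style: the PRRTE-side key additionally goes in a separate PRRTE file
mkdir -p "$HOME/.prte"
echo 'rmaps_default_mapping_policy = :oversubscribe' >> "$HOME/.prte/mca-params.conf"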

Looks like something has changed recently, and oversubscription settings in $HOME/.openmpi/mca-params.conf are being ignored. Was this change intentional or is it a regression?

dalcinl commented 1 week ago

@hppritcha Maybe this issue comes from #12853?

hppritcha commented 1 week ago

Highly likely that's the case.

rhc54 commented 1 week ago

Unsure exactly what was included in that update, but I was immediately suspicious when that PR was filed because it did not include the corresponding update to the PMIx submodule. As previously noted, the two are tightly coupled - so you should never update one without updating the other.

Might not be the cause here, but it is the first thing I would try.

dalcinl commented 6 days ago

@hppritcha Please let me know once you open a PR. I would like to suggest updating the configuration of oversubscription in this repo so that it is in sync with what I do in mpi4py. I assume that OMPI wants all configuration in $HOME/.openmpi/mca-params.conf honored, including that of PRRTE. Am I right?

hppritcha commented 5 days ago

Hmm. well actually this is not a bug. See https://docs.open-mpi.org/en/v5.0.x/mca.html#mca-parameter-changes-between-open-mpi-4-x-and-5-x . You'll need to add rmaps_default_mapping_policy = :oversubscribe to your $HOME/.prte/mca-params.conf, at least for v5.0.x and current main.
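For completeness, the same policy can also be set per run instead of via the file (a sketch; --map-by with the :OVERSUBSCRIBE modifier is the v5 command-line form, and PRRTE MCA params can be set in the environment with the PRTE_MCA_ prefix):

mpirun --map-by :OVERSUBSCRIBE -n 8 ./a.out
PRTE_MCA_rmaps_default_mapping_policy=:oversubscribe mpirun -n 8 ./a.out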

The ompi mpi4py GitHub action is kind of misleading here, so I'll fix it to not use that deprecated MCA param in the $HOME/.openmpi/mca-params.conf file.

rhc54 commented 5 days ago

@hppritcha I suppose this opens the question I noted on your PR. We could register synonyms for the OMPI v4 form of the PRRTE params. Do we want to do so? It would be a little bit of grunge work, but nothing all that bad. It may not be possible for all params, as they may not have a corresponding OMPI v5 version - or the values may conflict so that PRRTE doesn't understand them. But we could pick up a number of them, and certainly the "oversubscribe" one should be doable.

I've added a check to ensure we are actually processing the OMPI default param files - pretty sure we are, but I'll check. Just need to finish something else first.

dalcinl commented 4 days ago

So the OMPI documentation @hppritcha pointed to warns us to use THREE different locations to configure Open MPI operation. Are you guys sure you want to put that complexity on users?

I must add a clarification: the purpose of the line setting rmaps_base_oversubscribe = true is just to be able to test against the v4.0.x branch. I know that setting is gone in v5.0.x and main, and that's the reason I also set rmaps_default_mapping_policy = :oversubscribe - but all of that in the single config file $HOME/.openmpi/mca-params.conf.

Until a few days ago, oversubscription seemed to work with all branches, and as of today it is still working with v5.0.x. From what @rhc54 said, I understand that OMPI v5.0.x and main simply ignore the old v4.x setting (is that correct?). Therefore it looks to me like v5.0.x is indeed reading the oversubscription setting from $HOME/.openmpi/mca-params.conf (otherwise my tests would fail), while the main branch lost that ability. Is my diagnosis correct?

dalcinl commented 4 days ago

Guys, I have a somewhat explicit question. Are PRRTE and PMIx (when used via/with Open MPI) supposed to read configuration from OMPI's config file $HOME/.openmpi/mca-params.conf? If that is the case, then this issue looks like a regression. If that is not the case, then I do not understand why mpi4py CI is not failing with ompi@v5.0.x due to oversubscription not being turned on (as mpi4py IS NOT using $HOME/.prte/mca-params.conf).

rhc54 commented 4 days ago

I have said several times that I will investigate it - just not a high priority at the moment when compared with other things.

hppritcha commented 4 days ago

I see, @dalcinl. I'll take a further look at what may have caused this change in behavior on our main branch.

rhc54 commented 4 days ago

So here is the problem. The change came about because of an OMPI reported issue regarding passing of MCA params. We got that fixed. However, what it exposed was that we were overlaying processing of OMPI, PRRTE, and PMIx params on top of each other, thereby causing confusion.

Long story short: the issue here is that mpirun doesn't know it is working with an OMPI app until after we have already processed the PRRTE and PMIx params, and after we have already initialized much of PRRTE (opened frameworks, etc). So it is far too late in the process to impact PRRTE settings.

This is why we have three param files - to ensure that the settings are applied before anything happens in their respective projects. I personally don't see a simple way around that one - I suppose you could add param file parsing in OMPI's mpirun code before it exec's prterun, though that might get a little clunky as you'd have to track the mapping of PRRTE and PMIx param names (e.g., to recognize that "rmaps_base_default_mapping_policy" requires conversion to "PRTE_MCA_rmaps_base_default_mapping_policy").
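A deliberately naive sketch of what such a pre-exec translation might look like (hypothetical; a real implementation would have to know which keys actually belong to PRRTE rather than to OMPI or PMIx, which is exactly the bookkeeping described above):

# hypothetical: blindly turn OMPI param-file entries into PRRTE-prefixed envars
while IFS='=' read -r key val; do
    key=$(echo "$key" | xargs); val=$(echo "$val" | xargs)
    case "$key" in ''|\#*) continue ;; esac
    export "PRTE_MCA_${key}=${val}"
done < "$HOME/.openmpi/mca-params.conf"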

bosilca commented 4 days ago

From a user perspective this is mind blowingly complicated and cumbersome. And I'm trying to be polite.

rhc54 commented 4 days ago

I actually agree - just waiting for someone to propose a realistic solution. Frankly, the whole MCA param thing has been a considerable source of confusion to users over the years - way too many params, far too detailed knowledge of the code base required to really use them. Adding knowledge of which ones to set where does increase that burden.

To recap the problem a bit, in case someone does want to come up with a proposal, imagine the case where someone puts a PRRTE-related value in the OMPI param file, and still has that same param in the PRRTE param file. What precedence should be applied to determining the final value? Which one "wins"?

We could collapse to a single param file, passing its name in the environment to override the project default (e.g., modify OMPI's "mpirun" to pass the name/path to the OMPI param file) - but we'd have to use that name to override looking for any of the other files. Not sure how confusing that will get. If I put something in a PMIx param file, then suddenly that setting disappears because I ran OMPI's "mpirun"?

So I'm open to thoughts - just unsure we really have a "right" answer here.

hppritcha commented 4 days ago

https://github.com/open-mpi/prrte/issues/17

rhc54 commented 4 days ago

Afraid that won't work @hppritcha - PRRTE uses the PMIx MCA base. You would have to convert PRRTE to use the OPAL MCA base, which means pulling PRRTE completely inside OMPI ala ORTE...which forces a near-complete rewrite of PRRTE (e.g., every framework would need redo, you'd have to convert to the OPAL object system, etc.). At that point, you might as well go back to ORTE and start from there.

hppritcha commented 4 days ago

git bisect and code inspection shows that the behavior was changed when we merged the commit from this PR https://github.com/openpmix/prrte/pull/1999 into our fork of prrte.

Not sure whether to consider this a bug or a feature at this point. I guess at a minimum we should document the need to put PRRTE MCA params in $HOME/.prte/mca-params.conf when not using environment variables or mpirun command-line arguments.

rhc54 commented 4 days ago

Yeah, the problem lies in the required conversion - need to identify that a param actually belongs to PRRTE so you can correctly prefix it with PRTE_MCA_ so PRRTE will pick it up. In this case, you need to read the OMPI param file(s) and perform the translation on anything found there.

However, PRRTE doesn't know we are using the OMPI personality, and therefore doesn't know we should read the OMPI param file(s), until after we have populated the MCA system. I can investigate perhaps doing another parse at the time of selecting the ompi/schizo component - that might be in time for some things, but some params will already have taken effect by then. I'll have to see how the MCA system would deal with duplicate parsing.

hppritcha commented 4 days ago

Hmm... well, our prte could be made to know it's doing the ompi personality.

rhc54 commented 4 days ago

I think you'll find things are more complicated than you realize if/when you try! However, you may find it a valuable learning experience.

I have fixed this issue in the referenced PR. As noted in the commit comment, it isn't a perfect solution - however, we do identify the OMPI personality fairly early in the init process, and so most (maybe the vast majority?) of params of interest to users will be caught by it.

Per the commit message:

This still raises precedence questions. For
now, what this will do is have values in the
OMPI param files give way to corresponding
values specified in PRRTE and PMIx param
files. In other words, values in PMIx and
PRRTE param files are written into the
environment, not overwriting any previously
existing envar. OMPI param files are then
read, and any params that correspond to
PRRTE and PMIx params are written into
the environment - without overwriting
any that already exist.

So if you have a param specified in a
PRRTE param file, and you also specify it
in the OMPI param file, the PRRTE param file
value will "win".

User beware - someone is bound to be very
unhappy.
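Concretely, with hypothetical values (following the precedence described in the commit message above):

# hypothetical conflict between the two files:
echo 'rmaps_default_mapping_policy = :oversubscribe' >> "$HOME/.prte/mca-params.conf"
echo 'rmaps_default_mapping_policy = ppr:1:core'     >> "$HOME/.openmpi/mca-params.conf"
# with the referenced fix, the PRRTE-file value wins: the job maps with :oversubscribe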

This really needs to be documented as someone is bound to spend oodles of hours trying to understand why the value they specified somewhere isn't working. You should also note that we still have lost all info as to the source of a param's value - still no idea if/how one might recover that info.

bosilca commented 3 days ago

It seems to me that all these issues arise from the fact that we persist in separating the MCA params for the different layers. This also forces us to prepend every MCA param with a repeated text pattern (in @rhc54's PR that's PRTE_MCA_*, an average of 12 bytes per envar), wasting space against the 32k environment limit.

I propose we do the opposite: MCA files are read once (by the different projects, in the context of mpirun or the batch plugin) and appended, or prepended, to a single environment variable. No wasted space for tagging the project, compact and clean, easy to print, and all MCA-based frameworks see the same thing. The downside is a replication of unnecessary params in processes and projects that don't need them, but as long as there is no name collision (and we know that's not the case) the wasted space will be ridiculously small compared with our current memory footprint for all the MCA variables.
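A rough sketch of the shape of such a variable (the name and separator here are placeholders; a real design would also need an escaping rule for values containing the separator):

# hypothetical: all MCA settings, from every project, carried in one envar
export XYZ_MCA='rmaps_default_mapping_policy=:oversubscribe;btl_tcp_if_include=eth0'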

rhc54 commented 3 days ago

Didn't realize I had the power to close an issue over here with a commit in upstream PRRTE, so I'll reopen it.

I have no immediate objection to your proposal. My only initial concerns would be:

  1. with no prefix, how do we avoid collisions with envars from other projects or set by users? In other words, if we create an envar that looks like "btl_tcp_foo:oob_base_bar" - what prevents us from colliding with someone's random variable? Or maybe we just prefix the entire string with a single "XYZ_MCA:"? <EDITED> Scratch this one - we obviously just give the param the name of "XYZ_MCA" and the value is just the entire string. My bad. <EDITED>
  2. still begs the problem of precedence as I parse this new param - when I find multiple occurrences of "rmaps_base_foo", which one do I use?
  3. in looking at the environment, I only generally see about 10-15 project-related envars. So even though we have a lot of potential entries, we realistically only are using about 150 bytes for envar prefixes.