open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

prrte: advance sha to 30cadc6746 #12901

Closed hppritcha closed 2 weeks ago

hppritcha commented 2 weeks ago

also advance pmix sha to 4aea550f6f55 to pick up PR https://github.com/openpmix/openpmix/pull/3414

hppritcha commented 2 weeks ago

@dalcinl here you go!

hppritcha commented 2 weeks ago

hmm, something is borked about configuring prrte for some of the jenkins tests

configure:5174: *** Configuring PRRTE
configure:63521: checking if PMIx version is 4.0.0 or greater
configure:63538: gcc -c -O3 -DNDEBUG  -Wundef -Wno-long-long -Wsign-compare -Wmissing-prototypes -Wstrict-prototypes -Wcomment -Wshadow -Werror-implicit-function-declaration -fno-strict-aliasing -pedantic -Wall -Wformat-truncation=0 -finline-functions -mcx16 -I/home/ec2-user/workspace/pen-mpi.pull_request-v2_PR-12901/3rd-party/openpmix/include -I/home/ec2-user/workspace/pen-mpi.pull_request-v2_PR-12901/3rd-party/openpmix/include -I/home/ec2-user/workspace/pen-mpi.pull_request-v2_PR-12901/3rd-party/openpmix/ -I/home/ec2-user/workspace/pen-mpi.pull_request-v2_PR-12901/3rd-party/openpmix/ conftest.c >&5
conftest.c:526:1: warning: function declaration isn't a prototype [-Wstrict-prototypes]
 main ()
 ^~~~
configure:63538: $? = 0
configure:63539: result: yes
configure:63614: ===== configuring 3rd-party/prrte =====
configure:63804: running /bin/sh ./configure --disable-option-checking '--prefix=/home/ec2-user/workspace/pen-mpi.pull_request-v2_PR-12901/install' --enable-prte-ft --with-proxy-version-string=5.1.0a1 --with-proxy-package-name="Open MPI" --with-proxy-bugreport="https://www.open-mpi.org/community/help/" --disable-devel-check --enable-prte-prefix-by-default --disable-pmix-lib-checks --with-pmix-extra-libs="/home/ec2-user/workspace/pen-mpi.pull_request-v2_PR-12901/3rd-party/openpmix/src/libpmix.la" 'CPPFLAGS= -I/home/ec2-user/workspace/pen-mpi.pull_request-v2_PR-12901/3rd-party/openpmix/include -I/home/ec2-user/workspace/pen-mpi.pull_request-v2_PR-12901/3rd-party/openpmix/include -I/home/ec2-user/workspace/pen-mpi.pull_request-v2_PR-12901/3rd-party/openpmix/ -I/home/ec2-user/workspace/pen-mpi.pull_request-v2_PR-12901/3rd-party/openpmix/' --cache-file=/dev/null --srcdir=.
configure:63824: ===== done with 3rd-party/prrte configure =====
configure:63847: error: PRRTE configuration failed.  Cannot continue.

rhc54 commented 2 weeks ago

FWIW: on my PR, it kept complaining about not finding a valid PMIx build. Seemed like some issue with bringing down the PMIx submodule.

hppritcha commented 2 weeks ago

need to figure out what got borked in our prrte fork (we are careful about taking upstream commits but maybe not enough?) before advancing the sha @dalcinl

rhc54 commented 2 weeks ago

@hppritcha I don't believe that is the problem, though I could be wrong. When I tried to check OMPI main against head of upstream master branches, the problem I hit (which looked like the one you have here) was that Amazon kept failing to build the PR because PRRTE couldn't find a valid PMIx installation. Never was able to trace down a reason - looked/felt like Amazon simply couldn't download the PMIx submodule, but I'm not clear as to why that wouldn't have aborted the CI right then. Note that all the other CIs have no problem building it, so it is something unique about the Amazon Jenkins one.

Not sure of the reason - and I'm tied up for the next week. Just noting that it may not have anything to do with the PRRTE code.

hppritcha commented 2 weeks ago

okay now move to a suspicious (in terms of jenkins ci) sha

hppritcha commented 2 weeks ago

okay the problem is the hwloc jenkins CI is using is too old. configure message isn't very clear though. Looks like updating openpmix submodule may help with that.

rhc54 commented 2 weeks ago

> okay the problem is the hwloc jenkins CI is using is too old. configure message isn't very clear though. Looks like updating openpmix submodule may help with that.

Per discussion with OMPI rms, we raised the minimum hwloc version to 2.1
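For anyone who wants to check a CI host against that minimum before running configure, here is a minimal sketch. The helper names `version_ge` and `require_min_version` are hypothetical (they are not part of the OMPI or PMIx build system), and the sketch assumes `sort -V` (GNU coreutils) is available and that versions are given as full dotted strings such as 2.1.0.

```shell
#!/bin/sh
# Hypothetical helpers, not part of OMPI/PMIx configury.

# version_ge A B: succeed when dotted version A is >= dotted version B.
# Assumes `sort -V` (GNU coreutils version sort) exists on the host.
version_ge() {
    # After an ascending version sort, B comes first only when B <= A.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# require_min_version NAME FOUND MIN: report whether FOUND meets MIN.
require_min_version() {
    if version_ge "$2" "$3"; then
        echo "$1 $2 satisfies minimum $3"
    else
        echo "error: $1 $2 is older than required minimum $3" >&2
        return 1
    fi
}
```

Assuming hwloc is discoverable through pkg-config on the host, a pre-flight check could then look like `require_min_version hwloc "$(pkg-config --modversion hwloc)" 2.1.0 || exit 1`.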

hppritcha commented 2 weeks ago

Our configury isn't very smart about failing if PMIx fails to configure; it just trundles on:

configure: WARNING: PMIx requires HWLOC v2.1.0 or above.
configure: error: Please select a supported version and configure again
configure: ===== done with 3rd-party/openpmix configure =====
checking for pmix pkg-config name... pmix
checking if pmix pkg-config module exists... no
checking for pmix wrapper compiler... pmixcc
checking if pmix wrapper compiler works... no
configure: Searching for pmix in default search paths
checking for pmix cppflags... 
checking for pmix ldflags... 
checking for pmix libs... -lpmix
checking for pmix static libs... -lpmix
checking pmix.h usability... no
checking pmix.h presence... no
checking for pmix.h... no
configure: error: Could not find viable pmix build.
+ echo './configure --prefix="/home/ec2-user/workspace/pen-mpi.pull_request-v2_PR-12901/install"  --disable-silent-rules failed, ABORTING !'
./configure --prefix="/home/ec2-user/workspace/pen-mpi.pull_request-v2_PR-12901/install"  --disable-silent-rules failed, ABORTING !
+ test -f config.log
+ echo 'config.log content :'
config.log content :

rhc54 commented 2 weeks ago

Yeah that really confused me - had me chasing my tail 😗

hppritcha commented 2 weeks ago

I notice that, the way the CI scripts work, a configure failure at some point does not simply stop the run; instead the entire config.log is echoed. This can make the actual configure failure a bit tricky to find in some cases.
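One way to make the first failure easier to spot in a large dump is to pull out the earliest `configure: error:` line with its line number. The following is only a sketch: the function name `first_configure_error` is hypothetical, and the pattern assumes error lines look like the ones quoted in this thread (`configure: error: ...` in the console output, `configure:NNNN: error: ...` in config.log).

```shell
#!/bin/sh
# Hypothetical helper: print the first configure error found in a log file,
# prefixed with its line number. Returns non-zero when no error line exists.
first_configure_error() {
    awk '
        /configure(:[0-9]+)?: error:/ {
            print NR ": " $0
            found = 1
            exit            # stop at the first match; END still runs
        }
        END { exit !found } # status 0 only if a match was printed
    ' "$1"
}
```

Run against the console output above, this would point at the PMIx hwloc error rather than the later "Could not find viable pmix build" message, since the hwloc error appears first.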

rhc54 commented 2 weeks ago

@hppritcha I think what's confusing here is that OMPI's configure somehow continues on after the configure in PMIx generates an error due to seeing an HWLOC version that is below the minimum required. I'm not sure how/why the AC_MSG_ERROR is failing to stop the entire process, yet somehow we continue and go on to the PRRTE configure code.

Looking at the autoconf documentation for that macro, I do see this caution:

The error-description should start with a lower-case letter, and “cannot” is preferred to “can't”. 

which we violate on nearly all uses of that macro. It's the only AC_MSG_ macro with that caution - no idea why. Quick test shows that PMIx configure does correctly exit with a non-zero status when HWLOC is too old, so I'm not sure I understand the problem here. Might be worth someone exploring?

hppritcha commented 2 weeks ago

It's a problem with the way the jenkins CI build script handles errors. Like I said above, it just starts echoing all the logs rather than just exiting.

If I run by hand, the behavior is as one would expect: configure dies with an appropriate error message.
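For illustration, a build script can get the same stop-at-the-failure behavior by checking each step's exit status and propagating it. This is a sketch under assumptions, not the actual jenkins script: `run_step` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: run a command and, if it fails, report which step
# failed and hand the original exit status back to the caller.
run_step() {
    label="$1"
    shift
    "$@"
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "$label failed with exit status $status" >&2
        return "$status"
    fi
}
```

A caller would then write something like `run_step configure ./configure --prefix="$PREFIX" --disable-silent-rules || exit 1` (with `$PREFIX` standing in for the install path), so the job ends at the failing step instead of carrying on to dump logs.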