Open dxyzx0 opened 1 year ago
@jjhursey @awlauria Your area - any thoughts?
@dxyzx0 Is the cited git hash at or near the HEAD of the main branch?
EDIT: Oops, I see your reference to "main" later in the text. Can you attach the config.log file from the failed run?
From the output posted it looks like the library found didn't have the symbol needed. The check for ls_info is here.
The config.log should show the error message and compile string that it tried.
If you know where the LSF libraries are located you can try specifying the exact path to --with-lsf and maybe also --with-lsf-libdir (the lib dir is not always at the same level as the other LSF binaries in bin depending on the installation).
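As a minimal sketch of the suggestion above, the two options can be assembled like this. The option names come from this thread; the default paths below are hypothetical placeholders, not the reporter's actual layout, and the command is only printed so it can be reviewed before running.

```shell
# Hypothetical locations: replace with the paths of your own LSF installation.
LSF_PREFIX="${LSF_PREFIX:-/opt/lsf/10.1}"
LSF_LIBDIR="${LSF_LIBDIR:-$LSF_PREFIX/linux2.6-glibc2.3-x86_64/lib}"

# Emit the two LSF-related configure options, one per line.
lsf_configure_args() {
    printf '%s\n' "--with-lsf=$LSF_PREFIX" "--with-lsf-libdir=$LSF_LIBDIR"
}

# Print the command for review rather than running it directly.
echo "./configure $(lsf_configure_args | tr '\n' ' ')"
```

Setting LSF_PREFIX and LSF_LIBDIR in the environment before running the snippet overrides the placeholder defaults.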
Did the lsf configure logic change between v4.1 and v5.0? Perhaps there is a typo in the main/v5.0 branches.
Well, checking prte, the main and v3 logic hasn't changed in 7 months: https://github.com/openpmix/prrte/blob/master/config/prte_check_lsf.m4
> @dxyzx0 Is the cited git hash at or near the HEAD of the main branch?
> EDIT: Oops, I see your reference to "main" later in the text. Can you attach the config.log file from the failed run?
@jsquyres I have uploaded config_failed_for_main.log to the original post.
> From the output posted it looks like the library found didn't have the symbol needed. The check for ls_info is here.
> The config.log should show the error message and compile string that it tried.
> If you know where the LSF libraries are located you can try specifying the exact path to --with-lsf and maybe also --with-lsf-libdir (the lib dir is not always at the same level as the other LSF binaries in bin depending on the installation).
@jjhursey I have tried setting --with-lsf=path_to_include and --with-lsf-libdir=path_to_lib on the main branch. In 4.1.4, the check passes with just --with-lsf.
I think the problem may lie in these lines:
checking for lsf pkg-config name... /home/abc/opt/lib/lsf/pkgconfig/lsf.pc
checking if lsf pkg-config module exists... no
I compared this log with the successful one from 4.1.4, which doesn't have this check, and I can't find this lsf.pc file in my LSF installation folder.
I have also uploaded the successful config_4.1.4.log to the post.
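One way to double-check whether an lsf.pc file exists anywhere under an installation is a plain file search. This is only a sketch: the helper name find_pc_file and the default root /opt/lsf are made up for illustration.

```shell
# List any pkg-config files named <module>.pc below a directory.
find_pc_file() {
    # $1 = directory to search, $2 = module name (for example: lsf)
    find "$1" -type f -name "$2.pc" 2>/dev/null
}

# Hypothetical installation root; no output means no lsf.pc was found.
# "|| true" keeps a missing directory from aborting a strict script.
find_pc_file "${LSF_PREFIX:-/opt/lsf}" lsf || true
```

As noted later in the thread, a missing lsf.pc does not by itself stop configure, which falls back to the paths supplied on the command line.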
> Well, checking prte, the main and v3 logic hasn't changed in 7 months
Quite true (assuming you meant "v5" instead of "v3"), but I believe it has changed significantly from your v4 series.
> Since, I compare the log with the succeeded one in 4.1.4, which doesn't have this check. And I can't find this lsf.pc file in my lsf installation folder.
It didn't find the package config file, but the configure logic did go ahead using your provided settings. It found the required header files, but failed to find the expected function symbol in the library. I think that is the heart of the problem, but I leave it to the IBM folks to resolve.
> > Well, checking prte, the main and v3 logic hasn't changed in 7 months
> Quite true (assuming you meant "v5" instead of "v3"), but I believe it has changed significantly from your v4 series.
Well, I meant PRRTE v3.0, which would be OMPI v5, yes.
Removed the blocker label since there is a documented work-around:
> @jjhursey I have tried to set the --with-lsf=path_to_include and --with-lsf-libdir=path_to_lib in the main branch. In 4.1.4, I can pass the check just with --with-lsf.
@dxyzx0 just to confirm, did using --with-lsf=path_to_include and --with-lsf-libdir=path_to_lib work around your issue with main?
@awlauria No, it still doesn't work. In config_failed_for_main.log you can see that I set these two parameters, but configure still fails.
Thanks - adding the blocker label back.
@nysal do you have cycles to take a look at this?
I have some time to look at this. @dxyzx0 The config.log was truncated, unfortunately. Can you send the 3rd-party/prrte/config.log since that's where it'll check for LSF support?
It'll probably fail around this line - I'm looking for the compile line that follows this section and the error messages that follow (I don't need the program it generated). If you want to post back that section, that'll be fine enough for me to see where and how it is failing.
configure:3529: --- MCA component ess:lsf (m4 configuration macro)
configure:25747: checking for MCA component ess:lsf compile mode
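For readers who want to pull just that section out of a large config.log, a small filter along these lines may help. It is a sketch, not part of Open MPI or PRRTE: the helper name show_check is made up, and it relies on the "failed program was:" marker that configure writes after a failed test.

```shell
# Print everything from the first line containing the given text up to
# (but not including) the "failed program was:" marker.
show_check() {
    # $1 = path to config.log, $2 = text to look for
    awk -v pat="$2" '
        !show && index($0, pat)        { show = 1 }
        show && /failed program was:/  { exit }
        show                           { print }
    ' "$1"
}

# Example call (file name taken from this thread):
# show_check 3rd-party/prrte/config.log "MCA component ess:lsf"
```

If no failure marker follows the matched line, the function prints through to the end of the file, so the output is best piped through a pager.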
@jjhursey Here is the config.log for prrte: config_for_prrte.log
Thanks for the PRRTE config.log. Reviewing the file shows:
configure:26886: checking for ls_info
configure:26886: gcc -o conftest -O3 -DNDEBUG -finline-functions -pthread -I/home/abc/git_repos/ompi/build/3rd-party/libevent-2.1.12-stable -I/home/abc/git_repos/ompi/build/3rd-party/libevent-2.1.12-stable/include -I/home/abc/git_repos/ompi/build/3rd-party/openpmix/include -I/home/abc/git_repos/ompi/3rd-party/openpmix/include -I/home/abc/git_repos/ompi/build/3rd-party/openpmix/ -I/home/abc/git_repos/ompi/3rd-party/openpmix/ -I/home/abc/opt/include -L/home/abc/opt/lib/lsf conftest.c -lrt -lnsl -lm -llsf -lnsl -lrt >&5
/home/abc/opt/lib/lsf/liblsf.so: undefined reference to `pow'
/home/abc/opt/lib/lsf/liblsf.so: undefined reference to `floor'
collect2: error: ld returned 1 exit status
configure:26886: $? = 1
configure: failed program was:
Looking at the configure logic, it did the right thing marking ls_info as failing, which is critical. It went on to check to see if the problem was libevent (which it wasn't) before failing that portion of the configure. Note that the configure logic is the same in v4.1.4.
pow and floor are part of the math library, which I see is already added to the link line (-lm). So you will need to track down what's going on with those undefined symbols.
My suggestion is to take the failed program from the config.log output, and run that gcc command over the file. Then try to figure out why the math library is not being picked up. It could be a command line ordering issue (try putting the -lm towards the front of the compile string - maybe near the -pthread argument).
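To experiment with the ordering idea, the sketch below rewrites a link command so that -lm appears either right after the compiler name or at the very end. The helper relink_variant is hypothetical and only prints the candidate commands. The "back" variant is included as an assumption worth testing, since GNU ld generally resolves libraries left to right and may skip a library listed before the object that needs it.

```shell
# Print a link command with every "-lm" removed and a single "-lm" re-added
# either right after the compiler name ("front") or at the very end ("back").
relink_variant() {
    where="$1"; shift
    compiler="$1"; shift
    rest=""
    for arg in "$@"; do
        # Keep every argument except -lm, preserving the original order.
        [ "$arg" = "-lm" ] || rest="$rest $arg"
    done
    case "$where" in
        front) echo "$compiler -lm$rest" ;;
        back)  echo "$compiler$rest -lm" ;;
    esac
}

# Shortened version of the failing command from the log above.
relink_variant front gcc -o conftest conftest.c -lrt -lnsl -lm -llsf -lnsl -lrt
relink_variant back  gcc -o conftest conftest.c -lrt -lnsl -lm -llsf -lnsl -lrt
```

Each printed line can then be run by hand against the conftest.c saved from config.log to see which ordering, if any, links cleanly.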
I would also search for that same check in the v4.1.4 config.log that you have to see if it is different than the one from main.
FYI: I built Open MPI main on a local machine, and its compile string is similar to yours for that check. It built fine.
@dxyzx0 Just checking in to see if you made any progress on resolving this issue on your system or if you need further assistance.
@dxyzx0 were you able to make progress on this?
@jjhursey @awlauria I solved some of the problems, including the ones mentioned in this comment, by updating all of binutils. I'm working on Ubuntu 16.04, which ships old versions of gcc and ld.
But more errors have appeared:
checking for MCA component ess:lsf compile mode... static
configure: Setting LSF includedir to /nfsshare/lsf10.1/10.1
configure: Setting LSF libdir to /nfsshare/lsf10.1/10.1/linux2.6-glibc2.3-x86_64/lib
checking for library containing yp_all... no
configure: WARNING: Could not find yp_all. Please see https://github.com/openpmix/prrte/wiki/Building-LSF-support for more details.
checking for libevent conflict... No conflict found. -levent is not being explicitly used.
configure: WARNING: LSF support requested (via --with-lsf) but not found.
configure: error: Aborting.
configure: ===== done with 3rd-party/prrte configure =====
configure: error: PRRTE configuration failed. Cannot continue.
I have uploaded PRRTE's config log: prrte_config.log
It looks like the compilation error is looking for a -lsun library on your system. I don't see a reference to that from our build system, so it must be trying to pick it up from some other dependency.
configure:25861: gcc -o conftest -O3 -DNDEBUG -finline-functions -pthread -I/nfsshare/home/pushanwen/git_repos/ompi/build/3rd-party/libevent-2.1.12-stable -I/nfsshare/home/pushanwen/git_repos/ompi/build/3rd-party/libevent-2.1.12-stable/include -I/nfsshare/home/pushanwen/git_repos/ompi/build/3rd-party/hwloc-2.7.1/include -I/nfsshare/home/pushanwen/git_repos/ompi/3rd-party/hwloc-2.7.1/include -I/nfsshare/home/pushanwen/git_repos/ompi/build/3rd-party/openpmix/include -I/nfsshare/home/pushanwen/git_repos/ompi/3rd-party/openpmix/include -I/nfsshare/home/pushanwen/git_repos/ompi/build/3rd-party/openpmix/ -I/nfsshare/home/pushanwen/git_repos/ompi/3rd-party/openpmix/ conftest.c -lsun -lm >&5
/nfsshare/home/pushanwen/anaconda3/envs/cmake/bin/../lib/gcc/x86_64-conda-linux-gnu/12.2.0/../../../../x86_64-conda-linux-gnu/bin/ld: cannot find -lsun: No such file or directory
collect2: error: ld returned 1 exit status
I would try to figure out where that -lsun came from and/or where to get that library.
Any advice on which part of the log to investigate?
I'm also confused by this sun library; I googled it and found nothing.
It looks like the sun library is misleading after looking at the PRRTE configure logic here - which might be trying -lsun to find the yp_all function.
Can you try to install (or see where it might be installed already) the libnsl library (libnsl.so)? Once that's in your LD_LIBRARY_PATH then that check should fix itself (I hope at least).
Unfortunately no.
--- MCA component ess:lsf (m4 configuration macro)
checking for MCA component ess:lsf compile mode... static
configure: Setting LSF includedir to /nfsshare/lsf10.1/10.1
configure: Setting LSF libdir to /nfsshare/lsf10.1/10.1/linux2.6-glibc2.3-x86_64/lib
checking for library containing yp_all... no
configure: WARNING: Could not find yp_all. Please see https://github.com/openpmix/prrte/wiki/Building-LSF-support for more details.
checking for libevent conflict... No conflict found. -levent is not being explicitly used.
configure: WARNING: LSF support requested (via --with-lsf) but not found.
configure: error: Aborting.
configure: ===== done with 3rd-party/prrte configure =====
configure: error: PRRTE configuration failed. Cannot continue.
(cmake) abc@alpha02:~/ompi$ ldconfig -p | grep nsl
libnsl.so.1 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib/x86_64-linux-gnu/libnsl.so.1
libnsl.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /usr/lib/x86_64-linux-gnu/libnsl.so
(cmake) abc@alpha02:~/ompi$
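A small filter over that kind of ldconfig -p output can confirm whether the unversioned libnsl.so name, which is what -lnsl looks for at link time, is listed. This is a sketch with a made-up helper name (has_unversioned_so). Note that ldconfig -p reports the runtime loader cache, and the ld shown in the log earlier lives under an anaconda3 environment, so it may search different directories than the system linker.

```shell
# Reads `ldconfig -p` style output on stdin; succeeds if an unversioned
# "<name>.so" entry is listed (versioned entries such as .so.1 do not count).
has_unversioned_so() {
    grep -Eq "^[[:space:]]*$1\.so[[:space:]]"
}

# Query the live cache only when ldconfig is available on this machine.
if command -v ldconfig >/dev/null 2>&1; then
    if ldconfig -p | has_unversioned_so libnsl; then
        echo "libnsl.so is in the runtime loader cache"
    else
        echo "no unversioned libnsl.so in the runtime loader cache"
    fi
fi
```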
@dxyzx0 Can you send your config.log again (from after you installed libnsl.so)?
That should tell us if configure found libnsl.so, but was unable to find the symbol yp_all in it.
@dxyzx0 depending on the os - you may need libnsl, libnsl2 AND libnsl2-devel.
I don't know if this is the case with Ubuntu - but it is the case with RHEL. Something to check to see if the above are installed on your machine.
Just curious: is this a functioning LSF installation? Just a tad disturbing if so as that implies the configure logic fails when it really shouldn't. If there is something wrong with the installation, then the fact that our configure has problems makes sense and is "acceptable" in that we aren't claiming to work in broken environments.
@rhc54 I don't know if this cluster has a correctly-functioning LSF installation or not, but in the initial description of this issue, and from the config.log files provided, configure was invoked --with-lsf. Hence, configure is correct to fail if the LSF component can't be configured correctly.
Just for the record / for anyone who ends up here via Google: if --with-lsf was not specified on the command line and the LSF component could not be configured correctly, it would be ignored, and configure would continue.
I fear I am not communicating clearly - let me try again. We are now asking this user to install more software packages so that configure can perhaps find what it seeks. If this is a properly working LSF installation, then that feels like the wrong approach - if LSF can work as installed, then we should detect it as installed.
On the other hand, if the user is just pointing us at an LSF package they downloaded, but is not operational, then it is entirely possible that they failed to download all its requirements - and so it is fine that we help identify the missing pieces so that LSF can work, and then our configure can pass.
Does that make sense?
@rhc54 Gotcha.
@rhc54 @jsquyres It's a valid LSF installation. It's my university's LSF cluster, and I have submitted and completed many jobs on it. Here's the output of bacct -u abc:
Accounting information about jobs that are:
- submitted by users abc,
- accounted on all projects.
- completed normally or exited
- executed on all hosts.
- submitted to all queues.
- accounted on all service classes.
------------------------------------------------------------------------------
SUMMARY: ( time unit: second )
Total number of done jobs: 10304 Total number of exited jobs: 21206
Total CPU time consumed: 155969139.8 Average CPU time consumed: 4949.8
Maximum CPU time of a job: 125925304.0 Minimum CPU time of a job: 0.0
Total wait time in queues: 57124640.0
Average wait time in queue: 1812.9
Maximum wait time in queue:82758.0 Minimum wait time in queue: 0.0
Average turnaround time: 2208 (seconds/job)
Maximum turnaround time: 6541481 Minimum turnaround time: 0
Average hog factor of a job: 0.23 ( cpu time / turnaround time )
Maximum hog factor of a job: 55.14 Minimum hog factor of a job: 0.00
Average expansion factor of a job: 50.68 ( turnaround time / run time )
Maximum expansion factor of a job: 82758.00
Minimum expansion factor of a job: 0.00
Total Run time consumed: 12460049 Average Run time consumed: 395
Maximum Run time of a job: 6541480 Minimum Run time of a job: 0
Total throughput: 1.76 (jobs/hour) during17948.63 hours
Beginning time: Jul 30 18:02 Ending time: Aug 17 14:39
> @dxyzx0 Can you send your config.log again (from after you installed libnsl.so)?
> That should tell us if configure found libnsl.so, but was unable to find the symbol yp_all in it.
config_ompi.log config_prrte.log
> @dxyzx0 depending on the os - you may need libnsl, libnsl2 AND libnsl2-devel.
> I don't know if this is the case with Ubuntu - but it is the case with RHEL. Something to check to see if the above are installed on your machine.
I have libnsl, as shown above, and I installed libnsl2 from source. But I don't know how to check whether they include header files.
From config_prrte.log, it didn't find the nsl library:
configure:30436: gcc -o conftest -O3 -DNDEBUG -finline-functions -pthread -I/nfsshare/home/abc/git_repos/openmpi-5.0.0rc9/build/3rd-party/libevent-2.1.12-stable -I/nfsshare/home/abc/git_repos/openmpi-5.0.0rc9/build/3rd-party/libevent-2.1.12-stable/include -I/nfsshare/home/abc/git_repos/openmpi-5.0.0rc9/build/3rd-party/hwloc-2.7.1/include -I/nfsshare/home/abc/git_repos/openmpi-5.0.0rc9/3rd-party/hwloc-2.7.1/include -I/nfsshare/home/abc/git_repos/openmpi-5.0.0rc9/build/3rd-party/openpmix/include -I/nfsshare/home/abc/git_repos/openmpi-5.0.0rc9/3rd-party/openpmix/include -I/nfsshare/home/abc/git_repos/openmpi-5.0.0rc9/build/3rd-party/openpmix/ -I/nfsshare/home/abc/git_repos/openmpi-5.0.0rc9/3rd-party/openpmix/ conftest.c -lnsl -lm >&5
/nfsshare/home/abc/anaconda3/envs/cmake/bin/../lib/gcc/x86_64-conda-linux-gnu/12.2.0/../../../../x86_64-conda-linux-gnu/bin/ld: cannot find -lnsl: No such file or directory
collect2: error: ld returned 1 exit status
configure:30436: $? = 1
Is libnsl.so not in your linker's default search path?
Just for grins, I grep'd PRRTE and found that the only usage of yp_all lies in our LSF configure logic, which states:
62: # liblsf requires yp_all, yp_get_default_domain, and ypprot_err
67: # on RHEL: libnsl, libnsl2 AND libnsl2-devel are required to link libnsl to get yp_all.
My question is: if that were actually true, then how is this an operational LSF environment without any of these being installed? In other words, why are we testing for yp_all if a valid operational LSF environment does not actually require it? Should our configure logic be looking for something else, especially given that we don't use yp_all in PRRTE itself?
I was chatting with @jjhursey the other day about this; this particular configure test originates all the way back in ORTE: 01e62b1994d1c79b2f83da8cce2faa24faeb6042.
Perhaps that comment is now way out of date:
If I remember correctly, @rhc54, you can't even build a stand-alone LSF app without yp_all defined, let alone ompi, even if the app doesn't call it:
[awlauria@c656f7n06 ~]$ gcc submit.c -I/nfs_smpi_ci/LSF_HOME/10.1/include -L/nfs_smpi_ci/LSF_HOME/10.1/linux3.10-glibc2.17-ppc64le/lib -llsf
/tmp/ccrJrG62.o: In function `main':
submit.c:(.text+0x24): undefined reference to `lsb_submit'
/nfs_smpi_ci/LSF_HOME/10.1/linux3.10-glibc2.17-ppc64le/lib/liblsf.so: undefined reference to `yp_get_default_domain'
/nfs_smpi_ci/LSF_HOME/10.1/linux3.10-glibc2.17-ppc64le/lib/liblsf.so: undefined reference to `yp_all'
/nfs_smpi_ci/LSF_HOME/10.1/linux3.10-glibc2.17-ppc64le/lib/liblsf.so: undefined reference to `ypprot_err'
Ah, ok - so these things must be there. We just aren't getting the correct linkages setup, either because they aren't in the user's default library path or because the path(s) aren't being specified somehow/where.
Does liblsf no longer require yp_all...? This is a question that IBM should answer. 😄
Absolutely agree - I'm just poking because (a) this is at base a PRRTE issue and (b) it felt like we were trying to alter the system to make PRRTE's configure pass, instead of altering configure to pass within this system. Just wanted to ensure we didn't fall down that hole blindly.
@rhc54 Understood. I think @awlauria is confirming that liblsf needs libnsl. FWIW, I see in https://github.com/thkukuk/libnsl that yp_all is still used in the code base. So I think we have multiple confirmations here.
So I think I'll repeat my question from https://github.com/open-mpi/ompi/issues/10943#issuecomment-1307620831 (and basically what @rhc54 just asked): Is libnsl.so not in your linker's default search path?
(cmake) abc@alpha02:~/test$ ld -lnsl
ld: cannot find -lnsl: No such file or directory
You're right, libnsl is not in my path.
But how can I add it to the search directories?
I tried adding the path to LIBRARY_PATH, but it did not work.
(cmake) abc@alpha02:~/test$ echo $LIBRARY_PATH
/nfsshare/home/abc/opt/lib:/nfsshare/home/abc/opt/lib::/nfsshare/lsf10.1/10.1/linux2.6-glibc2.3-x86_64/lib:/nfsshare/lsf10.1/10.1/linux2.6-glibc2.3-x86_64/lib:/nfsshare/lsf10.1/10.1/linux2.6-glibc2.3-x86_64/lib:/usr/lib/x86_64-linux-gnu
(cmake) abc@alpha02:~/test$ ls ~/opt/lib/libnsl.*
/nfsshare/home/abc/opt/lib/libnsl.a /nfsshare/home/abc/opt/lib/libnsl.so /nfsshare/home/abc/opt/lib/libnsl.so.3.0.0
/nfsshare/home/abc/opt/lib/libnsl.la /nfsshare/home/abc/opt/lib/libnsl.so.3
Were you using ld -lnsl to see if libnsl.so is in your linker search path? If so, I don't think that's the right command. You can look at the output of ldconfig -v -C /tmp/bogus, for example (and remove the temporary file /tmp/bogus that it creates afterwards) to see if libnsl.so* is somewhere in the search path.
If you need /nfsshare/home/abc/opt/lib to be in the search path, you probably need to add it to the LD_LIBRARY_PATH environment variable, not LIBRARY_PATH. Make sure to do this before you invoke Open MPI's configure, and before any Open MPI command (such as mpirun).
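A rough sketch of that suggestion, together with one way to see which directories the compiler driver itself searches at link time. The directory $HOME/opt/lib stands in for the reporter's path, and list_link_dirs is a made-up helper; gcc -print-search-dirs is a standard GCC option.

```shell
# Prepend a private library directory (placeholder path) to LD_LIBRARY_PATH,
# keeping whatever value was already set.
export LD_LIBRARY_PATH="$HOME/opt/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Split the "libraries: =dir1:dir2" line printed by `gcc -print-search-dirs`
# into one directory per line.
list_link_dirs() {
    sed -n 's/^libraries: =//p' | tr ':' '\n'
}

# Only query the compiler when one is actually installed.
if command -v gcc >/dev/null 2>&1; then
    gcc -print-search-dirs | list_link_dirs
fi
```

If the directory holding libnsl.so is absent from the printed list, the -lnsl test in configure is unlikely to find it without an explicit -L flag or a similar setting.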
Is there an update on this - should this be closed? This looks like a configuration issue, removing from the v5.0.0 blocking list unless someone feels strongly otherwise.
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
a992820
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
configure from git clone
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
 9f297ced5093868a4d3138ece0df1822348531dd 3rd-party/openpmix (v1.1.3-3648-g9f297ce)
 004d0e8f52bcdafe124eb56f10c2e6c3430cfe7f 3rd-party/prrte (psrvr-v2.0.0rc1-4492-g004d0e8)
+51f3f7de884049c880f45144ae4a63eb6f66f4e4 config/oac (heads/main)
Please describe the system on which you are running
Details of the problem
Configuring with the '--with-lsf' option fails on the master branch and with the 5.0.0rc8 tarball, but succeeds with the 4.1.4 tarball. My command:
The error log on the master branch or with 5.0.0rc8:
But it succeeds with 4.1.4.
I have uploaded config_failed_for_main.log as requested in https://github.com/open-mpi/ompi/issues/10943#issuecomment-1282566587: config_failed_for_main.log
I have also uploaded the successful log: config_4.1.4.log