open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

./configure failed with option '--with-lsf' #10943

Open dxyzx0 opened 1 year ago

dxyzx0 commented 1 year ago

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

a992820

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

configure from git clone

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

9f297ced5093868a4d3138ece0df1822348531dd 3rd-party/openpmix (v1.1.3-3648-g9f297ce)
004d0e8f52bcdafe124eb56f10c2e6c3430cfe7f 3rd-party/prrte (psrvr-v2.0.0rc1-4492-g004d0e8)
+51f3f7de884049c880f45144ae4a63eb6f66f4e4 config/oac (heads/main)

Please describe the system on which you are running


Details of the problem

I tried to configure with the '--with-lsf' option. It fails on the master branch and with the 5.0.0rc8 tarball, but succeeds with the 4.1.4 tarball. My command:

shell$ ../configure --with-lsf

The error log in master branch or 5.0.0rc8:

--- MCA component ess:lsf (m4 configuration macro)
checking for MCA component ess:lsf compile mode... static
checking for library containing yp_all... -lnsl
checking for library containing shm_open... -lrt
checking for lsf pkg-config name... /home/abc/opt/lib/lsf/pkgconfig/lsf.pc
checking if lsf pkg-config module exists... no
checking for lsf header at /home/abc/opt/include... found
checking for lsf library (lsf) in /home/abc/opt/lib/lsf... found
checking for lsf cppflags... -I/home/abc/opt/include
checking for lsf ldflags... -L/home/abc/opt/lib/lsf
checking for lsf libs... -llsf -lnsl -lrt
checking for lsf static libs... -llsf -lnsl -lrt
checking lsf/lsf.h usability... yes
checking lsf/lsf.h presence... yes
checking for lsf/lsf.h... yes
checking for ls_info... no
checking for libevent conflict... No conflict found. -levent is not being explicitly used.
configure: WARNING: LSF support requested (via --with-lsf) but not found.
configure: error: Aborting.
configure: ===== done with 3rd-party/prrte configure =====
configure: error: PRRTE configuration failed.  Cannot continue.

The same configure succeeds with 4.1.4.

I uploaded config_failed_for_main.log, as requested in https://github.com/open-mpi/ompi/issues/10943#issuecomment-1282566587: config_failed_for_main.log

I also uploaded the log from the successful run: config_4.1.4.log

rhc54 commented 1 year ago

@jjhursey @awlauria Your area - any thoughts?

jsquyres commented 1 year ago

@dxyzx0 Is the cited git hash at or near the HEAD of the main branch?

EDIT: Oops, I see your reference to "main" later in the text. Can you attach the config.log file from the failed run?

jjhursey commented 1 year ago

From the output posted it looks like the library found didn't have the symbol needed. The check for ls_info is here.

The config.log should show the error message and compile string that it tried.

If you know where the LSF libraries are located you can try specifying the exact path to --with-lsf and maybe also --with-lsf-libdir (the lib dir is not always at the same level as the other LSF binaries in bin depending on the installation).
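For reference, a minimal sketch of what such an invocation might look like. The paths are read off the configure output in the original post and are assumptions about that particular installation; `lsf_configure_flags` is a hypothetical helper for illustration, not part of Open MPI.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): print the two LSF-related
# configure flags for a given installation layout.
#   $1: LSF prefix, i.e. the directory that contains include/lsf/lsf.h
#   $2: directory that contains liblsf.so
lsf_configure_flags() {
    printf '%s %s\n' "--with-lsf=$1" "--with-lsf-libdir=$2"
}

# Paths below are assumptions based on the log in the original post.
lsf_configure_flags /home/abc/opt /home/abc/opt/lib/lsf

# To pass them to configure from a build directory:
#   ../configure $(lsf_configure_flags /home/abc/opt /home/abc/opt/lib/lsf)
```

Keeping the prefix and the library directory as separate arguments mirrors the point above: the library directory is not always directly under the prefix.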

awlauria commented 1 year ago

Did the lsf configure logic change between v4.1 and v5.0? Perhaps there is a typo in the main/v5.0 branches.

awlauria commented 1 year ago

Well, checking prte, the main and v3 logic hasn't changed in 7 months: https://github.com/openpmix/prrte/blob/master/config/prte_check_lsf.m4

dxyzx0 commented 1 year ago

> @dxyzx0 Is the cited git hash at or near the HEAD of the main branch?
>
> EDIT: Oops, I see your reference to "main" later in the text. Can you attach the config.log file from the failed run?

@jsquyres I uploaded config_failed_for_main.log to the original post.

> From the output posted it looks like the library found didn't have the symbol needed. The check for ls_info is here.
>
> The config.log should show the error message and compile string that it tried.
>
> If you know where the LSF libraries are located you can try specifying the exact path to --with-lsf and maybe also --with-lsf-libdir (the lib dir is not always at the same level as the other LSF binaries in bin depending on the installation).

@jjhursey I have tried setting --with-lsf=path_to_include and --with-lsf-libdir=path_to_lib on the main branch. With 4.1.4, the check passes with just --with-lsf. I think the problem may lie in these lines:

checking for lsf pkg-config name... /home/abc/opt/lib/lsf/pkgconfig/lsf.pc
checking if lsf pkg-config module exists... no

I compared the log with the successful one from 4.1.4, which doesn't have this check, and I can't find an lsf.pc file in my LSF installation folder. I also uploaded the successful config_4.1.4.log to the original post.

rhc54 commented 1 year ago

> Well, checking prte, the main and v3 logic hasn't changed in 7 months

Quite true (assuming you meant "v5" instead of "v3")- but I believe it has changed significantly from your v4 series.

> I compared the log with the successful one from 4.1.4, which doesn't have this check, and I can't find an lsf.pc file in my LSF installation folder.

It didn't find the package config file, but the configure logic did go ahead using your provided settings. It found the required header files, but failed to find the expected function symbol in the library. I think that is the heart of the problem, but I leave it to the IBM folks to resolve.

awlauria commented 1 year ago

> Well, checking prte, the main and v3 logic hasn't changed in 7 months
>
> Quite true (assuming you meant "v5" instead of "v3")- but I believe it has changed significantly from your v4 series.

Well I meant prrte v3.0, which would be ompi v5 yes.

awlauria commented 1 year ago

Removed the blocker label since there is a documented work-around:

@dxyzx0 just to confirm, did using --with-lsf=path_to_include and --with-lsf-libdir=path_to_lib work-around your issue with main?

dxyzx0 commented 1 year ago

@awlauria No, it still does not work. In config_failed_for_main.log you can see that I set these two parameters, and configure still fails.

awlauria commented 1 year ago

Thanks - adding the blocker label back.

awlauria commented 1 year ago

@nysal do you have cycles to take a look at this?

jjhursey commented 1 year ago

I have some time to look at this. @dxyzx0 The config.log was truncated, unfortunately. Can you send the 3rd-party/prrte/config.log since that's where it'll check for LSF support?

It'll probably fail around this line - I'm looking for the compile line that follows this section and the error messages that follow (I don't need the program it generated). If you want to post back that section, that'll be fine enough for me to see where and how it is failing.

configure:3529: --- MCA component ess:lsf (m4 configuration macro)
configure:25747: checking for MCA component ess:lsf compile mode
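One way to pull such a section out of a large config.log is sketched below. `show_configure_check` is a hypothetical helper name, and the default of 15 context lines is an arbitrary assumption; it relies only on grep's -n and -A options.

```shell
#!/bin/sh
# Hypothetical helper: print the line where configure starts a given check,
# plus the lines that follow it (compile command and error messages).
#   $1: name of the checked symbol or header, e.g. ls_info
#   $2: path to config.log
#   $3: number of context lines to print (optional, defaults to 15)
show_configure_check() {
    grep -n -A "${3:-15}" -- "checking for $1" "$2"
}

# Example (the path is an assumption about the build tree layout):
#   show_configure_check ls_info 3rd-party/prrte/config.log
```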

dxyzx0 commented 1 year ago

@jjhursey Here is the config.log for prrte. config_for_prrte.log

jjhursey commented 1 year ago

Thanks for the PRRTE config.log. Reviewing the file shows:

configure:26886: checking for ls_info
configure:26886: gcc -o conftest -O3 -DNDEBUG  -finline-functions -pthread  -I/home/abc/git_repos/ompi/build/3rd-party/libevent-2.1.12-stable -I/home/abc/git_repos/ompi/build/3rd-party/libevent-2.1.12-stable/include -I/home/abc/git_repos/ompi/build/3rd-party/openpmix/include -I/home/abc/git_repos/ompi/3rd-party/openpmix/include -I/home/abc/git_repos/ompi/build/3rd-party/openpmix/ -I/home/abc/git_repos/ompi/3rd-party/openpmix/ -I/home/abc/opt/include   -L/home/abc/opt/lib/lsf conftest.c -lrt -lnsl -lm   -llsf -lnsl -lrt >&5
/home/abc/opt/lib/lsf/liblsf.so: undefined reference to `pow'
/home/abc/opt/lib/lsf/liblsf.so: undefined reference to `floor'
collect2: error: ld returned 1 exit status
configure:26886: $? = 1
configure: failed program was:

Looking at the configure logic it did the right thing marking ls_info as failing which is critical. It went on to check to see if the problem was libevent (which it wasn't) before failing that portion of the configure. Note that the configure logic is the same in v4.1.4.

pow and floor are part of the math library which I see is already added to the link line (-lm). So you will need to track down what's going on with those undefined symbols.

My suggestion is to take the failed program from the config.log output, and run that gcc command over the file. Then try to figure out why the math library is not being picked up. It could be a command line ordering issue (try putting the -lm towards the front of the compile string - maybe near the -pthread argument).

I would also search for that same check in the v4.1.4 config.log that you have to see if it is different than the one from main.

FYI: I built Open MPI main on a local machine, and its compile string is similar to yours for that check. It built fine.
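A further ordering variant that may be worth trying, offered as an assumption rather than a diagnosis: in the failing command -lm appears before -llsf, and linkers that default to --as-needed can drop a library that nothing has required yet at the point where it is seen. The sketch below prints a link command with -lm placed after -llsf. `relink_command` is a hypothetical helper, and conftest.c stands for the failed program copied out of config.log.

```shell
#!/bin/sh
# Hypothetical helper: print a link command for the conftest.c saved from
# config.log, with -lm placed after -llsf (the reverse of the failing order).
#   $1: directory that contains liblsf.so
#   $2: LSF include directory
relink_command() {
    printf 'gcc -o conftest -pthread -I%s -L%s conftest.c -llsf -lnsl -lrt -lm\n' "$2" "$1"
}

# Paths are the ones shown in the failing command above.
relink_command /home/abc/opt/lib/lsf /home/abc/opt/include

# Run it once conftest.c has been copied out of config.log:
#   eval "$(relink_command /home/abc/opt/lib/lsf /home/abc/opt/include)"
```

If this variant links while the original order does not, the problem is in the toolchain's handling of link order rather than in the LSF installation.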

jjhursey commented 1 year ago

@dxyzx0 Just checking in to see if you made any progress on resolving this issue on your system or if you need further assistance.

awlauria commented 1 year ago

@dxyzx0 were you able to make progress on this?

dxyzx0 commented 1 year ago

@jjhursey @awlauria I solved some of the problems, including the ones mentioned in this comment, by updating binutils. I'm working on Ubuntu 16.04, which ships an old gcc and ld. But now there are more errors:

checking for MCA component ess:lsf compile mode... static
configure: Setting LSF includedir to /nfsshare/lsf10.1/10.1
configure: Setting LSF libdir to /nfsshare/lsf10.1/10.1/linux2.6-glibc2.3-x86_64/lib
checking for library containing yp_all... no
configure: WARNING: Could not find yp_all. Please see https://github.com/openpmix/prrte/wiki/Building-LSF-support for more details.
checking for libevent conflict... No conflict found. -levent is not being explicitly used.
configure: WARNING: LSF support requested (via --with-lsf) but not found.
configure: error: Aborting.
configure: ===== done with 3rd-party/prrte configure =====
configure: error: PRRTE configuration failed.  Cannot continue.

I uploaded PRRTE's config log: prrte_config.log

jjhursey commented 1 year ago

It looks like the compilation error is looking for a -lsun library on your system. I don't see a reference to that from our build system, so it must be trying to pick it up from some other dependency.

configure:25861: gcc -o conftest -O3 -DNDEBUG  -finline-functions -pthread  -I/nfsshare/home/pushanwen/git_repos/ompi/build/3rd-party/libevent-2.1.12-stable -I/nfsshare/home/pushanwen/git_repos/ompi/build/3rd-party/libevent-2.1.12-stable/include -I/nfsshare/home/pushanwen/git_repos/ompi/build/3rd-party/hwloc-2.7.1/include -I/nfsshare/home/pushanwen/git_repos/ompi/3rd-party/hwloc-2.7.1/include -I/nfsshare/home/pushanwen/git_repos/ompi/build/3rd-party/openpmix/include -I/nfsshare/home/pushanwen/git_repos/ompi/3rd-party/openpmix/include -I/nfsshare/home/pushanwen/git_repos/ompi/build/3rd-party/openpmix/ -I/nfsshare/home/pushanwen/git_repos/ompi/3rd-party/openpmix/   conftest.c -lsun  -lm   >&5
/nfsshare/home/pushanwen/anaconda3/envs/cmake/bin/../lib/gcc/x86_64-conda-linux-gnu/12.2.0/../../../../x86_64-conda-linux-gnu/bin/ld: cannot find -lsun: No such file or directory
collect2: error: ld returned 1 exit status

I would try to figure out where that -lsun came from and/or where to get that library.

dxyzx0 commented 1 year ago

Any advice on which part of the log to investigate? I'm also confused by this sun library, but I googled and found nothing.

jjhursey commented 1 year ago

It looks like the sun library is a red herring. After looking at the PRRTE configure logic here, it appears configure might be trying -lsun as one of the candidates while searching for the yp_all function.

Can you try to install (or see where it might be installed already) the libnsl library (libnsl.so). Once that's in your LD_LIBRARY_PATH then that check should fix itself (I hope at least).

dxyzx0 commented 1 year ago

Unfortunately no.

--- MCA component ess:lsf (m4 configuration macro)
checking for MCA component ess:lsf compile mode... static
configure: Setting LSF includedir to /nfsshare/lsf10.1/10.1
configure: Setting LSF libdir to /nfsshare/lsf10.1/10.1/linux2.6-glibc2.3-x86_64/lib
checking for library containing yp_all... no
configure: WARNING: Could not find yp_all. Please see https://github.com/openpmix/prrte/wiki/Building-LSF-support for more details.
checking for libevent conflict... No conflict found. -levent is not being explicitly used.
configure: WARNING: LSF support requested (via --with-lsf) but not found.
configure: error: Aborting.
configure: ===== done with 3rd-party/prrte configure =====
configure: error: PRRTE configuration failed.  Cannot continue.
(cmake) abc@alpha02:~/ompi$ ldconfig -p | grep nsl
        libnsl.so.1 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib/x86_64-linux-gnu/libnsl.so.1
        libnsl.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /usr/lib/x86_64-linux-gnu/libnsl.so

jsquyres commented 1 year ago

@dxyzx0 Can you send your config.log again (from after you installed libnsl.so)?

That should tell us if configure found libnsl.so, but was unable to find the symbol yp_all in it.

awlauria commented 1 year ago

@dxyzx0 depending on the os - you may need libnsl, libnsl2 AND libnsl2-devel.

I don't know if this is the case with Ubuntu - but it is the case with RHEL. Something to check to see if the above are installed on your machine.

rhc54 commented 1 year ago

Just curious: is this a functioning LSF installation? Just a tad disturbing if so as that implies the configure logic fails when it really shouldn't. If there is something wrong with the installation, then the fact that our configure has problems makes sense and is "acceptable" in that we aren't claiming to work in broken environments.

jsquyres commented 1 year ago

@rhc54 I don't know if this cluster has a correctly-functioning LSF installation or not, but in the initial description of this issue, and from the config.log files provided, configure was invoked --with-lsf. Hence, configure is correct to fail if the LSF component can't be configured correctly.

Just for the record / for anyone who ends up here via Google: if --with-lsf was not specified on the command line and the LSF component could not be configured correctly, it would be ignored, and configure would continue.
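The behavior described above can be summarized in a small sketch. This is an illustration of the decision only, not the actual PRRTE m4 macro; `lsf_component_outcome` and its yes/no arguments are hypothetical.

```shell
#!/bin/sh
# Illustration only (not the real configure macro): what happens to the LSF
# component depending on whether it was requested and whether it was found.
#   $1: "yes" if --with-lsf was given on the command line, otherwise "no"
#   $2: "yes" if every LSF check passed, otherwise "no"
lsf_component_outcome() {
    if [ "$2" = yes ]; then
        echo build      # component is configured and built
    elif [ "$1" = yes ]; then
        echo abort      # explicitly requested but unusable: configure stops
    else
        echo skip       # not requested and unusable: silently ignored
    fi
}

lsf_component_outcome yes no   # the situation in this issue
```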

rhc54 commented 1 year ago

I fear I am not communicating clearly - let me try again. We are now asking this user to install more software packages so that configure can perhaps find what it seeks. If this is a properly working LSF installation, then that feels like the wrong approach - if LSF can work as installed, then we should detect it as installed.

On the other hand, if the user is just pointing us at an LSF package they downloaded, but is not operational, then it is entirely possible that they failed to download all its requirements - and so it is fine that we help identify the missing pieces so that LSF can work, and then our configure can pass.

Does that make sense?

jsquyres commented 1 year ago

@rhc54 Gotcha.

dxyzx0 commented 1 year ago

@rhc54 @jsquyres It's a valid LSF installation. It's my university's LSF cluster, and I have submitted and completed many jobs on it. Here's the output of bacct -u abc:

Accounting information about jobs that are: 
  - submitted by users abc, 
  - accounted on all projects.
  - completed normally or exited
  - executed on all hosts.
  - submitted to all queues.
  - accounted on all service classes.
------------------------------------------------------------------------------

SUMMARY:      ( time unit: second ) 
 Total number of done jobs:   10304      Total number of exited jobs: 21206
 Total CPU time consumed:   155969139.8      Average CPU time consumed:  4949.8
 Maximum CPU time of a job: 125925304.0      Minimum CPU time of a job:     0.0
 Total wait time in queues: 57124640.0
 Average wait time in queue: 1812.9
 Maximum wait time in queue:82758.0      Minimum wait time in queue:    0.0
 Average turnaround time:      2208 (seconds/job)
 Maximum turnaround time:   6541481      Minimum turnaround time:         0
 Average hog factor of a job:  0.23 ( cpu time / turnaround time )
 Maximum hog factor of a job:  55.14      Minimum hog factor of a job:  0.00
 Average expansion factor of a job:  50.68 ( turnaround time / run time )
 Maximum expansion factor of a job:  82758.00
 Minimum expansion factor of a job:  0.00
 Total Run time consumed:   12460049      Average Run time consumed:     395
 Maximum Run time of a job: 6541480      Minimum Run time of a job:       0
 Total throughput:             1.76 (jobs/hour)  during17948.63 hours
 Beginning time:       Jul 30 18:02      Ending time:          Aug 17 14:39

> @dxyzx0 Can you send your config.log again (from after you installed libnsl.so)?
>
> That should tell us if configure found libnsl.so, but was unable to find the symbol yp_all in it.

config_ompi.log config_prrte.log

> @dxyzx0 depending on the os - you may need libnsl, libnsl2 AND libnsl2-devel.
>
> I don't know if this is the case with Ubuntu - but it is the case with RHEL. Something to check to see if the above are installed on your machine.

I have libnsl, as shown above, and I installed libnsl2 from source. But I don't know how to check whether the header files are present.

jsquyres commented 1 year ago

From config_prrte.log, it didn't find the nsl library:

configure:30436: gcc -o conftest -O3 -DNDEBUG  -finline-functions -pthread  -I/nfsshare/home/abc/git_repos/openmpi-5.0.0rc9/build/3rd-party/libevent-2.1.12-stable -I/nfsshare/home/abc/git_repos/openmpi-5.0.0rc9/build/3rd-party/libevent-2.1.12-stable/include -I/nfsshare/home/abc/git_repos/openmpi-5.0.0rc9/build/3rd-party/hwloc-2.7.1/include -I/nfsshare/home/abc/git_repos/openmpi-5.0.0rc9/3rd-party/hwloc-2.7.1/include -I/nfsshare/home/abc/git_repos/openmpi-5.0.0rc9/build/3rd-party/openpmix/include -I/nfsshare/home/abc/git_repos/openmpi-5.0.0rc9/3rd-party/openpmix/include -I/nfsshare/home/abc/git_repos/openmpi-5.0.0rc9/build/3rd-party/openpmix/ -I/nfsshare/home/abc/git_repos/openmpi-5.0.0rc9/3rd-party/openpmix/   conftest.c -lnsl  -lm   >&5
/nfsshare/home/abc/anaconda3/envs/cmake/bin/../lib/gcc/x86_64-conda-linux-gnu/12.2.0/../../../../x86_64-conda-linux-gnu/bin/ld: cannot find -lnsl: No such file or directory
collect2: error: ld returned 1 exit status
configure:30436: $? = 1

Is libnsl.so not in your linker's default search path?
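A quick way to answer that question with the same compiler that configure uses is to link a trivial program against the library, which is essentially what the failing test above does. This is a minimal sketch; `can_link_lib` is a hypothetical helper, and it assumes the compiler is gcc unless CC is set.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the compiler can link an empty program
# against -l<name> using only its default library search path.
#   $1: library name without the "lib" prefix, e.g. nsl
can_link_lib() {
    tmpdir=$(mktemp -d) || return 2
    printf 'int main(void) { return 0; }\n' > "$tmpdir/conftest.c"
    "${CC:-gcc}" -o "$tmpdir/conftest" "$tmpdir/conftest.c" "-l$1" 2>/dev/null
    status=$?
    rm -rf "$tmpdir"
    return "$status"
}

if can_link_lib nsl; then
    echo "-lnsl is found on the default search path"
else
    echo "-lnsl is not found on the default search path"
fi
```

`gcc -print-search-dirs` lists the directories that the compiler driver itself searches, which can differ from the system defaults when the compiler comes from a separate environment, as the conda path in the error message suggests.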

rhc54 commented 1 year ago

Just for grins, I grep'd PRRTE and found that the only usage of yp_all lies in our LSF configure logic, which states:

62:          # liblsf requires yp_all, yp_get_default_domain, and ypprot_err
67:          # on RHEL: libnsl, libnsl2 AND libnsl2-devel are required to link libnsl to get yp_all.

My question is: if that were actually true, then how is this an operational LSF environment without any of these being installed? In other words, why are we testing for yp_all if a valid operational LSF environment does not actually require it? Should our configure logic be looking for something else, especially given that we don't use yp_all in PRRTE itself?

jsquyres commented 1 year ago

I was chatting with @jjhursey the other day about this; this particular configure test originates all the way back in ORTE: 01e62b1994d1c79b2f83da8cce2faa24faeb6042.

Perhaps that comment is now way out of date:

awlauria commented 1 year ago

If I remember correctly, @rhc54 , you can't even build a stand-alone LSF app without yp_all defined, let alone ompi, even if the app doesn't call it:

[awlauria@c656f7n06 ~]$ gcc submit.c -I/nfs_smpi_ci/LSF_HOME/10.1/include -L/nfs_smpi_ci/LSF_HOME/10.1/linux3.10-glibc2.17-ppc64le/lib -llsf
/tmp/ccrJrG62.o: In function `main':
submit.c:(.text+0x24): undefined reference to `lsb_submit'
/nfs_smpi_ci/LSF_HOME/10.1/linux3.10-glibc2.17-ppc64le/lib/liblsf.so: undefined reference to `yp_get_default_domain'
/nfs_smpi_ci/LSF_HOME/10.1/linux3.10-glibc2.17-ppc64le/lib/liblsf.so: undefined reference to `yp_all'
/nfs_smpi_ci/LSF_HOME/10.1/linux3.10-glibc2.17-ppc64le/lib/liblsf.so: undefined reference to `ypprot_err'

rhc54 commented 1 year ago

Ah, ok - so these things must be there. We just aren't getting the correct linkages setup, either because they aren't in the user's default library path or because the path(s) aren't being specified somehow/where.

> Does liblsf no longer require yp_all...? This is a question that IBM should answer. 😄

Absolutely agree - I'm just poking because (a) this is at base a PRRTE issue and (b) it felt like we were trying to alter the system to make PRRTE's configure pass, instead of altering configure to pass within this system. Just wanted to ensure we didn't fall down that hole blindly.

jsquyres commented 1 year ago

@rhc54 Understood. I think @awlauria is confirming that liblsf needs libnsl. FWIW, I see in https://github.com/thkukuk/libnsl that yp_all is still used in the code base. So I think we have multiple confirmations here.

So I think I re-confirm my question from https://github.com/open-mpi/ompi/issues/10943#issuecomment-1307620831 (and basically what @rhc54 just asked): Is libnsl.so not in your linker's default search path?

dxyzx0 commented 1 year ago

(cmake) abc@alpha02:~/test$ ld -lnsl
ld: cannot find -lnsl: No such file or directory

You're right, libnsl is not in my path. But how can I add it to the linker's search directories? I tried adding the path to LIBRARY_PATH, but that did not work.

(cmake) abc@alpha02:~/test$ echo $LIBRARY_PATH
/nfsshare/home/abc/opt/lib:/nfsshare/home/abc/opt/lib::/nfsshare/lsf10.1/10.1/linux2.6-glibc2.3-x86_64/lib:/nfsshare/lsf10.1/10.1/linux2.6-glibc2.3-x86_64/lib:/nfsshare/lsf10.1/10.1/linux2.6-glibc2.3-x86_64/lib:/usr/lib/x86_64-linux-gnu
(cmake) abc@alpha02:~/test$ ls ~/opt/lib/libnsl.*
/nfsshare/home/abc/opt/lib/libnsl.a   /nfsshare/home/abc/opt/lib/libnsl.so    /nfsshare/home/abc/opt/lib/libnsl.so.3.0.0
/nfsshare/home/abc/opt/lib/libnsl.la  /nfsshare/home/abc/opt/lib/libnsl.so.3

jsquyres commented 1 year ago

Were you using ld -lnsl to see if libnsl.so is in your linker search path? If so, I don't think that's the right command. You can look at the output of ldconfig -v -C /tmp/bogus, for example (and remove the temporary file /tmp/bogus that it creates afterwards) to see if libnsl.so* is somewhere in the search path.

If you need /nfsshare/home/abc/opt/lib to be in the search path, you probably need to add it to the LD_LIBRARY_PATH environment variable, not LIBRARY_PATH. Make sure to do this before you invoke Open MPI's configure, and before any Open MPI command (such as mpirun).
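An option that does not depend on either environment variable is to hand the directories to the linker explicitly through LDFLAGS, which configure uses for its link tests. As far as the GCC and ld.so documentation describe it, LIBRARY_PATH is a link-time variable read by a natively configured gcc and LD_LIBRARY_PATH is read by the runtime loader, so neither is guaranteed to influence a linker shipped in a separate environment. The sketch below turns a colon-separated list such as the LIBRARY_PATH shown earlier into -L flags, dropping empty and duplicate entries. `path_to_ldflags` is a hypothetical helper, and it assumes the directory names contain no spaces or glob characters.

```shell
#!/bin/sh
# Hypothetical helper: convert a colon-separated directory list into -L
# flags, skipping empty entries and duplicates.
#   $1: colon-separated list, e.g. "$LIBRARY_PATH"
path_to_ldflags() {
    flags=
    old_ifs=$IFS
    IFS=:
    for dir in $1; do                   # unquoted on purpose: split on ':'
        [ -n "$dir" ] || continue       # "::" yields an empty entry
        case " $flags " in
            *" -L$dir "*) ;;            # already present, skip the duplicate
            *) flags="${flags:+$flags }-L$dir" ;;
        esac
    done
    IFS=$old_ifs
    printf '%s\n' "$flags"
}

# Usage from a build directory (the directory is the one from this thread):
#   ../configure --with-lsf LDFLAGS="$(path_to_ldflags /nfsshare/home/abc/opt/lib)"
```

Passing LDFLAGS on the configure command line also records the setting in config.log, which makes later debugging easier than relying on the calling environment.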

awlauria commented 1 year ago

Is there an update on this - should this be closed? This looks like a configuration issue, removing from the v5.0.0 blocking list unless someone feels strongly otherwise.