open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

torque 6 misses pbs-config #4914

Closed andre-merzky closed 6 years ago

andre-merzky commented 6 years ago

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

OpenMPI master branch

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

We install from a git clone on Titan, a Cray XK7. The system recently switched from Torque 4.x to 6.x. The 4.x modules are gone now, and so is pbs-config. OMPI's configure script, however, relies on pbs-config being available.

Please describe the system on which you are running


Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

As stated above: OMPI's configure script needs pbs-config, which is no longer installed by default on Crays after the switch to Torque v6.x. The default MPI on Cray is MPICH; our custom OMPI installation is what complains.

ggouaillardet commented 6 years ago

I do not think pbs-config is mandatory in order to build tm support (e.g. PBS/Torque).

Could you please compress and upload your config.log ?

andre-merzky commented 6 years ago

Indeed, I am running configure ... --with-tm, and pbs-config is pulled in by that option. But I may be missing something here, of course: without that option, Open MPI does not seem to be able to learn about the available compute nodes. Or at least I saw errors in job placement on mpirun. When using that option (previously, with Torque 4.x, which had pbs-config), that problem disappeared.

I attached the config.log, but will also ping back later to verify that leaving out --with-tm results in that placement failure.

config.log.gz

ggouaillardet commented 6 years ago

I will check the log later, thanks.

I reviewed config/orte_check_tm.m4 and unless I misunderstood that part

Bottom line: if configure --with-tm succeeds, it means tm support was built successfully.
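
One way to double-check that claim after a build is to look for tm components in the output of ompi_info. The following is a minimal sketch, not part of Open MPI itself; the helper name has_tm_component is hypothetical, and the pattern assumes ompi_info prints component lines of the form "MCA <framework>: <component> (...)".

```shell
#!/bin/sh
# Hypothetical helper: reads `ompi_info` output on stdin and succeeds
# if any framework lists a component named "tm".
# Assumption: component lines look like "MCA plm: tm (MCA v..., ...)".
has_tm_component() {
    grep -Eq 'MCA[[:space:]]+[a-z]+:[[:space:]]+tm([[:space:]]|$)'
}

# Only run the check when ompi_info is actually on the PATH.
if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info | has_tm_component; then
        echo "tm components were built"
    else
        echo "no tm components found"
    fi
fi
```

If no tm component shows up, the config.log is still the place to find out why the tm check failed.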

ggouaillardet commented 6 years ago

I checked the log, tm.h was not found, and though --with-tm was specified, configure did not abort (!)

Where is tm.h on your system? If it is in /usr/include/torque/tm.h, then you have to run configure --with-tm CPPFLAGS=-I/usr/include/torque
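
A small sketch of how one might locate the header and derive the CPPFLAGS setting from it. The function name find_tm_include and the list of candidate directories are assumptions made for illustration; only /usr/include/torque comes from this thread.

```shell
#!/bin/sh
# Hypothetical helper: print the first directory among the arguments
# that contains tm.h; return non-zero if none of them does.
find_tm_include() {
    for dir in "$@"; do
        if [ -f "$dir/tm.h" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# The candidate directories below are guesses; adjust them for your site.
if inc=$(find_tm_include /usr/include /usr/include/torque /usr/local/include); then
    echo "tm.h found in $inc; try: ./configure --with-tm CPPFLAGS=-I$inc"
else
    echo "tm.h not found in the candidate directories" >&2
fi
```

If tm.h is already in /usr/include, the extra CPPFLAGS should be harmless, since that directory is normally on the default search path anyway.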

I will investigate why configure did not abort.

ggouaillardet commented 6 years ago

my bad, configure did abort as expected.

since there is no pbs-config you have two options

andre-merzky commented 6 years ago

I should put a hold on this ticket: configurations which resulted in a usable installation on Titan don't work anymore, presumably because of more system changes. Just in case any of you are currently working on Titan or any other Cray (Blue Waters is also on my TODO list), I would appreciate any advice. Otherwise I'll ping back once I have figured out how to build again.

Thanks, Andre.

rhc54 commented 6 years ago

Well, I just downloaded and built Torque master (which is somewhere on the 6.1 path), and then built OMPI master by simply pointing --with-tm=DIR where DIR is the install location I used for Torque. OMPI correctly picked up the Torque support and built all tm components.
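
Before pointing --with-tm=DIR at a private Torque install, it can help to confirm the directory has the layout configure will look for. This is only a sketch: check_tm_prefix and TORQUE_PREFIX are hypothetical names, and the assumption that the header lives in DIR/include and the library is named libtorque under DIR/lib or DIR/lib64 should be verified against your own install.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given prefix looks like a Torque
# install, i.e. it has include/tm.h and a libtorque.* in lib or lib64.
check_tm_prefix() {
    prefix=$1
    if [ ! -f "$prefix/include/tm.h" ]; then
        echo "missing $prefix/include/tm.h" >&2
        return 1
    fi
    # An unmatched glob stays literal, so test that each candidate exists.
    for lib in "$prefix"/lib/libtorque.* "$prefix"/lib64/libtorque.*; do
        [ -e "$lib" ] && return 0
    done
    echo "no libtorque found under $prefix/lib or $prefix/lib64" >&2
    return 1
}

# TORQUE_PREFIX is an illustrative variable, not something Open MPI reads.
if [ -n "${TORQUE_PREFIX:-}" ] && check_tm_prefix "$TORQUE_PREFIX"; then
    echo "try: ./configure --with-tm=$TORQUE_PREFIX"
fi
```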

I can't try executing it, of course, but things seem to build just fine. It is quite possible that someone has changed the format of the PBS_NODEFILE, though I'd be surprised as that would break a ton of code out there.

ggouaillardet commented 6 years ago

@rhc54 if torque is built with the default RPM .spec file, then the header is in /usr/include/torque.

So unless pbs-config can be used to set CPPFLAGS, or we update our configury to search this non-standard directory (just like we do for pmi), Open MPI is currently unable to find the header file and hence unable to build tm support out of the box.

FWIW, I built the RPMs from the latest 6.1.2 tarball, and pbs-config is part of the torque-devel package along with the /usr/include/torque/tm.h header file, so maybe this package is not installed, or there could be another site-specific issue.

@andre-merzky once upon a time, a related issue was reported (tm support is built, but does not use the allocated nodes). We ended up concluding that the Torque install was busted on that site, and pbsdsh was used to demonstrate this. So you might want to try the following script first; it should print (at least) two different hostnames.

#!/bin/sh
#PBS -l nodes=2

pbsdsh hostname
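
To make the "two different hostnames" check explicit, the output can be reduced to a count of distinct hosts. A minimal sketch, assuming it runs inside the same two-node job; count_unique_hosts is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read hostnames on stdin (one per line) and print
# how many distinct ones there are. tr strips the padding some wc
# implementations add.
count_unique_hosts() {
    sort -u | wc -l | tr -d ' '
}

# Only meaningful inside a PBS/Torque job where pbsdsh is available.
if command -v pbsdsh >/dev/null 2>&1; then
    n=$(pbsdsh hostname | count_unique_hosts)
    if [ "$n" -ge 2 ]; then
        echo "pbsdsh reached $n distinct hosts"
    else
        echo "pbsdsh reached only $n host(s); the Torque setup may be broken" >&2
    fi
fi
```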

jsquyres commented 6 years ago

@andre-merzky Ok, thanks. I'll close this issue -- let us know if we should re-open it.

andre-merzky commented 6 years ago

Thanks for the feedback @jsquyres, @rhc54! Titan support has not yet installed the torque-devel package, but in the meantime I followed Ralph's advice and am using a private Torque installation. Compilation is smooth that way.