openhpc / ohpc

OpenHPC Integration, Packaging, and Test Repo
http://openhpc.community
Apache License 2.0

Provide Score-P v8.4 and Scalasca v2.6.1 #1985

Closed: Thyre closed this 3 months ago

Thyre commented 4 months ago

Heya,

This is the second part of updating the performance tools developed at JSC to their newest versions. This MR changes how Score-P and Scalasca are installed: instead of treating Score-P as the central module, it uses the modules provided in #1983. This allows cleaner installations of these modules and lets other tools use OTF2 and OPARI2, for example.

Things to note:

Depends on #1983

github-actions[bot] commented 4 months ago

Test Results

 34 files  + 20   34 suites  +20   3m 7s :stopwatch: + 3m 1s
225 tests +184  225 :white_check_mark: +184  0 :zzz: ±0  0 :x: ±0
238 runs  +184  238 :white_check_mark: +184  0 :zzz: ±0  0 :x: ±0

Results for commit af1a4fff. ± Comparison against base commit 792e7cd3.

:recycle: This comment has been updated with latest results.

adrianreber commented 4 months ago

As described in the #1983 PR, you can trigger the testing we do locally (in a container, for example) with tests/ci/setup_slurm_and_run_tests.sh ohpc gnu13 components/perf-tools/scorep/SPECS/scorep.spec. Currently this still requires the correct mapping in tests/ci/spec_to_test_mapping.py. This is how we would run make installcheck during GitHub Actions and also during our full CI runs before a release.

Thyre commented 4 months ago

Score-P currently fails some tests because of the following error:

./compiler_filter_test: error while loading shared libraries: libotf2.so.10: cannot open shared object file: No such file or directory

The error persists even when loading OTF2 during the check step as well. I need to investigate this further.


Update (May 28th): The issue occurred because some modules did not set LD_LIBRARY_PATH. Normally we would not need it, but omitting it breaks on some systems. Since other projects might not link OTF2 and the other libraries with an rpath, I changed these modules to set LD_LIBRARY_PATH.
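
For anyone who wants to check a binary for this kind of failure, here is a minimal sketch that scans `ldd` output for unresolved libraries. It is not part of the PR; the helper names `missing_libraries` and `check_binary` are hypothetical, and the sample text is made up to resemble the libotf2.so.10 error above.

```python
import subprocess


def missing_libraries(ldd_output: str) -> list[str]:
    """Return the sonames that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # ldd prints unresolved dependencies as "<soname> => not found".
        name, sep, target = line.strip().partition(" => ")
        if sep and target.strip() == "not found":
            missing.append(name)
    return missing


def check_binary(path: str) -> list[str]:
    """Run ldd on a binary and list its unresolved shared libraries."""
    result = subprocess.run(
        ["ldd", path], capture_output=True, text=True, check=False
    )
    return missing_libraries(result.stdout)


# Made-up sample modeled on the error message quoted in this thread.
sample = (
    "\tlinux-vdso.so.1 (0x00007ffd3a5f2000)\n"
    "\tlibotf2.so.10 => not found\n"
    "\tlibc.so.6 => /lib64/libc.so.6 (0x00007f1c2a000000)\n"
)
print(missing_libraries(sample))
```

If the list is empty with the module loaded but not without it, the binary most likely relies on LD_LIBRARY_PATH rather than an rpath.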

Now, Score-P seems to fail on another check. Looking into it...


Score-P fails the constructor check because of the following reason:

When Score-P is used, libstdc++.so from /usr/lib64/ is linked, which causes issues with the gnu13 module. The module brings a newer libstdc++.so, which is ignored. This probably happens because of the dependency graph generation we currently still have (it will be removed in Score-P 9.0) combined with the binutils-devel package from the Alma Linux repository. There are solutions to this, e.g. using our bundled libbfd for Score-P only. I will try to find an elegant solution.

Thyre commented 4 months ago

With the latest force-push, Score-P can build in the Alma Linux container. However, make check fails depending on the user. With root, the OpenMPI build (expectedly) fails when testing MPI stuff. When running as ohpc, PAPI tests fail for a reason I don't understand yet. My guess is that these are more related to my set-up in a container and less to the actual files.

I also only tested gnu13 until now. Others will follow. If Score-P is verified to work, I'll check Scalasca and add a commit to include a subset of our installchecks. Marking the PR as draft until then.

adrianreber commented 4 months ago

Yes, papi testing is skipped in the container based setup. We do that by skipping certain tests in the GitHub container. Basically everything around hardware counters doesn't work in the container. Or it didn't work and we quickly decided to skip it without looking further into it.

Thyre commented 4 months ago

> Yes, papi testing is skipped in the container based setup. We do that by skipping certain tests in the GitHub container. Basically everything around hardware counters doesn't work in the container. Or it didn't work and we quickly decided to skip it without looking further into it.

I'll add a patch to Score-P then, which will skip this particular test. Thanks.

adrianreber commented 4 months ago

We set the variable SIMPLE_CI to one for things we skip in GitHub Actions.
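
As a rough illustration of that pattern, here is a minimal Python sketch of a test that skips itself when SIMPLE_CI is set. The real OpenHPC tests are not shown in this thread, so the function names and the exact comparison against "1" are assumptions.

```python
import os


def skip_in_simple_ci(environ=None) -> bool:
    """Return True when SIMPLE_CI is set to "1" (assumed meaning of 'one')."""
    env = os.environ if environ is None else environ
    return env.get("SIMPLE_CI", "0") == "1"


def run_papi_test(environ=None) -> str:
    """Hypothetical hardware-counter test that is skipped in the simple CI."""
    if skip_in_simple_ci(environ):
        return "skipped"
    # Placeholder for the real test body, which needs hardware counters.
    return "ran"


print(run_papi_test({"SIMPLE_CI": "1"}))
print(run_papi_test({}))
```

Passing the environment as a parameter keeps the check easy to exercise without touching the real process environment.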

Thyre commented 4 months ago

For Alma, both Score-P and Scalasca can now build with the GNU toolchain. Still need to check Intel compilers and push the changes.

I've encountered a new issue with OpenSUSE. Since Score-P v8.0, we require a functional libbfd, which can normally be installed as part of the binutils-devel package in OpenSUSE. However, the package is somewhat broken right now: it requires us to link several additional libraries and causes configure to fail. I'll check if using our bundled binutils solves this issue. With Alma, I ran into rpathing issues. Doing a quick check on SUSE seemed to work though.

Update: I fixed the OpenSUSE build issues by applying a patch to our configure. OpenSUSE & Alma Linux can now build both with the gnu13 toolchain. Checking Intel next.

Thyre commented 4 months ago

OpenSUSE and Alma Linux were successfully tested with Score-P & Scalasca + all three MPI variants + gnu13 / oneAPI (for x86-64), provided the packages from #1983 are installed first. Now, only the tests should be missing. I'll try to port our installchecks from a generic build. This should cover most of the instrumentation. Whether Score-P actually produces results is already checked by the existing Scalasca tests.

Thyre commented 3 months ago

Updating the CI environment works. There's a very small issue left with OTF2, which prevents building Score-P correctly.

diff --git a/components/io-libs/otf2/SPECS/otf2.spec b/components/io-libs/otf2/SPECS/otf2.spec
index cd88c58f3..da746fbda 100644
--- a/components/io-libs/otf2/SPECS/otf2.spec
+++ b/components/io-libs/otf2/SPECS/otf2.spec
@@ -32,6 +32,7 @@ BuildRequires:  chrpath dos2unix
 BuildRequires:  libtool automake
 BuildRequires:  sionlib-%{compiler_family}-%{mpi_family}%{PROJ_DELIM}
 Requires:       lmod%{PROJ_DELIM} >= 7.6.1
+Requires:       sionlib-%{compiler_family}-%{mpi_family}%{PROJ_DELIM}

 # Default library install path
 %define install_path %{OHPC_LIBS}/%{compiler_family}/%{mpi_family}/%{pname}/%version

Should I open a separate PR for this? With this fix, the pipeline should succeed, but tests and things like updating the rpmlintrc still need to be done.

adrianreber commented 3 months ago

> Should I open a separate PR for this? With this fix, the pipeline should succeed, but tests and things like updating the rpmlintrc still need to be done.

It would be cleaner, but also unnecessarily complicated. Just add another commit to this PR. Maybe re-order the commits to have the otf2.spec fix and the CI update to 3.2 fix before the scorep commits.

Thyre commented 3 months ago

Scalasca and Score-P both fail tests for OpenMPI. These issues are very likely related to the CI container:

Here's an example for Score-P.

2024-06-05T10:48:30.9515323Z --------------------------------------------------------------------------
2024-06-05T10:48:30.9515827Z There are not enough slots available in the system to satisfy the 4
2024-06-05T10:48:30.9516301Z slots that were requested by the application:
2024-06-05T10:48:30.9516559Z 
2024-06-05T10:48:30.9516648Z   ./mpi_hello_world
2024-06-05T10:48:30.9516797Z 
2024-06-05T10:48:30.9517028Z Either request fewer procs for your application, or make more slots
2024-06-05T10:48:30.9517674Z available for use.
2024-06-05T10:48:30.9517821Z 
2024-06-05T10:48:30.9518024Z A "slot" is the PRRTE term for an allocatable unit where we can
2024-06-05T10:48:30.9518662Z launch a process.  The number of slots available are defined by the
2024-06-05T10:48:30.9519213Z environment in which PRRTE processes are run:
2024-06-05T10:48:30.9519469Z 
2024-06-05T10:48:30.9519653Z   1. Hostfile, via "slots=N" clauses (N defaults to number of
2024-06-05T10:48:30.9520160Z      processor cores if not provided)
2024-06-05T10:48:30.9520739Z   2. The --host command line parameter, via a ":N" suffix on the
2024-06-05T10:48:30.9521195Z      hostname (N defaults to 1 if not provided)
2024-06-05T10:48:30.9521622Z   3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
2024-06-05T10:48:30.9522176Z   4. If none of a hostfile, the --host command line parameter, or an
2024-06-05T10:48:30.9522710Z      RM is present, PRRTE defaults to the number of processor cores
2024-06-05T10:48:30.9523038Z 
2024-06-05T10:48:30.9523257Z In all the above cases, if you want PRRTE to default to the number
2024-06-05T10:48:30.9523834Z of hardware threads instead of the number of processor cores, use the
2024-06-05T10:48:30.9524313Z --use-hwthread-cpus option.
2024-06-05T10:48:30.9524497Z 
2024-06-05T10:48:30.9524801Z Alternatively, you can use the --map-by :OVERSUBSCRIBE option to ignore the
2024-06-05T10:48:30.9525393Z number of available slots when deciding the number of processes to
2024-06-05T10:48:30.9525897Z launch.
2024-06-05T10:48:30.9526213Z --------------------------------------------------------------------------
2024-06-05T10:48:30.9526628Z FAIL: mpi_hello_world

I'll look into how this can be worked around. I really would prefer to keep the MPI tests during %check enabled for both.
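
One possible workaround, following the hint in the log above, is to add `--map-by :OVERSUBSCRIBE` only when more ranks are requested than slots are available. This is a minimal sketch and not what the PR ended up doing; `mpirun_command` is a hypothetical helper, and `os.cpu_count()` counts logical CPUs, so it only approximates the number of slots PRRTE would compute.

```python
import os
from typing import Optional


def mpirun_command(
    program: str, nprocs: int, available_slots: Optional[int] = None
) -> list[str]:
    """Build an mpirun command line, oversubscribing only when needed."""
    if available_slots is None:
        # Approximation: logical CPUs visible to this process's host.
        available_slots = os.cpu_count() or 1
    cmd = ["mpirun", "-np", str(nprocs)]
    if nprocs > available_slots:
        # Option taken from the Open MPI error message in the CI log.
        cmd += ["--map-by", ":OVERSUBSCRIBE"]
    cmd.append(program)
    return cmd


print(mpirun_command("./mpi_hello_world", 4, available_slots=2))
```

The returned list can be passed to `subprocess.run` as is. Oversubscribing changes timing behaviour, so it is better suited to smoke tests than to performance checks.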

adrianreber commented 3 months ago

We hardly do any testing during build. This all happens once the build is done and installed. The environments in which our RPMs are built can be very different. We already have GitHub Actions and OBS, and OBS comes with different builders.

Users can also rebuild packages themselves, so any assumption about the environment will probably be wrong in some cases. For us it would work to run the tests after the RPM has been built. For GitHub Actions that means a fake slurm setup, and for the real CI it means an (at least) two-node cluster with slurm/openpbs.

Thyre commented 3 months ago

> We hardly do any testing during build. This all happens once the build is done and installed. The environments in which our RPMs are built can be very different. We already have GitHub Actions and OBS, and OBS comes with different builders.
>
> Users can also rebuild packages themselves, so any assumption about the environment will probably be wrong in some cases. For us it would work to run the tests after the RPM has been built. For GitHub Actions that means a fake slurm setup, and for the real CI it means an (at least) two-node cluster with slurm/openpbs.

In that case, I would suggest removing the Score-P and Scalasca testing from the .spec file and I'll work on getting tests for Score-P ready. Scalasca already has some tests. @geimer, should we do some additional testing outside the ones done here or are those sufficient for Scalasca?

adrianreber commented 3 months ago

I see you found our Scalasca tests, nice. If you, upstream, think that this is not enough or if it could be tested better, we are happy to extend the existing tests.

adrianreber commented 3 months ago

The RHEL(intel) error happens because module load scalasca fails. Not sure why it works with the non-Intel compiler.

Thyre commented 3 months ago

I'll push the changes removing make check for Score-P / Scalasca in a moment. If it still fails, I'll look into it probably tomorrow and try to reproduce it in a container.

geimer commented 3 months ago

> In that case, I would suggest removing the Score-P and Scalasca testing from the .spec file and I'll work on getting tests for Score-P ready. Scalasca already has some tests. @geimer, should we do some additional testing outside the ones done here or are those sufficient for Scalasca?

I don't think I understood all the details yet, but it seems that these tests mostly cover Score-P and CubeLib functionality rather than Scalasca. Let's discuss this offline to then propose something better.

Thyre commented 3 months ago

> The RHEL(intel) error happens because module load scalasca fails. Not sure why it works with the non-Intel compiler.

It looks like the issue was the setup function. Removing it solved the issue. With the current tests passing, I'll work on the tests for Score-P & Scalasca.

adrianreber commented 3 months ago

> > The RHEL(intel) error happens because module load scalasca fails. Not sure why it works with the non-Intel compiler.
>
> It looks like the issue was the setup function. Removing it solved the issue. With the current tests passing, I'll work on the tests for Score-P & Scalasca.

Unfortunately the git history gives no details why the setup function exists. It seems to be from an "initial import". But I have not seen such a construct in any other test files before. So removing it seems okay.

Thyre commented 3 months ago

> Unfortunately the git history gives no details why the setup function exists. It seems to be from an "initial import". But I have not seen such a construct in any other test files before. So removing it seems okay.

There are a few other tests (TAU, Extrae & Dimemas) that include the same setup function. Might be worth checking if those fail the same way.

Thyre commented 3 months ago

New Score-P scripts need to be added to https://github.com/openhpc/ohpc/blob/3.x/tests/ci/Makefile


Tests (including the new Score-P ones) should hopefully pass now. Let's see.

Thyre commented 3 months ago

The PR should be ready from my side. I've reordered the commits just now, but that should be it. I've tested everything in a Rocky Linux 9 VM where everything seemed to work just fine.

adrianreber commented 3 months ago

Thanks for your work. This PR is huge now. At this point I think I will merge it soon, but I am a bit worried, just because it is so big.

The next OpenHPC release will probably be in November. I guess I will start with regular test runs sometime in September.

If something does not work any more I will just reach out to you :wink:

Thyre commented 3 months ago

> Thanks for your work. This PR is huge now. At this point I think I will merge it soon, but I am a bit worried, just because it is so big.

I agree that the PR has gotten quite large. A large part of it (around 1.1k additions) consists only of the added Score-P tests. The Scalasca tests are also a lot of changes, which basically boil down to moving stuff around to allow testing MPI and OpenMP variants. You should be able to look at the individual commits, as they're each focused on a single thing.

> If something does not work any more I will just reach out to you :wink:

Sure! You can easily reach me via mail and on GitHub. Also, if there are any questions, feel free to reach out 😄

adrianreber commented 3 months ago

@Thyre building fails on aarch64 Leap15

https://obs.openhpc.community/project/monitor/OpenHPC3:3.2:Factory?arch_aarch64=1&defaults=0&failed=1&repo_Leap_15=1

Can you take a look? I think you already had to work around "configure: error: Cannot link libbfd (and dependencies)." previously, right?

Thyre commented 3 months ago

> @Thyre building fails on aarch64 Leap15
>
> https://obs.openhpc.community/project/monitor/OpenHPC3:3.2:Factory?arch_aarch64=1&defaults=0&failed=1&repo_Leap_15=1
>
> Can you take a look? I think you already had to work around "configure: error: Cannot link libbfd (and dependencies)." previously, right?

Interesting, seems like x86_64 failed as well. I'll take a look next week as I'm on vacation right now.

The patch to work around the initial issue (libbfd existing only as a static library and requiring additional libs to be linked) is still there. I wonder what has changed.

adrianreber commented 3 months ago

@mslacken Just tagging you here in case you have an idea why linking against libbfd fails on Leap.

Thyre commented 2 months ago

I've noticed that the Open Build Service build is trying to install an older version of binutils-devel compared to the CI build:

OBS:

########################################
[  642s] binutils-devel-2.39-150100.7.40.1

CI:

2024-07-04T14:57:57.2150467Z Retrieving: binutils-devel-2.41-150100.7.46.1.x86_64 (Update repository with updates from SUSE Linux Enterprise 15) (7/9),  14.3 MiB 

I'll check if I can get the older version installed in my VM next week. Then, I can investigate what breaks our libbfd detection.

adrianreber commented 2 months ago

Interesting that the package versions are different. The build system does not access the repositories in the same way as our GitHub Actions CI, so that might be the reason. I never really understood how the build system (OBS) downloads the RPMs. It uses some OBS-specific way and not the published repositories, as far as I know. It is confusing.

mslacken commented 2 months ago

Right, it's completely confusing. I would just wait a few days and see whether binutils-devel gets updated by the Open Build Service. Sorry that I could not provide more insight here.

adrianreber commented 2 months ago

@Thyre All your tests are running successfully in the real CI environment:

https://repos.openhpc.community/results/3/3.2/2024-07-05-09-33-05-PASS-OHPC-3.2-almalinux9.2-x86_64-slurm-3483/junit.html

Cluster with one head node and two compute nodes.

Thyre commented 2 months ago

> @Thyre building fails on aarch64 Leap15
>
> https://obs.openhpc.community/project/monitor/OpenHPC3:3.2:Factory?arch_aarch64=1&defaults=0&failed=1&repo_Leap_15=1
>
> Can you take a look? I think you already had to work around "configure: error: Cannot link libbfd (and dependencies)." previously, right?

It basically boils down to this:

binutils-devel-2.41-150100.7.46.1 requires linking -lsframe for it to work correctly. Trying to use the same option with binutils-devel-2.39-150100.7.40.1 will cause sframe to not be found.

configure:35956: ./libtool --tag=CC --mode=link $CC $CFLAGS $CPPFLAGS $LTLDFLAGS -o libconftest.la -rpath `pwd` libconftest.lo $LTLIBS >&5
libtool: link: mpicc -shared  -fPIC -DPIC  .libs/libconftest.o   -lbfd -liberty -lz -ldl -lsframe  -O3 -g -fstack-protector-strong -grecord-gcc-switches -mtune=generic -m64   -Wl,-soname -Wl,libconftest.so.0 -o .libs/libconftest.so.0.0.0
/usr/bin/ld: cannot find -lsframe: No such file or directory
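
To make the version dependence concrete, here is a minimal sketch that picks the libbfd link flags from the binutils-devel package version. The flag list is copied from the libtool line above. The 2.41 threshold is an assumption based only on the two versions observed in this thread, and the function names are hypothetical.

```python
def parse_upstream_version(package_version: str) -> tuple[int, ...]:
    """Extract the upstream part of an RPM version, e.g. "2.41-150100.7.46.1"."""
    upstream = package_version.split("-", 1)[0]
    return tuple(int(part) for part in upstream.split("."))


def libbfd_link_flags(package_version: str) -> list[str]:
    """Return link flags for libbfd, adding -lsframe for newer binutils."""
    flags = ["-lbfd", "-liberty", "-lz", "-ldl"]
    # Assumption: only 2.39 (no sframe) and 2.41 (needs sframe) were observed.
    if parse_upstream_version(package_version) >= (2, 41):
        flags.append("-lsframe")
    return flags


for version in ("2.39-150100.7.40.1", "2.41-150100.7.46.1"):
    print(version, " ".join(libbfd_link_flags(version)))
```

A link probe at configure time would be more robust than a version comparison, since it does not depend on how a distribution packages the library.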

Thyre commented 2 months ago

> Interesting that the package versions are different. The build system does not access the repositories in the same way as our GitHub Actions CI, so that might be the reason. I never really understood how the build system (OBS) downloads the RPMs. It uses some OBS-specific way and not the published repositories, as far as I know. It is confusing.

2.41 seems to come from the SUSE Linux Enterprise repositories. I guess they aren't enabled in the OBS builders. This is why the older 2.39 is used instead.

jreuter@localhost:~> sudo zypper info binutils-devel
Loading repository data...
Reading installed packages...

Information for package binutils-devel:
---------------------------------------
Repository     : openSUSE-Leap-15.5-1
Name           : binutils-devel
Version        : 2.39-150100.7.40.1
Arch           : x86_64
Vendor         : SUSE LLC <https://www.suse.com/>
Installed Size : 50.5 MiB
Installed      : Yes
Status         : up-to-date
Source package : binutils-2.39-150100.7.40.1.src
Upstream URL   : https://www.gnu.org/software/binutils/
Summary        : GNU binutils (BFD development files)
Description    : 
    This package includes header files and static libraries necessary to
    build programs which use the GNU BFD library, which is part of
    binutils.

jreuter@localhost:~> # Manually enabled the SLE15 repo
jreuter@localhost:~> sudo zypper info binutils-devel
Loading repository data...
Reading installed packages...

Information for package binutils-devel:
---------------------------------------
Repository     : Update repository with updates from SUSE Linux Enterprise 15
Name           : binutils-devel
Version        : 2.41-150100.7.46.1
Arch           : x86_64
Vendor         : SUSE LLC <https://www.suse.com/>
Installed Size : 52.9 MiB
Installed      : Yes
Status         : out-of-date (version 2.39-150100.7.40.1 installed)
Source package : binutils-2.41-150100.7.46.1.src
Upstream URL   : https://www.gnu.org/software/binutils/
Summary        : GNU binutils (BFD development files)
Description    : 
    This package includes header files and static libraries necessary to
    build programs which use the GNU BFD library, which is part of
    binutils.

How should we proceed here? The two versions require different patches to work correctly.

Thyre commented 2 months ago

OpenSUSE Leap 15.6 provides the new version. Leap 15.5 only has the old one in the OSS repositories:

f85b81678f65:/ # cat /etc/os-release | head -n 2
NAME="openSUSE Leap"
VERSION="15.5"
f85b81678f65:/ # zypper search -s binutils-devel
Loading repository data...
Reading installed packages...

S | Name                   | Type    | Version            | Arch   | Repository
--+------------------------+---------+--------------------+--------+-------------------------------------------------------------
  | binutils-devel         | package | 2.41-150100.7.46.1 | x86_64 | Update repository with updates from SUSE Linux Enterprise 15
  | binutils-devel         | package | 2.39-150100.7.43.2 | x86_64 | Update repository with updates from SUSE Linux Enterprise 15
  | binutils-devel         | package | 2.39-150100.7.40.1 | x86_64 | Main Repository
  | binutils-devel-32bit   | package | 2.41-150100.7.46.1 | x86_64 | Update repository with updates from SUSE Linux Enterprise 15
  | binutils-devel-32bit   | package | 2.39-150100.7.43.2 | x86_64 | Update repository with updates from SUSE Linux Enterprise 15
  | binutils-devel-32bit   | package | 2.39-150100.7.40.1 | x86_64 | Main Repository
  | mingw64-binutils-devel | package | 2.32-bp155.2.11    | noarch | Main Repository

jreuter@localhost:~> cat /etc/os-release | head -n 2
NAME="openSUSE Leap"
VERSION="15.6"
jreuter@localhost:~> sudo zypper search -s binutils-devel
Loading repository data...
Reading installed packages...

S  | Name                   | Type    | Version            | Arch   | Repository
---+------------------------+---------+--------------------+--------+---------------------
i+ | binutils-devel         | package | 2.41-150100.7.46.1 | x86_64 | openSUSE-Leap-15.6-1
i+ | binutils-devel         | package | 2.41-150100.7.46.1 | x86_64 | Main Repository
   | binutils-devel-32bit   | package | 2.41-150100.7.46.1 | x86_64 | openSUSE-Leap-15.6-1
   | binutils-devel-32bit   | package | 2.41-150100.7.46.1 | x86_64 | Main Repository
   | mingw64-binutils-devel | package | 2.32-bp156.3.1     | noarch | openSUSE-Leap-15.6-1
   | mingw64-binutils-devel | package | 2.32-bp156.3.1     | noarch | Main Repository

I will check what happens if a package is built with the new / old binutils-devel and installed on a system with the old / new version.
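
For comparing such listings across systems, a small parser for the pipe-separated table can help. This is a minimal sketch: the function names are hypothetical, and the sample is a shortened copy of the Leap 15.5 output above.

```python
def parse_zypper_table(text: str) -> list[dict[str, str]]:
    """Parse a 'zypper search -s' style table into one dict per row."""
    rows = []
    header = None
    for line in text.splitlines():
        if "|" not in line:
            continue  # skips status lines and the "--+--" separator
        cells = [cell.strip() for cell in line.split("|")]
        if header is None:
            header = cells  # first table line holds the column names
            continue
        rows.append(dict(zip(header, cells)))
    return rows


def versions_for(rows: list[dict[str, str]], name: str) -> list[tuple[str, str]]:
    """Return sorted (repository, version) pairs for an exact package name."""
    return sorted(
        {(row["Repository"], row["Version"]) for row in rows if row["Name"] == name}
    )


sample = """\
S | Name                 | Type    | Version            | Arch   | Repository
--+----------------------+---------+--------------------+--------+----------------
  | binutils-devel       | package | 2.41-150100.7.46.1 | x86_64 | Update repository with updates from SUSE Linux Enterprise 15
  | binutils-devel       | package | 2.39-150100.7.40.1 | x86_64 | Main Repository
  | binutils-devel-32bit | package | 2.39-150100.7.40.1 | x86_64 | Main Repository
"""
for repository, version in versions_for(parse_zypper_table(sample), "binutils-devel"):
    print(f"{version}  ({repository})")
```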


Update:

Built with 2.41, then force-downgraded to 2.39 (zypper will complain because of the dependency on 2.41):

f85b81678f65:/ohpc # scorep-gcc test.c
/usr/bin/ld: cannot find -lsframe: No such file or directory
collect2: error: ld returned 1 exit status
[Score-P] ERROR: Execution failed: gcc .scorep_init.o /opt/ohpc/pub/libs/gnu13/mpich/scorep/8.4/lib/scorep/scorep_compiler_gcc_plugin_begin.o test_1720424246_625173.o /opt/ohpc/pub/libs/gnu13/mpich/scorep/8.4/lib/scorep/scorep_compiler_gcc_plugin_end.o `/opt/ohpc/pub/libs/gnu13/mpich/scorep/8.4/bin/scorep-config --thread=none --mpp=none --io=none --nocuda --noopencl --noopenacc --nomemory --nokokkos --nohip --constructor` `/opt/ohpc/pub/libs/gnu13/mpich/scorep/8.4/bin/scorep-config --thread=none --mpp=none --io=none --nocuda --noopencl --noopenacc --nomemory --nokokkos --nohip --ldflags`  -Wl,-start-group `/opt/ohpc/pub/libs/gnu13/mpich/scorep/8.4/bin/scorep-config --thread=none --mpp=none --io=none --nocuda --noopencl --noopenacc --nomemory --nokokkos --nohip --event-libs`  `/opt/ohpc/pub/libs/gnu13/mpich/scorep/8.4/bin/scorep-config --thread=none --mpp=none --io=none --nocuda --noopencl --noopenacc --nomemory --nokokkos --nohip --mgmt-libs` -Wl,-end-group

Building with 2.39 and then upgrading to 2.41 seems to work when just doing a basic test, but I cannot guarantee that this will not break for more complex examples. The build will certainly break once OBS updates to 2.41. It will also break the CI, as 2.41 is installed there...

I opened an issue in the OpenSUSE bugzilla a month ago. If this would be fixed in the package itself, things would be much easier: https://bugzilla.suse.com/show_bug.cgi?id=1225824

adrianreber commented 2 months ago

@Thyre I was able to add the Update repository with updates from SUSE Linux Enterprise 15 repository as an external repository to the build system. I think we are all good now. No need to find workarounds.