Closed: Thyre closed this pull request 3 months ago
34 files +20, 34 suites +20, 3m 7s :stopwatch: +3m 1s / 225 tests +184: 225 :white_check_mark: +184, 0 :zzz: ±0, 0 :x: ±0 / 238 runs +184: 238 :white_check_mark: +184, 0 :zzz: ±0, 0 :x: ±0
Results for commit af1a4fff. ± Comparison against base commit 792e7cd3.
:recycle: This comment has been updated with latest results.
As described in PR #1983, you can trigger the testing we do locally (in a container, for example) with tests/ci/setup_slurm_and_run_tests.sh ohpc gnu13 components/perf-tools/scorep/SPECS/scorep.spec. Currently this still needs the correct mapping in tests/ci/spec_to_test_mapping.py. That is how we would run make installcheck during GitHub Actions and also during our full CI runs before a release.
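The mapping step could conceptually look like the following shell sketch. The concrete spec paths and test names below are made up for illustration; the real mapping lives in tests/ci/spec_to_test_mapping.py and is written in Python.

```shell
#!/bin/sh
# Hypothetical sketch of a spec-to-test mapping, mirroring the idea behind
# tests/ci/spec_to_test_mapping.py. Paths and test names are illustrative.
map_spec_to_test() {
  case "$1" in
    */scorep/SPECS/scorep.spec)     echo "scorep" ;;
    */scalasca/SPECS/scalasca.spec) echo "scalasca" ;;
    *)                              echo "unknown" ;;
  esac
}
map_spec_to_test "components/perf-tools/scorep/SPECS/scorep.spec"
```

A spec file without an entry falls through to "unknown", which mirrors why the mapping file has to be kept up to date when new packages are added.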
Score-P currently fails some tests because of the following error:
./compiler_filter_test: error while loading shared libraries: libotf2.so.10: cannot open shared object file: No such file or directory
The error persists even when the OTF2 module is loaded during the check step. I need to investigate this further.
Update (May 28th): The issue occurred because some modules were missing LD_LIBRARY_PATH entries. Normally we would not need them, but this breaks on some systems. Since other projects might not link OTF2 and friends with rpaths, I changed these modules to include LD_LIBRARY_PATH.
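In effect, the module change just puts the library directory on LD_LIBRARY_PATH. A minimal shell sketch of what that amounts to; the OTF2 install prefix below is hypothetical, and real OpenHPC modules do this via Lmod rather than plain shell:

```shell
# Minimal sketch of the LD_LIBRARY_PATH fix. OTF2_LIB is a hypothetical
# install prefix; real OpenHPC modules prepend this via Lmod.
OTF2_LIB=/opt/ohpc/pub/libs/gnu13/otf2/lib
LD_LIBRARY_PATH="${OTF2_LIB}${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export LD_LIBRARY_PATH
# The dynamic linker now searches OTF2_LIB first, so libotf2.so.10 can
# resolve even for binaries that were not linked with an rpath.
printf '%s\n' "$LD_LIBRARY_PATH" | tr ':' '\n' | head -n 1
```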
Now, Score-P seems to fail on another check. Looking into it...
Score-P fails the constructor check for the following reason: when Score-P is used, libstdc++.so from /usr/lib64/ is linked, which causes issues with the gnu13 module. The module brings a newer libstdc++.so, which is ignored. This probably happens because of the dependency graph generation we currently still have (it will be removed in Score-P 9.0) and the binutils-devel package from the Alma Linux repository. There are solutions to this, e.g. using our bundled libbfd for Score-P only. I will try to find an elegant solution.
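The shadowing mechanics can be modeled as a first-match search over a directory list: whichever directory ends up earlier in the effective search order wins. The directories below are temporary stand-ins, not the real layout:

```shell
# Toy model of first-match library resolution: the first directory in the
# search order that contains the file wins. This is how an old system
# libstdc++.so can shadow the newer copy from the gnu13 module.
resolve_lib() {
  lib=$1; shift
  for d in "$@"; do
    if [ -e "$d/$lib" ]; then echo "$d/$lib"; return 0; fi
  done
  echo "not found"
}
demo=$(mktemp -d)
mkdir -p "$demo/usr_lib64" "$demo/gnu13_lib"
touch "$demo/usr_lib64/libstdc++.so.6" "$demo/gnu13_lib/libstdc++.so.6"
# If the system directory ends up earlier in the effective search order,
# its (older) copy shadows the module's newer one:
resolve_lib libstdc++.so.6 "$demo/usr_lib64" "$demo/gnu13_lib"
```

Fixes therefore come down to reordering the effective search (rpaths, LD_LIBRARY_PATH) or avoiding the dependency that drags in /usr/lib64, e.g. the bundled libbfd mentioned above.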
With the latest force-push, Score-P can build in the Alma Linux container. However, make check fails depending on the user. With root, the OpenMPI build (expectedly) fails when testing MPI functionality. When running as ohpc, PAPI tests fail for a reason I don't understand yet. My guess is that these failures are more related to my container set-up and less to the actual files. I have also only tested gnu13 so far; others will follow. Once Score-P is verified to work, I'll check Scalasca and add a commit to include a subset of our installchecks. Marking the PR as a draft until then.
Yes, PAPI testing is skipped in the container-based setup. We do that by skipping certain tests in the GitHub container. Basically everything around hardware counters doesn't work in the container. Or it didn't work, and we quickly decided to skip it without looking further into it.
I'll add a patch to Score-P then, which will skip this particular test. Thanks.
We have the variable SIMPLE_CI set to one for things we are skipping in GitHub Actions.
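A hedged sketch of how a test script could honor that toggle; the variable name SIMPLE_CI comes from the comment above, but the skip mechanics shown here are illustrative:

```shell
# Illustrative skip guard: hardware-counter tests bail out early when the
# SIMPLE_CI toggle is set, as done for the GitHub Actions container.
SIMPLE_CI=1   # set inline for demonstration; CI would export it instead
if [ "${SIMPLE_CI:-0}" = "1" ]; then
  msg="SKIP: PAPI / hardware-counter tests (SIMPLE_CI set)"
else
  msg="running PAPI tests"
fi
echo "$msg"
```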
For Alma, both Score-P and Scalasca can now build with the GNU toolchain. Still need to check Intel compilers and push the changes.
I've encountered a new issue with OpenSUSE. Since Score-P v8.0, we require a functional libbfd, which can normally be installed as part of the binutils-devel package on OpenSUSE. However, the package is somewhat broken right now, requiring us to link several additional libraries, and causes configure to fail. I'll check whether using our bundled binutils solves this issue. With Alma, I ran into rpath issues. A quick check on SUSE seemed to work, though.
Update: I fixed the OpenSUSE build issues by applying a patch to our configure. OpenSUSE & Alma Linux can now build both with the gnu13 toolchain. Checking Intel next.
OpenSUSE and Alma Linux were successfully tested with Score-P & Scalasca + all three MPI variants + gnu13 / oneAPI (for x86-64), provided the packages from #1983 are installed first. Now only the tests are missing. I'll try to port our installchecks from a generic build, which should cover most of the instrumentation. Whether Score-P actually produces results is already checked by the existing Scalasca tests.
Updating the CI environment works. There's a very small issue left with OTF2, which prevents building Score-P correctly.
diff --git a/components/io-libs/otf2/SPECS/otf2.spec b/components/io-libs/otf2/SPECS/otf2.spec
index cd88c58f3..da746fbda 100644
--- a/components/io-libs/otf2/SPECS/otf2.spec
+++ b/components/io-libs/otf2/SPECS/otf2.spec
@@ -32,6 +32,7 @@ BuildRequires: chrpath dos2unix
BuildRequires: libtool automake
BuildRequires: sionlib-%{compiler_family}-%{mpi_family}%{PROJ_DELIM}
Requires: lmod%{PROJ_DELIM} >= 7.6.1
+Requires: sionlib-%{compiler_family}-%{mpi_family}%{PROJ_DELIM}
# Default library install path
%define install_path %{OHPC_LIBS}/%{compiler_family}/%{mpi_family}/%{pname}/%version
Should I open a separate PR for this? With this fix, the pipeline should succeed, but tests and things like updating the rpmlintrc still need to be done.
It would be cleaner, but also unnecessarily complicated. Just add another commit to this PR. Maybe re-order the commits to have the otf2.spec fix and the CI update to 3.2 before the scorep commits.
Scalasca and Score-P both fail tests for OpenMPI. These issues are very likely related to the CI container:
Here's an example for Score-P.
2024-06-05T10:48:30.9515323Z --------------------------------------------------------------------------
2024-06-05T10:48:30.9515827Z There are not enough slots available in the system to satisfy the 4
2024-06-05T10:48:30.9516301Z slots that were requested by the application:
2024-06-05T10:48:30.9516559Z
2024-06-05T10:48:30.9516648Z ./mpi_hello_world
2024-06-05T10:48:30.9516797Z
2024-06-05T10:48:30.9517028Z Either request fewer procs for your application, or make more slots
2024-06-05T10:48:30.9517674Z available for use.
2024-06-05T10:48:30.9517821Z
2024-06-05T10:48:30.9518024Z A "slot" is the PRRTE term for an allocatable unit where we can
2024-06-05T10:48:30.9518662Z launch a process. The number of slots available are defined by the
2024-06-05T10:48:30.9519213Z environment in which PRRTE processes are run:
2024-06-05T10:48:30.9519469Z
2024-06-05T10:48:30.9519653Z 1. Hostfile, via "slots=N" clauses (N defaults to number of
2024-06-05T10:48:30.9520160Z processor cores if not provided)
2024-06-05T10:48:30.9520739Z 2. The --host command line parameter, via a ":N" suffix on the
2024-06-05T10:48:30.9521195Z hostname (N defaults to 1 if not provided)
2024-06-05T10:48:30.9521622Z 3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
2024-06-05T10:48:30.9522176Z 4. If none of a hostfile, the --host command line parameter, or an
2024-06-05T10:48:30.9522710Z RM is present, PRRTE defaults to the number of processor cores
2024-06-05T10:48:30.9523038Z
2024-06-05T10:48:30.9523257Z In all the above cases, if you want PRRTE to default to the number
2024-06-05T10:48:30.9523834Z of hardware threads instead of the number of processor cores, use the
2024-06-05T10:48:30.9524313Z --use-hwthread-cpus option.
2024-06-05T10:48:30.9524497Z
2024-06-05T10:48:30.9524801Z Alternatively, you can use the --map-by :OVERSUBSCRIBE option to ignore the
2024-06-05T10:48:30.9525393Z number of available slots when deciding the number of processes to
2024-06-05T10:48:30.9525897Z launch.
2024-06-05T10:48:30.9526213Z --------------------------------------------------------------------------
2024-06-05T10:48:30.9526628Z FAIL: mpi_hello_world
I'll look into how this can be worked around. I would really prefer to keep the MPI tests during %check enabled for both.
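One possible workaround comes straight from the hint in the PRRTE message itself: allow oversubscription. The sketch below only constructs and prints the invocation, since actually running it assumes an Open MPI install and a built mpi_hello_world binary:

```shell
# Build the oversubscribing invocation suggested by the error text above.
# Running it requires Open MPI and a compiled mpi_hello_world, so the
# command is only assembled and shown here.
NP=4
MPIRUN_CMD="mpirun --map-by :OVERSUBSCRIBE -np $NP ./mpi_hello_world"
echo "$MPIRUN_CMD"
```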
We hardly do any testing during build. This all happens once the build is done and installed. The environment our RPMs are built in can be very different. We already have GitHub Actions and OBS, and OBS comes with different builders. Users can also rebuild packages themselves, so any assumption about the environment will probably be wrong in some case. For us it would work to run the tests after the RPM has been built. For GitHub Actions that means a fake Slurm setup, and for the real CI it means an (at least) two-node cluster with Slurm/OpenPBS.
In that case, I would suggest removing the Score-P and Scalasca testing from the .spec file, and I'll work on getting tests for Score-P ready. Scalasca already has some tests. @geimer, should we do some additional testing beyond the ones done here, or are those sufficient for Scalasca?
I see you found our Scalasca tests, nice. If you, upstream, think that this is not enough or if it could be tested better, we are happy to extend the existing tests.
The RHEL (Intel) error happens because module load scalasca fails. Not sure why it works with the non-Intel compiler.
I'll push the changes removing make check for Score-P / Scalasca in a moment. If it still fails, I'll look into it (probably tomorrow) and try to reproduce it in a container.
I don't think I understood all the details yet, but it seems that these tests mostly cover Score-P and CubeLib functionality rather than Scalasca. Let's discuss this offline to then propose something better.
It looks like the issue was the setup function. Removing it solved the issue. With the current tests passing, I'll work on the tests for Score-P & Scalasca.
Unfortunately the git history gives no details why the setup function exists. It seems to be from an "initial import". But I have not seen such a construct in any other test files before. So removing it seems okay.
There are a few other tests (TAU, Extrae & Dimemas) which include the same setup function. It might be worth checking whether those fail the same way.
New Score-P scripts need to be added to https://github.com/openhpc/ohpc/blob/3.x/tests/ci/Makefile
Tests (including the new Score-P ones) should hopefully pass now. Let's see.
The PR should be ready from my side. I've reordered the commits just now, but that should be it. I've tested everything in a Rocky Linux 9 VM where everything seemed to work just fine.
Thanks for your work. This PR is huge now. At this point I think I will merge it soon, but I am a bit worried, just because it is so big.
The next OpenHPC release will probably be in November. I guess I will start with regular test runs sometime in September.
If something does not work any more I will just reach out to you :wink:
I agree that the PR has gotten quite large. A large part (around 1.1k additions) is just the added Score-P tests. The Scalasca tests are also a lot of changes, which basically boil down to moving things around to allow testing the MPI and OpenMP variants. You should be able to look at the individual commits, as they're each focused on a single thing.
Sure! You can easily reach me via mail and on GitHub. Also, if there are any questions, feel free to reach out 😄
@Thyre building fails on aarch64 Leap15
Can you take a look? I think you already had to work around "configure: error: Cannot link libbfd (and dependencies)." previously, right?
Interesting, seems like x86_64 failed as well. I'll take a look next week as I'm on vacation right now.
The patch to work around the initial issue (libbfd only existing statically, requiring additional libraries to be linked) is still there. I wonder what has changed.
@mslacken Just tagging you here in case you have an idea why linking against libbfd fails on Leap.
I've noticed that the Open Build Service build is trying to install an older version of binutils-devel compared to the CI build:
OBS:
########################################
[ 642s] binutils-devel-2.39-150100.7.40.1
CI:
2024-07-04T14:57:57.2150467Z Retrieving: binutils-devel-2.41-150100.7.46.1.x86_64 (Update repository with updates from SUSE Linux Enterprise 15) (7/9), 14.3 MiB
I'll check if I can get the older version installed in my VM next week. Then I can investigate what breaks our libbfd detection.
Interesting that the package versions are different. The build system does not access the repositories in the same way as our GitHub Actions CI, so that might be the reason. I never really understood how the build system (OBS) downloads the RPMs. It uses some OBS-specific mechanism and not the published repositories, as far as I know. It is confusing.
Right, it's completely confusing. I would just wait a few days and see whether binutils-devel gets updated by the Open Build Service.
Sorry that I could not get more insight here.
@Thyre All your tests are running successfully in the real CI environment:
Cluster with one head node and two compute nodes.
The libbfd linking failure basically boils down to this: binutils-devel-2.41-150100.7.46.1 requires linking -lsframe for it to work correctly. Trying to use the same option with binutils-devel-2.39-150100.7.40.1 causes sframe to not be found.
configure:35956: ./libtool --tag=CC --mode=link $CC $CFLAGS $CPPFLAGS $LTLDFLAGS -o libconftest.la -rpath `pwd` libconftest.lo $LTLIBS >&5
libtool: link: mpicc -shared -fPIC -DPIC .libs/libconftest.o -lbfd -liberty -lz -ldl -lsframe -O3 -g -fstack-protector-strong -grecord-gcc-switches -mtune=generic -m64 -Wl,-soname -Wl,libconftest.so.0 -o .libs/libconftest.so.0.0.0
/usr/bin/ld: cannot find -lsframe: No such file or directory
2.41 seems to come from the SUSE Linux Enterprise repositories. I guess those aren't enabled in the OBS builders, which is why the older 2.39 is used instead.
jreuter@localhost:~> sudo zypper info binutils-devel
Loading repository data...
Reading installed packages...
Information for package binutils-devel:
---------------------------------------
Repository : openSUSE-Leap-15.5-1
Name : binutils-devel
Version : 2.39-150100.7.40.1
Arch : x86_64
Vendor : SUSE LLC <https://www.suse.com/>
Installed Size : 50.5 MiB
Installed : Yes
Status : up-to-date
Source package : binutils-2.39-150100.7.40.1.src
Upstream URL : https://www.gnu.org/software/binutils/
Summary : GNU binutils (BFD development files)
Description :
This package includes header files and static libraries necessary to
build programs which use the GNU BFD library, which is part of
binutils.
jreuter@localhost:~> # Manually enabled the SLE15 repo
jreuter@localhost:~> sudo zypper info binutils-devel
Loading repository data...
Reading installed packages...
Information for package binutils-devel:
---------------------------------------
Repository : Update repository with updates from SUSE Linux Enterprise 15
Name : binutils-devel
Version : 2.41-150100.7.46.1
Arch : x86_64
Vendor : SUSE LLC <https://www.suse.com/>
Installed Size : 52.9 MiB
Installed : Yes
Status : out-of-date (version 2.39-150100.7.40.1 installed)
Source package : binutils-2.41-150100.7.46.1.src
Upstream URL : https://www.gnu.org/software/binutils/
Summary : GNU binutils (BFD development files)
Description :
This package includes header files and static libraries necessary to
build programs which use the GNU BFD library, which is part of
binutils.
How should we proceed here? The two versions require different patches to work correctly.
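One conceivable way forward would be to make the extra library conditional on the installed binutils version. The version strings below come from this thread; the decision helper itself is illustrative and not what the spec currently does:

```shell
# Hypothetical version gate: link -lsframe only for binutils-devel >= 2.41,
# which (per this thread) is the version that needs it on Leap.
needs_sframe() {
  # true when $1 sorts at or above 2.41 under GNU version ordering
  [ "$(printf '%s\n2.41\n' "$1" | sort -V | head -n 1)" = "2.41" ]
}
needs_sframe "2.41-150100.7.46.1" && echo "-lbfd -lsframe" || echo "-lbfd"
needs_sframe "2.39-150100.7.40.1" && echo "-lbfd -lsframe" || echo "-lbfd"
```

This kind of gate only helps at build time, of course; it does not address a package built against one version being installed alongside the other, which is the downgrade scenario tested below.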
OpenSUSE Leap 15.6 provides the new version. Leap 15.5 only has the old one in the OSS repositories:
f85b81678f65:/ # cat /etc/os-release | head -n 2
NAME="openSUSE Leap"
VERSION="15.5"
f85b81678f65:/ # zypper search -s binutils-devel
Loading repository data...
Reading installed packages...
S | Name | Type | Version | Arch | Repository
--+------------------------+---------+--------------------+--------+-------------------------------------------------------------
| binutils-devel | package | 2.41-150100.7.46.1 | x86_64 | Update repository with updates from SUSE Linux Enterprise 15
| binutils-devel | package | 2.39-150100.7.43.2 | x86_64 | Update repository with updates from SUSE Linux Enterprise 15
| binutils-devel | package | 2.39-150100.7.40.1 | x86_64 | Main Repository
| binutils-devel-32bit | package | 2.41-150100.7.46.1 | x86_64 | Update repository with updates from SUSE Linux Enterprise 15
| binutils-devel-32bit | package | 2.39-150100.7.43.2 | x86_64 | Update repository with updates from SUSE Linux Enterprise 15
| binutils-devel-32bit | package | 2.39-150100.7.40.1 | x86_64 | Main Repository
| mingw64-binutils-devel | package | 2.32-bp155.2.11 | noarch | Main Repository
jreuter@localhost:~> cat /etc/os-release | head -n 2
NAME="openSUSE Leap"
VERSION="15.6"
jreuter@localhost:~> sudo zypper search -s binutils-devel
Loading repository data...
Reading installed packages...
S | Name | Type | Version | Arch | Repository
---+------------------------+---------+--------------------+--------+---------------------
i+ | binutils-devel | package | 2.41-150100.7.46.1 | x86_64 | openSUSE-Leap-15.6-1
i+ | binutils-devel | package | 2.41-150100.7.46.1 | x86_64 | Main Repository
| binutils-devel-32bit | package | 2.41-150100.7.46.1 | x86_64 | openSUSE-Leap-15.6-1
| binutils-devel-32bit | package | 2.41-150100.7.46.1 | x86_64 | Main Repository
| mingw64-binutils-devel | package | 2.32-bp156.3.1 | noarch | openSUSE-Leap-15.6-1
| mingw64-binutils-devel | package | 2.32-bp156.3.1 | noarch | Main Repository
I will check what happens if a package is built with the new / old binutils-devel and installed on a system with the old / new version.
Update: Built with 2.41, then force-downgraded to 2.39 (zypper will complain because of the dependency on 2.41):
f85b81678f65:/ohpc # scorep-gcc test.c
/usr/bin/ld: cannot find -lsframe: No such file or directory
collect2: error: ld returned 1 exit status
[Score-P] ERROR: Execution failed: gcc .scorep_init.o /opt/ohpc/pub/libs/gnu13/mpich/scorep/8.4/lib/scorep/scorep_compiler_gcc_plugin_begin.o test_1720424246_625173.o /opt/ohpc/pub/libs/gnu13/mpich/scorep/8.4/lib/scorep/scorep_compiler_gcc_plugin_end.o `/opt/ohpc/pub/libs/gnu13/mpich/scorep/8.4/bin/scorep-config --thread=none --mpp=none --io=none --nocuda --noopencl --noopenacc --nomemory --nokokkos --nohip --constructor` `/opt/ohpc/pub/libs/gnu13/mpich/scorep/8.4/bin/scorep-config --thread=none --mpp=none --io=none --nocuda --noopencl --noopenacc --nomemory --nokokkos --nohip --ldflags` -Wl,-start-group `/opt/ohpc/pub/libs/gnu13/mpich/scorep/8.4/bin/scorep-config --thread=none --mpp=none --io=none --nocuda --noopencl --noopenacc --nomemory --nokokkos --nohip --event-libs` `/opt/ohpc/pub/libs/gnu13/mpich/scorep/8.4/bin/scorep-config --thread=none --mpp=none --io=none --nocuda --noopencl --noopenacc --nomemory --nokokkos --nohip --mgmt-libs` -Wl,-end-group
Building with 2.39 and then upgrading to 2.41 seems to work when just doing a basic test, but I cannot guarantee that this will not break for more complex examples. The build will certainly break once OBS updates to 2.41. It will also break the CI, as 2.41 is installed there...
I opened an issue in the OpenSUSE bugzilla a month ago. If this would be fixed in the package itself, things would be much easier: https://bugzilla.suse.com/show_bug.cgi?id=1225824
@Thyre I was able to add the "Update repository with updates from SUSE Linux Enterprise 15" repository as an external repository to the build system. I think we are all good now. No need to find workarounds.
Heya,
this is the second part of updating our perftools developed at JSC to the newest versions. This PR changes the installation of Score-P and Scalasca a bit, moving from Score-P as the central module to using the modules provided in #1983. This allows cleaner installations of these modules, and lets other tools use OTF2 & OPARI2, for example.
Things to note:
We should add make installcheck to Score-P, as this ensures that instrumentation works. This is missing right now. Where should we put it?

Depends on #1983