trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Framework: how to reproduce errors using genconfig and RHEL8 #13022

Open jhux2 opened 1 month ago

jhux2 commented 1 month ago

Question

@trilinos/framework @sebrowne

Many SNL workstations and CEE resources are being upgraded to RHEL8 or RHEL9. How should developers reproduce errors that show up on the dashboard or in PR testing, both of which use RHEL7? The genconfig instructions are RHEL7-specific.

jhux2 commented 1 month ago

On ascicgpu031, which is running RHEL8, step 8 of the PR-reproducer instructions results in an error:

realpath: missing operand
Try 'realpath --help' for more information.
Using system 'rhel7' based on matching hostname 'ascicgpu031'.
Overriding system to 'rhel7' based on specification in build name 'rhel7_sems-cuda-11.4.2-sems-gnu-10.1.0-sems-openmpi-4.0.5_release_static_Volta70_no-asan_complex_no-fpic_mpi_pt_no-rdc_no-uvm_deprecated-on_no-package-enables'.
Matched environment name 'sems-cuda-11.4.2-sems-gnu-10.1.0-sems-openmpi-4.0.5' in build name 'rhel7_sems-cuda-11.4.2-sems-gnu-10.1.0-sems-openmpi-4.0.5_release_static_Volta70_no-asan_complex_no-fpic_mpi_pt_no-rdc_no-uvm_deprecated-on_no-package-enables'.
Lmod has detected the following error: These module(s) or extension(s) exist but cannot be loaded as requested:
"sems-openmpi/4.0.5-cuda-11.4.2"

jhux2 commented 1 month ago

Opened a related SEMS ticket: SEMSHELPD-3859.

achauphan commented 1 month ago

@jhux2 I wanted to let you know that we are working on this. The error you are running into is exactly what is hitting our PR systems for that specific cuda no-uvm build, due to the rhel8 upgrade.

achauphan commented 1 month ago

Circling back on this (@jhux2, you may already be aware of this workaround, so this is for anyone else):

Since the ascicgpu machines are rhel8 and we do not have a rhel8 cuda config ready yet, reproducing one of the existing rhel7 configurations on the rhel8 machines requires SEMS' rhel7 modules. This is the workaround currently implemented for the rhel7 PR tests that run on our rhel8 machines.

The workaround is to make your own copy of /projects/sems/modulefiles/utils/sems-modules-init.sh, add the rhel8 hostname you're using to the bottom of the script, and source that script. This loads the rhel7 SEMS modules on a rhel8 machine so that you can test a rhel7 PR configuration.
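A rough sketch of the copy step, for anyone who wants to script it. The function name `copy_sems_init` and the destination directory are hypothetical. The exact line to add at the bottom of the script is not shown in this thread, so the edit itself is left as a manual step.

```shell
#!/usr/bin/env bash
# Sketch only: copy_sems_init and the destination directory are hypothetical names.

# Path of the official script, as given in the comment above.
SEMS_INIT=/projects/sems/modulefiles/utils/sems-modules-init.sh

# Copy the init script into a private directory and print the path of the copy.
copy_sems_init() {
    local src="$1" dest_dir="$2"
    if [ ! -r "$src" ]; then
        echo "cannot read $src" >&2
        return 1
    fi
    mkdir -p "$dest_dir" || return 1
    cp "$src" "$dest_dir/" || return 1
    echo "$dest_dir/$(basename "$src")"
}

# Typical use (left commented out so that pasting this block only defines the function):
#   my_copy=$(copy_sems_init "$SEMS_INIT" "$HOME/sems-bypass")
#   "${EDITOR:-vi}" "$my_copy"   # add the output of `hostname` at the bottom, following the script's existing format
#   source "$my_copy"
```

The copy has to be sourced rather than executed, since it sets up modules in the current shell.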

Once we have finalized the rhel8 cuda config, this workaround will no longer be needed and we will be able to use the SEMS rhel8 modules.

I've updated the wiki page with this same comment for now.

UPDATE: The bypass has been removed from the official copy of the sems-modules-init script, because most of the Trilinos PR tests (including the GPU tests) have been converted to RHEL8 configurations. If you still need to replicate a RHEL7 environment on a RHEL8 machine, please let me know and I can help, as I still have the bypass script.

hkthorn commented 1 month ago

@achauphan Thanks! I was able to use these instructions to get building on ascicgpu030 again!

csiefer2 commented 2 weeks ago

@achauphan I'm getting:

+==============================================================================+
|   ERROR:  The following section(s) in your config-specs.ini file
|           do not match any systems listed in
|           'supported-systems.ini':

It really thinks my machine is RHEL8 (which it is).
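For anyone comparing what the machine reports against what a build name asks for, here is a small sketch. It assumes the standard `/etc/os-release` file is present, and that the system is the part of the build name before the first underscore, as in the `rhel7_...` build name quoted earlier in this thread. Both function names are hypothetical and are not part of genconfig.

```shell
#!/usr/bin/env bash
# Sketch only: both helper names are hypothetical, not part of genconfig.

# Print the OS major version from an os-release style file (default: /etc/os-release).
os_major_version() {
    local file="${1:-/etc/os-release}"
    # VERSION_ID looks like VERSION_ID="8.10"; drop the optional quote and the minor part.
    sed -n 's/^VERSION_ID="\{0,1\}\([0-9]*\).*/\1/p' "$file"
}

# Print the system prefix of a build name, i.e. everything before the first underscore.
build_system_prefix() {
    echo "${1%%_*}"
}

# Typical use (left commented out so that pasting this block only defines the functions):
#   echo "machine reports major version: $(os_major_version)"
#   echo "build name asks for: $(build_system_prefix "$BUILD_NAME")"
```

If the two disagree, the configuration is being run on a different OS than the one it was written for, which is the situation the bypass was meant to cover.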

achauphan commented 2 weeks ago

@csiefer2 Which configuration and machine is this happening on, and is it happening with the bypass?

csiefer2 commented 2 weeks ago

@achauphan
Machine: one of the ascicgpus
Config: sems-gnu-8.3.0-openmpi-1.10.1

The bypass only generates an error in the shell script, so clearly I'm misunderstanding the instructions somehow.

sebrowne commented 2 weeks ago

@csiefer2 I just (this morning, hasn't taken effect in the current AT run) pulled the trigger on migrating those builds (GCC + OpenMPI) to their updated RHEL8 counterparts. We're at the point where only the Intel build should be using the RHEL7 configuration (and we hope to transition that tomorrow morning).

Apologies for the instability this week. We're running up against a pretty rapid deadline and appreciate everybody's understanding!