Open jhux2 opened 1 month ago
On ascicgpu031, which is running RHEL8, step 8 of the PR-reproducer instructions results in an error:
realpath: missing operand
Try 'realpath --help' for more information.
Using system 'rhel7' based on matching hostname 'ascicgpu031'.
Overriding system to 'rhel7' based on specification in build name 'rhel7_sems-cuda-11.4.2-sems-gnu-10.1.0-sems-openmpi-4.0.5_release_static_Volta70_no-asan_complex_no-fpic_mpi_pt_no-rdc_no-uvm_deprecated-on_no-package-enables'.
Matched environment name 'sems-cuda-11.4.2-sems-gnu-10.1.0-sems-openmpi-4.0.5' in build name 'rhel7_sems-cuda-11.4.2-sems-gnu-10.1.0-sems-openmpi-4.0.5_release_static_Volta70_no-asan_complex_no-fpic_mpi_pt_no-rdc_no-uvm_deprecated-on_no-package-enables'.
Lmod has detected the following error: These module(s) or extension(s) exist but cannot be loaded as requested:
"sems-openmpi/4.0.5-cuda-11.4.2"
Opened a related SEMS ticket: SEMSHELPD-3859.
@jhux2 I wanted to let you know that we are working on this. The error you are running into is exactly what's hitting our PR systems for that specific cuda no-uvm build due to the rhel8 upgrade.
Circling back on this (@jhux2, you may already be aware of this workaround, so this is for anyone else):
Since the ascicgpu machines are rhel8 and we do not have a rhel8 cuda config ready yet, you will need to use SEMS' rhel7 modules to reproduce one of the existing rhel7 configurations on the rhel8 machines. This is the workaround currently implemented for the rhel7 PR tests that run on our rhel8 machines.
The workaround is to make your own copy of /projects/sems/modulefiles/utils/sems-modules-init.sh, add the rhel8 hostname you're using to the bottom of the script, and source that script. This will load the rhel7 SEMS modules on a rhel8 machine to test a rhel7 PR configuration.
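For anyone following along, the copy step can be sketched roughly as below. This is only an illustration: `copy_sems_init` is a hypothetical helper (not part of SEMS or Trilinos), the destination path is an assumption, and the exact line to add at the bottom of the script depends on the script's contents, which this thread does not show.

```shell
#!/usr/bin/env bash
# Hypothetical helper: make a private, editable copy of the SEMS init script.
copy_sems_init() {
    local src="$1"    # official script, e.g. /projects/sems/modulefiles/utils/sems-modules-init.sh
    local dest="$2"   # where to put your own copy (assumed location, pick any writable path)
    cp "$src" "$dest" || return 1
    # The hostname still has to be added by hand; the syntax depends on the script itself.
    echo "Copied to $dest; add $(uname -n) at the bottom, then run: source $dest"
}

# Example usage (run manually on the rhel8 machine):
#   copy_sems_init /projects/sems/modulefiles/utils/sems-modules-init.sh "$HOME/sems-modules-init.sh"
#   ...edit "$HOME/sems-modules-init.sh" and add the hostname...
#   source "$HOME/sems-modules-init.sh"
```

Keeping the copy outside /projects means the official script stays untouched, and the private copy can simply be deleted once the workaround is no longer needed.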
Once we have finalized the rhel8 cuda config, this workaround will no longer be needed and we will be able to use the SEMS rhel8 modules.
I've modified the wikipage with this same comment for now.
UPDATE: The bypass has been removed from the official copy of the sems-modules-init script. This is because most of the Trilinos PR tests (including GPU tests) have been converted to RHEL8 configurations. If you still need to replicate a RHEL7 environment on a RHEL8 machine, please let me know and I can help, as I still have the bypass script.
@achauphan Thanks! I was able to use these instructions to get building on ascicgpu030 again!
@achauphan I'm getting:
+==============================================================================+
| ERROR: The following section(s) in your config-specs.ini file
| do not match any systems listed in
| 'supported-systems.ini':
It really thinks my machine is RHEL8 (which it is).
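The log at the top of this issue shows the system name being picked by hostname match, so it can help to confirm what the OS itself reports. A minimal sketch, assuming the standard /etc/os-release file is present (`rhel_major` is a hypothetical helper name, not something from the Trilinos tooling):

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the major OS version from an os-release style file.
rhel_major() {
    local file="${1:-/etc/os-release}"
    # os-release is a shell-sourceable KEY=value file; source it in a subshell
    # so its variables do not leak into the current shell.
    ( . "$file" && echo "${VERSION_ID%%.*}" )
}

# Example usage:
#   echo "This machine reports rhel$(rhel_major)"
```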
@csiefer2 On which configuration and machine is this happening, and is it happening with the bypass?
@achauphan Machine: one of the ascicgpus. Config: sems-gnu-8.3.0-openmpi-1.10.1
The bypass only generates an error in the shell script, so clearly I'm misunderstanding the instructions somehow.
@csiefer2 I just (this morning, hasn't taken effect in the current AT run) pulled the trigger on migrating those builds (GCC + OpenMPI) to their updated RHEL8 counterparts. We're at the point where only the Intel build should be using the RHEL7 configuration (and we hope to transition that tomorrow morning).
Apologies for the instability this week, we're running up against a pretty rapid deadline and appreciate everybody's understanding!
Question
@trilinos/framework @sebrowne
Many SNL workstations and CEE resources are being upgraded to RHEL8 or RHEL9. How should developers reproduce errors that show up on the dashboard or in PR testing, which uses RHEL7? The genconfig instructions are RHEL7-specific.