reframe-hpc / reframe

A powerful Python framework for writing and running portable regression tests and benchmarks for HPC systems.
https://reframe-hpc.readthedocs.org
BSD 3-Clause "New" or "Revised" License

`srunalloc` launcher makes reframe lose track of the `prerun_cmds` etc. output #3044

Closed (vkarak closed 9 months ago)

vkarak commented 10 months ago

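The problem is that the `srunalloc` launcher passes the test's standard output and error files to the `srun` command as its `--output` and `--error` options. Since these are the same files that collect the output of the whole job script, `srun` overrides everything else the script writes, including the output of `prerun_cmds`. Here's how to reproduce: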

Configuration file (you can add the access options accordingly if needed):

site_configuration = {
    'systems': [
        {
            'name': 'system',
            'hostnames': ['nid0'],
            'partitions': [
                {
                    'name': 'part',
                    'scheduler': 'local',
                    'launcher': 'srunalloc',
                    'environs': ['builtin']
                }
            ]
        }
    ]
}

And the test file:

import reframe as rfm
import reframe.utility.sanity as sn

@rfm.simple_test
class srunalloc_fail_test(rfm.RunOnlyRegressionTest):
    executable = 'hostname'
    prerun_cmds = ['echo hello']
    valid_systems = ['system:part']
    valid_prog_environs = ['*']

    @sanity_function
    def validate(self):
        return sn.assert_found('hello', self.stdout)

Running the test fails as follows:

SUMMARY OF FAILURES
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
FAILURE INFO for srunalloc_fail_test (run: 1/1)
  * Description:
  * System partition: system:part
  * Environment: builtin
  * Stage directory: /home/user/reframe/stage/system/part/builtin/srunalloc_fail_test
  * Node list: nid0001
  * Job type: local (id=83006)
  * Dependencies (conceptual): []
  * Dependencies (actual): []
  * Maintainers: []
  * Failing phase: sanity
  * Rerun with '-n /b359e5de -p builtin --system system:part -r'
  * Reason: sanity error: pattern 'hello' not found in 'rfm_job.out'
--- rfm_job.out (first 10 lines) ---
nid0001
--- rfm_job.out ---
--- rfm_job.err (first 10 lines) ---
--- rfm_job.err ---
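
The missing `hello` is consistent with the launcher reopening `rfm_job.out` after the script has already written to it. The following is a minimal sketch that imitates this with plain Python file handles; it does not invoke Slurm or ReFrame. It assumes the launcher opens the file in truncate mode, and the function name `simulate_output_override` is hypothetical.

```python
import os
import tempfile


def simulate_output_override(workdir):
    """Imitate a job script whose launcher reopens the script's output file.

    Hypothetical helper for illustration only; not part of ReFrame or Slurm.
    """
    out_path = os.path.join(workdir, 'rfm_job.out')

    # The job script's stdout points to rfm_job.out for its whole lifetime
    with open(out_path, 'w') as script_stdout:
        # Stands in for the prerun command `echo hello`
        script_stdout.write('hello\n')
        script_stdout.flush()

        # The launcher receives the same path (as with `--output=rfm_job.out`)
        # and, by assumption, opens it in truncate mode
        with open(out_path, 'w') as launcher_stdout:
            # Stands in for the output of the `hostname` executable
            launcher_stdout.write('nid0001\n')

    with open(out_path) as fp:
        return fp.read()


with tempfile.TemporaryDirectory() as tmpdir:
    contents = simulate_output_override(tmpdir)
    print(repr(contents))  # → 'nid0001\n'
```

Only the launcher's output survives, which matches the `rfm_job.out` contents reported above.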

Removing the --output and --error srun options here solves the issue:

https://github.com/reframe-hpc/reframe/blob/a3366b6c9ab7567df295fc9f30bae13fd5fa7dfc/reframe/core/launchers/mpi.py#L129-L133
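
As a rough illustration of the proposed change, the sketch below builds an `srun` command line with and without the output options. It is not ReFrame's launcher code: `build_srun_command` and its `forward_output_files` parameter are hypothetical names, and only the `--job-name`, `--output` and `--error` options are real `srun` options.

```python
def build_srun_command(job_name, stdout=None, stderr=None,
                       forward_output_files=False):
    """Build an srun command line as a list of arguments.

    Hypothetical helper for illustration; not ReFrame's actual API.
    """
    cmd = ['srun']
    if job_name:
        cmd.append(f'--job-name={job_name}')

    # Forwarding the files makes srun write to the same files as the
    # enclosing job script; leaving them out lets srun inherit the
    # script's own stdout/stderr instead
    if forward_output_files:
        if stdout:
            cmd.append(f'--output={stdout}')
        if stderr:
            cmd.append(f'--error={stderr}')

    return cmd


# Behaviour described in this issue: output files are passed to srun
problematic = build_srun_command('rfm_job', 'rfm_job.out', 'rfm_job.err',
                                 forward_output_files=True)

# Proposed behaviour: srun simply inherits the script's output streams
fixed = build_srun_command('rfm_job', 'rfm_job.out', 'rfm_job.err')

print(' '.join(problematic))
print(' '.join(fixed))
```

With the options dropped, the launched command writes to whatever the job script's output streams already point to, so the `prerun_cmds` output and the executable's output would end up in the same file in order.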