reframe-hpc / reframe

A powerful Python framework for writing and running portable regression tests and benchmarks for HPC systems.
https://reframe-hpc.readthedocs.org
BSD 3-Clause "New" or "Revised" License

Default environment is required #1453

Closed sjpb closed 4 years ago

sjpb commented 4 years ago

At the moment, reframe seems to require a default environment. If I do something like this:

        {
            'name':'openfoam',
            'target_systems': ['mysys:ib-gcc7-openmpi4-ucx'],
            'modules': ['openfoam-org/7-2ceqb4l']
        },
        {
            'name':'openfoam',
            'target_systems': ['mysys:ib-gcc9-openmpi4-ucx'],
            'modules': ['openfoam-org/7-4zgjbg2']
        },

Then I get a message like "environment 'openfoam' not defined for system 'mysys'" (paraphrased, lost the original terminal).

I have to add an empty default environment of the same name for it to work:

        {
            'name':'openfoam'
        },

This works, because I can use the 'environs' values in the partition to restrict tests to valid partition+environ combinations. However, it is an error trap: for example, if I misspell the 'target_systems' above, reframe thinks there is a valid combination because of the default one, and I then get a cryptic error during module loads.

To me that case with no default seems quite reasonable/normal, but maybe it's not.

sjpb commented 4 years ago

If the current behavior is preferred maybe it could be noted in the relevant section in here

vkarak commented 4 years ago

Hi @sjpb, how is your environs configuration parameter defined for each of the partitions above? Also, how many partitions does mysys have, and how did you run reframe? Judging from the error message, it seems that reframe tries to run on a mysys partition that lists the openfoam environment and then can't find a definition for it inside environments.

sjpb commented 4 years ago

I stripped my config down to a minimal example (2 partitions, 2 environments, using pingpong because it's faster to test than openfoam). Here are the systems and environments bits:

    'systems': [
        {
            'name': 'arcus',
            'hostnames': ['eb-login-0'],
            'modules_system': 'lmod',
            'partitions':[
                {
                    'name':'ib-gcc9-openmpi4-ucx',
                    'scheduler': 'slurm',
                    'access': [ '--partition=test'],
                    'launcher':'srun',
                    'environs': ['imb'],
                    'modules': ['gcc/9.2.0-3j3swca', 'openmpi/4.0.3-dxa6sov'],
                    'variables': [
                        ['SLURM_MPI_TYPE', 'pmix_v2'],
                    ]
                },
                {
                    'name':'ib-gcc9-impi2019-mlx',
                    'scheduler': 'slurm',
                    'launcher':'mpirun',
                    'access': [ '--partition=test'],
                    'environs': ['imb'],
                    'modules': ['gcc/9.2.0-3j3swca', 'intel-mpi/2019.8.254-5qpjevf'],
                    'variables': [
                        ['FI_PROVIDER', 'mlx'],
                    ],
                },
            ]
        },
    ],
    'environments': [
        # {
        #     'name': 'imb',      # a non-targeted environment seems to be necessary for reframe to load the config
        # },
        {
            'name': 'imb',
            'target_systems': ['arcus:ib-gcc9-openmpi4-ucx', 'arcus:roce-gcc9-openmpi4-ucx'],
            'modules': ['intel-mpi-benchmarks/2019.6-42qobhq'],
        },
        {
            'name': 'imb',
            'target_systems': ['arcus:ib-gcc9-impi2019-mlx', 'arcus:roce-gcc9-impi2019-mlx'],
            'modules': ['intel-mpi-benchmarks/2019.6-sl772ml'],
        },

    ],

Run and error:

(hpc-tests) [centos@eb-login-0 hpc-tests]$ reframe/bin/reframe -C rfm_config_simple.py -c apps/imb/ --run --performance-report --tag pingpong
reframe/bin/reframe: failed to load configuration: section 'environments' not defined for system 'arcus'

If I uncomment the empty imb environment, then both partitions run as expected:

- arcus:ib-gcc9-openmpi4-ucx
   - imb
      * num_tasks: 2
      * max_bandwidth: 11108.47 Mbytes/sec
      * min_latency: 0.96 t[usec]
- arcus:ib-gcc9-impi2019-mlx
   - imb
      * num_tasks: 2
      * max_bandwidth: 11184.88 Mbytes/sec
      * min_latency: 1.02 t[usec]

vkarak commented 4 years ago

I could reproduce this with even a single environment:

    'systems': [
        {
            'name': 'tresa',
            'hostnames': ['.*'],
            'partitions': [
                {
                    'name': 'default',
                    'scheduler': 'local',
                    'launcher': 'local',
                    'environs': ['builtin'],
                    'container_platforms': [{'type': 'Docker'}],
                    'max_jobs': 8
                }
            ]
        },
    ],
    'environments': [
        {
            'name': 'builtin',
            'cc': 'cc',
            'cxx': '',
            'ftn': '',
            'target_systems': ['tresa:default']
        },
    ],

./bin/reframe -C config/tresa.py -l
./bin/reframe: failed to load configuration: section 'environments' not defined for system 'tresa'

Although I have a suspicion as to why this is happening, this behaviour is not correct. I am marking it as a bug.

teojgo commented 4 years ago

If you also add a bare tresa in the target_systems of the builtin environment, the configuration load succeeds. The problem seems to be at: https://github.com/eth-cscs/reframe/blob/661dbc170cce67f416a7d6923bcc2e941bc26d35/reframe/core/config.py#L342

There, it searches for a bare tresa by full name, cannot find it, and therefore the environments section is not populated.
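
For reference, the workaround described above amounts to listing the bare system name next to the partition-qualified one. The following is only a sketch based on the tresa example from this thread; the site_configuration variable name is the one ReFrame's Python configuration files use, and the partition entry is trimmed to the keys relevant here:

```python
# Workaround sketch: target both the bare system and the partition.
site_configuration = {
    'systems': [
        {
            'name': 'tresa',
            'hostnames': ['.*'],
            'partitions': [
                {
                    'name': 'default',
                    'scheduler': 'local',
                    'launcher': 'local',
                    'environs': ['builtin'],
                }
            ]
        },
    ],
    'environments': [
        {
            'name': 'builtin',
            'cc': 'cc',
            'cxx': '',
            'ftn': '',
            # 'tresa' lets the system-level lookup find this entry;
            # 'tresa:default' keeps the partition-specific targeting.
            'target_systems': ['tresa', 'tresa:default']
        },
    ],
}
```

Note that a bare system name makes the definition eligible for every partition of that system, so this brings back the catch-all definition that the original report wanted to avoid.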

vkarak commented 4 years ago

@teojgo I will get back to you shortly about the logic behind this behaviour. It explains why it works both with a bare tresa and with *, the default for target_systems.

vkarak commented 4 years ago

The logic behind this is as follows. When ReFrame loads the configuration, it calls select_subconfig(current_system) to set itself up for the current system; essentially, it "instantiates" the configuration file for the current system and then validates it. When instantiating the configuration, it tries to find definitions for all the scoped keys inside the current scope, i.e., the current system. Therefore, it can't see what is defined inside a nested scope, such as a specific partition. That's why it works when target_systems is set to tresa in this example or is left at its default. For the same reason, it works if --system=tresa:default is passed.

Whatever we do to fix this, we should be careful with the logic behind it. Later on, ReFrame calls select_subconfig() again for each of the system partitions in order to get the partition-specific definitions. If only that step existed, the problem would be solved, but I don't know whether that solution is feasible.
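
The scoping effect described above can be illustrated with a small standalone sketch. This is not ReFrame's actual select_subconfig() code; the visible_environments helper and its matching rule are assumptions made purely to mimic the behaviour reported in this thread: at a given scope, an entry is visible only if its target_systems names that scope, an enclosing scope, or *.

```python
def visible_environments(environments, scope):
    """Return the names of the environments visible at `scope`.

    `scope` is either 'system' or 'system:partition'. An entry is visible
    if its target_systems contains '*', the scope itself, or (for a
    partition scope) the enclosing system name.
    Hypothetical helper for illustration only, not part of ReFrame.
    """
    candidates = {'*', scope}
    if ':' in scope:
        candidates.add(scope.split(':', 1)[0])

    return [env['name'] for env in environments
            if candidates & set(env.get('target_systems', ['*']))]


environments = [
    {'name': 'builtin', 'cc': 'cc', 'target_systems': ['tresa:default']},
]

# System scope: the partition-qualified entry is not seen, which mirrors
# the "section 'environments' not defined for system" failure.
print(visible_environments(environments, 'tresa'))          # []

# Partition scope (what --system=tresa:default selects): entry is seen.
print(visible_environments(environments, 'tresa:default'))  # ['builtin']
```

In this toy model, resolving environments only at partition scope would make the partition-qualified entry visible, which corresponds to the "if only that step existed" idea above.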

teojgo commented 4 years ago

@vkarak would it make sense to have a default environment attribute in the configuration for a system?

sjpb commented 4 years ago

I'd note that it is preferable (to me!) for a failure to be generated if you have accidentally listed an environment under a system's environs parameter but not actually defined that environment for that system. I'm not sure whether having a "default" environment would break that behaviour and silently run with that default environment definition instead.
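
The kind of strict check described here can be sketched outside ReFrame as a small pre-flight script over the configuration dictionary. The undefined_environs function is hypothetical, and its matching rule (an environment counts as defined for a partition if its target_systems lists *, the system, or the system:partition pair) is an assumption based on the discussion above rather than ReFrame's own validation logic.

```python
def undefined_environs(config):
    """Yield (partition_fullname, environ) pairs that lack a definition.

    Hypothetical pre-flight check, not part of ReFrame.
    """
    for system in config['systems']:
        for part in system['partitions']:
            fullname = f"{system['name']}:{part['name']}"
            accepted = {'*', system['name'], fullname}
            for environ in part.get('environs', []):
                defined = any(
                    env['name'] == environ and
                    accepted & set(env.get('target_systems', ['*']))
                    for env in config['environments']
                )
                if not defined:
                    yield fullname, environ


config = {
    'systems': [{
        'name': 'arcus',
        'partitions': [
            {'name': 'ib-gcc9-openmpi4-ucx', 'environs': ['imb']},
            {'name': 'ib-gcc9-impi2019-mlx', 'environs': ['imb']},
        ],
    }],
    'environments': [
        {'name': 'imb',
         'target_systems': ['arcus:ib-gcc9-openmpi4-ucx']},
        # Deliberate typo ("mlxx") to show a mistargeted definition.
        {'name': 'imb',
         'target_systems': ['arcus:ib-gcc9-impi2019-mlxx']},
    ],
}

for fullname, environ in undefined_environs(config):
    print(f"environment {environ!r} is not defined for {fullname!r}")
```

With no catch-all imb entry present, the misspelt target is reported up front instead of surfacing later as a module load error.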
