reframe-hpc / reframe

A powerful Python framework for writing and running portable regression tests and benchmarks for HPC systems.
https://reframe-hpc.readthedocs.org
BSD 3-Clause "New" or "Revised" License

get error when add "resources" in the configuration file #3116

Closed jigo3635 closed 6 months ago

jigo3635 commented 6 months ago

Hi,

The Slurm flag --gpus-per-node should be added to a job script to request GPU resources:

#SBATCH --gpus-per-node=8

I tried to add the flag as resources according to

https://reframe-hpc.readthedocs.io/en/v2.20/configure.html

'resources': {
     'gpu': ['--gpus-per-node={num_gpus_per_node}']
}

but I get this error:

ERROR: failed to load configuration: could not validate configuration files: '['', './reframe/cluster_settings.py']': 'name' is a required property

If I add

'resources': {
    'name': 'gpu_resource',
     'gpu': ['--gpus-per-node={num_gpus_per_node}']
}

I get another error:

ERROR: failed to load configuration: could not validate configuration files: '['', './reframe/cluster_settings.py']': Additional properties are not allowed ('gpu' was unexpected)

How can I add a Slurm flag to one partition in the configuration file? Thanks.

vkarak commented 6 months ago

I see you're using quite an old version. This is a bug in the docs: resources is a list of objects, so you should write it as:

'resources': [{
    'name': 'gpu_resource',
     'gpu': ['--gpus-per-node={num_gpus_per_node}']
}]

jigo3635 commented 6 months ago

Hi @vkarak,

I did

'resources': [{
                          'name': 'gpu_resource',
                          'gpu': ['--gpus-per-node={num_gpus_per_node}']
                    }]

but I still got an error:

ERROR: failed to load configuration: could not validate configuration files: '['', './reframe/cluster_settings.py']': Additional properties are not allowed ('gpu' was unexpected)

Failed validating 'additionalProperties' in schema['properties']['systems']['items']['properties']['partitions']['items']['properties']['resources']['items']: {'additionalProperties': False, 'properties': {'name': {'type': 'string'}, 'options': {'items': {'type': 'string'}, 'type': 'array'}}, 'required': ['name'], 'type': 'object'}

On instance['systems'][0]['partitions'][1]['resources'][0]: {'gpu': ['--gpus-per-node={num_gpus_per_node}'], 'name': 'gpu_resource'}

Is site_configuration -> systems -> partitions the correct place for it? Thanks.
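
The schema fragment quoted in the error accepts only `name` (required) and `options` (a list of strings) in each `resources` entry. A minimal sketch of that rule in plain Python follows. `check_resource_entry` is a hypothetical helper written for illustration; it is not part of ReFrame, which validates the configuration against its JSON schema.

```python
# Hypothetical helper mirroring the schema fragment quoted in the error above.
# It is NOT ReFrame code; ReFrame validates against its own JSON schema.
ALLOWED_KEYS = {'name', 'options'}
REQUIRED_KEYS = {'name'}


def check_resource_entry(entry):
    """Return a list of problems found in one 'resources' entry."""
    problems = []
    keys = set(entry)
    for key in sorted(REQUIRED_KEYS - keys):
        problems.append(f"'{key}' is a required property")
    for key in sorted(keys - ALLOWED_KEYS):
        problems.append(f"'{key}' was unexpected")
    if not isinstance(entry.get('name', ''), str):
        problems.append("'name' must be a string")
    options = entry.get('options', [])
    if not (isinstance(options, list) and
            all(isinstance(opt, str) for opt in options)):
        problems.append("'options' must be a list of strings")
    return problems


# The entry from the failing configuration: the flag sits under a 'gpu' key.
rejected = {'name': 'gpu_resource',
            'gpu': ['--gpus-per-node={num_gpus_per_node}']}

# The same flag moved under 'options', which the schema allows.
accepted = {'name': 'gpu',
            'options': ['--gpus-per-node={num_gpus_per_node}']}

print(check_resource_entry(rejected))
print(check_resource_entry(accepted))
```

The first call reports the unexpected `gpu` key, matching the validation message; the second returns an empty list.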

vkarak commented 6 months ago

Which reframe version are you using? Maybe the docs you are looking at are for an older version than the one you are running.

jigo3635 commented 6 months ago

@vkarak,

I just did a "git clone":

../reframe_git/bin/reframe -V
4.6.0-dev.0+e4f29181

vkarak commented 6 months ago

OK, so you're looking at very outdated documentation :-) Check here for how to set it:

https://reframe-hpc.readthedocs.io/en/stable/config_reference.html#custom-job-scheduler-resources

jigo3635 commented 6 months ago

Hi @vkarak,

Thanks for the instructions. There is no error with the configuration file now, but the options do not seem to take effect.

> cat rfm_job.sh 
#!/bin/bash
#SBATCH --job-name="rfm_Tgv32_0e063c7a"
#SBATCH --ntasks=8
#SBATCH --output=rfm_job.out
#SBATCH --error=rfm_job.err
#SBATCH --time=1:0:0
export MPICH_GPU_SUPPORT_ENABLED=1
...

vkarak commented 6 months ago

Do you select them with the extra_resources in your test? Defining them only in the configuration file is not sufficient.

jigo3635 commented 6 months ago

Hi @vkarak,

I added extra_resources to the test with

extra_resources = {
        'name': 'gpu',
        'options': ['--gres=gpu:{num_gpus_per_node}'] }

and then got a TypeError:

TypeError: failed to set field 'extra_resources': '{'name': 'gpu', 'options': ['--gres=gpu:{num_gpus_per_node}']}' is not of type 'Dict[str,Dict[str,object]]'

I also tried extra_resources=[{...}], but that is definitely not a Python dict.
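
The TypeError names the expected type as `Dict[str,Dict[str,object]]`, that is, a dict whose values are themselves dicts. A small shape check in plain Python shows why both attempts are rejected. `is_resource_mapping` is a hypothetical helper for illustration only, not ReFrame's type checker, and the last value is just one shape that satisfies the type.

```python
def is_resource_mapping(value):
    """True if value looks like Dict[str, Dict[str, object]]."""
    return (isinstance(value, dict) and
            all(isinstance(key, str) and
                isinstance(inner, dict) and
                all(isinstance(inner_key, str) for inner_key in inner)
                for key, inner in value.items()))


# The value from the comment above: its values are a str and a list, not dicts.
config_style = {'name': 'gpu',
                'options': ['--gres=gpu:{num_gpus_per_node}']}

# One shape that satisfies the type: resource name -> {placeholder: value}.
test_style = {'gpu': {'num_gpus_per_node': '8'}}

print(is_resource_mapping(config_style))    # False
print(is_resource_mapping([config_style]))  # False (a list, not a dict)
print(is_resource_mapping(test_style))      # True
```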

vkarak commented 6 months ago

There are two parts. First is the configuration where you need to add this to your partition:

'resources': [
    {
        'name': 'gpu',
        'options': ['--gres=gpu:{num_gpus_per_node}']
    }
]

Second is your test where you need to add this:

extra_resources = {'gpu': {'num_gpus_per_node': '8'}}

In your case you're trying to pass the value that belongs in the configuration to the extra_resources of the test, which is why ReFrame complains.
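
To make the relationship between the two parts concrete, here is a simplified sketch of how the placeholder in the configured option could be filled from the test's `extra_resources`. It assumes plain `str.format` substitution and a `#SBATCH` prefix for Slurm; `expand_resources` and `partition_resources` are hypothetical names, and this is not ReFrame's actual implementation.

```python
# Simplified illustration only. The names below are hypothetical and this is
# not ReFrame's actual code.
partition_resources = [
    {'name': 'gpu', 'options': ['--gres=gpu:{num_gpus_per_node}']},
]

# What the test requests: resource name -> values for the placeholders.
extra_resources = {'gpu': {'num_gpus_per_node': '8'}}


def expand_resources(resources, requested):
    """Fill the placeholders of each requested resource's options."""
    options_by_name = {res['name']: res.get('options', []) for res in resources}
    expanded = []
    for name, values in requested.items():
        for option in options_by_name.get(name, []):
            expanded.append(option.format(**values))
    return expanded


for option in expand_resources(partition_resources, extra_resources):
    print(f'#SBATCH {option}')
```

With the values above, the sketch prints `#SBATCH --gres=gpu:8`. Passing an empty dict as the request yields no options at all, which is consistent with the earlier job script that had no GPU flag when the resource was defined only in the configuration.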

jigo3635 commented 6 months ago

Hi @vkarak,

It works fine now. Thanks.