reframe-hpc / reframe

A powerful Python framework for writing and running portable regression tests and benchmarks for HPC systems.
https://reframe-hpc.readthedocs.org
BSD 3-Clause "New" or "Revised" License

Access configuration objects before the setup phase #3323

Open gkaf89 opened 5 days ago

gkaf89 commented 5 days ago

When defining tests, it can be useful to have access to the contents of the configuration file. Consider, for instance, the following site configuration.

site_configuration = { 
    'systems': [
        {
            'name': 'aion',
            'hostnames': [r'aion-[0-9]{4}'],
            'modules_system': 'lmod',
            'partitions': [
                {
                    'name': 'batch',
                    'scheduler': 'slurm',
                    'launcher': 'srun',
                    'access': ['--partition=batch', '--qos=normal'],
                    'max_jobs': 8,
                    'environs': ['builtin', 'foss2023b'],
                    'extras' : {
                        'admissible_omp_num_threads' : [1, 2, 4, 8, 16],
                    },
                },
            ],
        },
    ]
}

We want to configure a test for the performance of some software based on the number of OpenMP threads:

class performance_test(rfm.RunOnlyRegressionTest):
    num_omp_threads = parameter(current_partition.extras['admissible_omp_num_threads'])

As far as I understand, the parameters are expanded before the configuration file is read, and the resulting tests are filtered using the contents of the configuration file. Could we somehow use the contents of the configuration file earlier, for instance by setting a callback in the parameter definition?

vkarak commented 3 days ago

As far as I understand, the parameters are expanded before the configuration file is read, and the resulting tests are filtered using the contents of the configuration file.

Nope, the configuration is the very first thing that is resolved, before any tests are even loaded. You can access the actual partition/environment combinations at the point of the parameter definition, but this currently goes through an internal interface. The plan is to expose this and add examples in the documentation. This is how you can achieve your goal:

from reframe.core.runtime import valid_sysenv_comb

def admissible_omp_num_threads(valid_systems, valid_prog_environs):
    for part, _ in valid_sysenv_comb(valid_systems, valid_prog_environs):
        yield part.extras.get('admissible_omp_num_threads', []), part

class performance_test(rfm.RunOnlyRegressionTest):
    valid_systems = ['...']
    valid_prog_environs = ['...']
    num_omp_threads = parameter(admissible_omp_num_threads(valid_systems, valid_prog_environs), fmt=lambda x: x[0])

    @run_after('init')
    def restrict_valid_systems(self):
        # Each parameter value is a (threads, partition) pair; keep only the
        # partition that this test variant was generated for.
        self.valid_systems = [self.num_omp_threads[1].fullname]

The valid_sysenv_comb function interprets the partition/environment constraints and gives you all the valid combinations for this test. However, since the extras value will likely differ across the valid partitions, you need to store this information and, in a post-init hook, restrict each test variant to its corresponding system partition.

Since this is a recurring pattern, e.g., wanting to parameterise a test over some other system information (such as the number of sockets or GPUs), it's something we would like to expose in an easier way.
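
For illustration only, a generic helper along these lines could be sketched as follows, reusing the internal valid_sysenv_comb interface from the snippet above. The helper name extras_parameter and the num_gpus_per_node extras key are purely hypothetical, not an existing ReFrame API:

from reframe.core.runtime import valid_sysenv_comb

def extras_parameter(valid_systems, valid_prog_environs, key, default=None):
    # Yield one (value, partition) pair per valid partition for an arbitrary
    # 'extras' key, assuming the (partition, environments) iteration shown in
    # the snippet above.
    for part, _ in valid_sysenv_comb(valid_systems, valid_prog_environs):
        yield part.extras.get(key, default), part

# Hypothetical usage, mirroring the restriction pattern above:
#
#   num_gpus = parameter(extras_parameter(valid_systems, valid_prog_environs,
#                                         'num_gpus_per_node', 0),
#                        fmt=lambda x: x[0])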

gkaf89 commented 12 hours ago

Thanks for the pointers!

The need to account for the partition complicates the process, but the valid_sysenv_comb function exposes all the necessary information. I am not sure how the process could be simplified. Here is an example of how I used the interface exposed by valid_sysenv_comb.

1. The system configuration

site_configuration = {
    'general': [
        {
            'use_login_shell': True,
        }
    ],
    'systems': [
        {
            'name': 'aion',
            'descr': 'Aion cluster',
            'hostnames': [r'aion-[0-9]{4}'],
            'modules_system': 'lmod',
            'partitions': [
                {
                    'name': 'batch',
                    'descr': 'Aion batch partition',
                    'scheduler': 'slurm',
                    'launcher': 'srun',
                    'access': ['--partition=batch', '--qos=normal'],
                    'max_jobs':  8,
                    'environs': ['builtin', 'foss2023b'],
                    'extras' : {
                        'sockets_per_node' : 8,
                        'cores_per_socket' : 16,
                        'admissible_setups' : {
                          'omp_num_threads' : [1, 2, 4, 8, 16],
                          'num_nodes' : [1, 2, 4, 8, 16],
                        },
                    },
                },
            ],
        },
        {
            'name': 'iris',
            'descr': 'Iris cluster',
            'hostnames': [r'iris-[0-9]{3}'],
            'modules_system': 'lmod',
            'partitions': [
                {
                    'name': 'batch',
                    'descr': 'Iris batch partition',
                    'scheduler': 'slurm',
                    'launcher': 'srun',
                    'access': ['--partition=batch', '--qos=normal'],
                    'max_jobs':  8,
                    'environs': ['builtin', 'foss2023b'],
                    'extras' : {
                        'sockets_per_node' : 2,
                        'cores_per_socket' : 14,
                        'admissible_setups' : {
                          'omp_num_threads' : [1, 7, 14],
                          'num_nodes' : [1, 2, 4, 8, 16],
                        },
                    },
                },
            ],
        },
     ],
     ...
}

2. The tests

import reframe as rfm
from reframe.core.runtime import valid_sysenv_comb

class PartitionExtraProperty:
  def __init__(self, part, val):
    self.partition = part
    self.value = val

  def __str__(self):
    return f"{self.value}"

def parametrize_system_partition_property(
    valid_systems,
    valid_prog_environs,
    get_system_partition_property
  ):

  partition_extra_properties = []

  for part in valid_sysenv_comb(valid_systems, valid_prog_environs):
    prop = get_system_partition_property(part)
    partition_extra_properties.append( PartitionExtraProperty(part.name, prop) )

  return partition_extra_properties

def expand_partition_property_list( partition_extra_properties_list, reduce_list ):
  # Generator: expand each (partition, list-of-values) pair into one
  # (partition, value) pair per admissible value.
  for partition_extra_property in partition_extra_properties_list:
    partition = partition_extra_property.partition
    value_list = partition_extra_property.value
    reduced_list = reduce_list(value_list)
    for prop in reduced_list:
      yield PartitionExtraProperty(partition, prop)

def get_admissible_omp_num_threads(partition):
  return partition.extras.get('admissible_setups', {}).get('omp_num_threads', [])

def get_admissible_num_nodes(partition):
  return partition.extras.get('admissible_setups', {}).get('num_nodes', [])

class performance_test(rfm.RunOnlyRegressionTest):
  valid_systems = ['*']
  valid_prog_environs = ['+openmp +mpi']

  test_case = parameter()
  test_type = parameter()

  num_nodes = parameter()
  cpus_per_task = parameter()

  partition_num_nodes = parametrize_system_partition_property(
    valid_systems,
    valid_prog_environs,
    get_admissible_num_nodes
  )
  partition_cpus_per_task = parametrize_system_partition_property(
    valid_systems,
    valid_prog_environs,
    get_admissible_omp_num_threads
  )

  @run_after('init')
  def restrict_valid_systems(self):
    valid_partitions = { self.num_nodes.partition } & { self.cpus_per_task.partition }
    self.valid_systems = [ f'*:{partition}' for partition in valid_partitions ]

    self.num_nodes = self.num_nodes.value
    self.cpus_per_task = self.cpus_per_task.value
...

@rfm.simple_test
class problem_size_scaling_test(performance_test):
  test_type = parameter( ['opt', 'dmc', 'vmc'] )
  test_case = parameter( ['W1', 'W5', 'W10', 'W15', 'W20', 'W25', 'W30'] )

  num_nodes = parameter(
    expand_partition_property_list(
      performance_test.partition_num_nodes,
      lambda x : x
    )
  )
  cpus_per_task = parameter(
    expand_partition_property_list(
      performance_test.partition_cpus_per_task,
      lambda x : [max(x)]
    )
  )

@rfm.simple_test
class ompmpi_ratio_test(performance_test):
  test_type = parameter( ['vmc'] )
  test_case = parameter( ['W1', 'W5', 'W10', 'W15', 'W20', 'W25', 'W30'] )

  num_nodes = parameter(
    expand_partition_property_list(
      performance_test.partition_num_nodes,
      lambda x : x
    )
  )
  cpus_per_task = parameter(
    expand_partition_property_list(
      performance_test.partition_cpus_per_task,
      lambda x : x
    )
  )

Notes

For the test parameters, I am abusing the system a bit by resetting the value of each parameter in restrict_valid_systems, so as to remove the information about the partition and keep only the value of interest. I noticed that, with this method, setting the fmt argument of parameter results in errors; it seems that fmt is called both before and after the @run_after('init') hook, so it would have to handle both formats. I chose to create the PartitionExtraProperty class and print the parameter values through its __str__ method instead of handling multiple types in fmt.
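
For completeness, a minimal sketch of the alternative alluded to above, i.e. a fmt callable that tolerates both representations; this is only an illustration of the described workaround, not code tested in this thread:

def fmt_partition_property(x):
  # Before the post-init hook the parameter still holds the
  # PartitionExtraProperty wrapper; afterwards it holds the plain value.
  if isinstance(x, PartitionExtraProperty):
    return str(x.value)
  return str(x)

# Hypothetical usage in a parameter definition:
#   cpus_per_task = parameter(..., fmt=fmt_partition_property)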