reframe-hpc / reframe

A powerful Python framework for writing and running portable regression tests and benchmarks for HPC systems.
https://reframe-hpc.readthedocs.org
BSD 3-Clause "New" or "Revised" License

Strange behaviour when using conditional dependency #3236

Open paulmelis opened 1 month ago

paulmelis commented 1 month ago

I have a test with an optional dependency, see below. I use the Blender_CompileShaders test to force a one-time action (when the NVIDIA driver has changed) of compiling NVIDIA shaders before rendering, which can take quite some time, and I don't want that time to pollute the actual render results of Blender_RIOW. But I do want to keep track of the precompile time, hence having it as a separate test that gets logged.

@rfm.simple_test
class Blender_CompileShaders(rfm.RunOnlyRegressionTest):

    descr = 'Force Blender CUDA shader compilation'

    valid_systems = ['snellius:gpu_a100', 'snellius:gpu_h100']
    ...

class BlenderTestBase(rfm.RunOnlyRegressionTest):

    descr = 'Blender %s render benchmark' % BLENDER_VERSION

    valid_systems = [
        'snellius:rome', 'snellius:genoa', 'snellius:fat', 'snellius:gpu_a100', 'snellius:gpu_h100', 'snellius:himem_4tb', 'snellius:himem_8tb'
    ]

    ...

def dep_gpu_only(src, dst):
    print(src, dst, dst[0].startswith('gpu_'))
    return dst[0].startswith('gpu_')

@rfm.simple_test
class Blender_RIOW(BlenderTestBase):

    descr = 'Blender render benchmark'

    @run_after('init')
    def inject_dependencies(self):
        self.depends_on('Blender_CompileShaders', how=dep_gpu_only)

    ....

The funky thing here is that the Blender_RIOW test is run on all of our nodes, including non-GPU ones, while the Blender_CompileShaders dependency only makes sense on GPU nodes. Hence the valid_systems = ['snellius:gpu_a100', 'snellius:gpu_h100'] in that class.
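
To make the behaviour of the predicate concrete, here is a standalone sketch that needs no ReFrame install. It assumes `src` and `dst` are `(partition, environment)` tuples, which is the shape the `print()` call shows in the log below. The `genoa` pair and the stricter `dep_same_gpu_partition` variant are illustrative assumptions, not part of the original test suite.

```python
def dep_gpu_only(src, dst):
    # src and dst are (partition, environment) tuples, matching the
    # shape printed in the log of the GPU run.
    return dst[0].startswith('gpu_')


def dep_same_gpu_partition(src, dst):
    # Hypothetical stricter variant (not from the original post): also
    # require that source and target live on the same partition.
    return src[0] == dst[0] and dst[0].startswith('gpu_')


gpu = ('gpu_a100', 'eb-foss')   # pair seen in the log
cpu = ('genoa', 'eb-foss')      # assumed shape for a CPU partition

for src, dst in [(gpu, gpu), (cpu, cpu), (cpu, gpu)]:
    print(src[0], '->', dst[0],
          dep_gpu_only(src, dst), dep_same_gpu_partition(src, dst))
```

Note that `dep_gpu_only` only inspects `dst`, so it would also accept a CPU source paired with a GPU target. Whether such a cross-partition pair is ever passed to it depends on which partitions are selected for the run.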

However, this seems to trip up ReFrame somewhat. When I run the test on a GPU node all is well and I can see the dep_gpu_only() call being made and returning True:

snellius paulm@int4 08:59 ~/reframe-surf$ reframe -C settings_files/settings.py -c production_tests --mode=production --system snellius:gpu_a100 -r -n 'Blender_CompileShaders' -n 'Blender_RIOW'
[ReFrame Setup]
  version:           4.6.1
  command:           '/sw/arch/RHEL8/EB_production/2023/software/ReFrame/4.6.1/bin/reframe -C settings_files/settings.py -c production_tests --mode=production --system snellius:gpu_a100 -r -n Blender_CompileShaders -n Blender_RIOW'
  launched by:       paulm@int4
  working directory: '/gpfs/home4/paulm/reframe-surf'
  settings files:    '<builtin>', 'settings_files/settings.py'
  check search path: (R) '/gpfs/home4/paulm/reframe-surf/production_tests'
  stage directory:   '/scratch-shared/paulm/reframe_output/staging/2024-07-16_08-59-27'
  output directory:  '/home/paulm/.reframe/production/output/2024-07-16_08-59-27'
  log files:         '/gpfs/home4/paulm/reframe-surf/reframe.log', '/gpfs/home4/paulm/reframe-surf/reframe.out'
('gpu_a100', 'eb-foss') ('gpu_a100', 'eb-foss') True
('gpu_a100', 'eb-foss') ('gpu_a100', 'eb-foss') True
[==========] Running 2 check(s)
[==========] Started on Tue Jul 16 08:59:42 2024+0200

[----------] start processing checks
[ RUN      ] Blender_CompileShaders /ed1c9d95 @snellius:gpu_a100+eb-foss
 [       OK ] (1/2) Blender_CompileShaders /ed1c9d95 @snellius:gpu_a100+eb-foss
P: kernel_loading: 0.45999999999999996 s (r:0, l:None, u:None)
[ RUN      ] Blender_RIOW /214f6d42 @snellius:gpu_a100+eb-foss
[       OK ] (2/2) Blender_RIOW /214f6d42 @snellius:gpu_a100+eb-foss
P: render: 6.28 s (r:0, l:None, u:None)
P: max_error: 0.00784314 unitless (r:0, l:None, u:None)
[----------] all spawned checks have finished

[  PASSED  ] Ran 2/2 test case(s) from 2 check(s) (0 failure(s), 0 skipped, 0 aborted)
[==========] Finished on Tue Jul 16 09:01:00 2024+0200

===============================================================================================================================================================================
PERFORMANCE REPORT
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[Blender_CompileShaders /ed1c9d95 @snellius:gpu_a100:eb-foss]
  num_tasks_per_node: 1
  num_gpus_per_node: 4
  num_cpus_per_task: 72
  num_tasks: 1
  performance:
    - kernel_loading: 0.45999999999999996 s (r: 0 s l: -inf% u: +inf%)
[Blender_RIOW /214f6d42 @snellius:gpu_a100:eb-foss]
  num_tasks_per_node: 1
  num_gpus_per_node: 4
  num_cpus_per_task: 72
  num_tasks: 1
  performance:
    - render: 6.28 s (r: 0 s l: -inf% u: +inf%)
    - max_error: 0.00784314 unitless (r: 0 unitless l: -inf% u: +inf%)
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Log file(s) saved in '/gpfs/home4/paulm/reframe-surf/reframe.log', '/gpfs/home4/paulm/reframe-surf/reframe.out'

But when I run it on a non-GPU node I get warnings related to dependency resolution, dep_gpu_only() never gets called, and two tests are (incorrectly) skipped:

snellius paulm@int4 09:01 ~/reframe-surf$ reframe -C settings_files/settings.py -c production_tests --mode=production --system snellius:genoa -r -n 'Blender_CompileShaders' -n 'Blender_RIOW'
[ReFrame Setup]
  version:           4.6.1
  command:           '/sw/arch/RHEL8/EB_production/2023/software/ReFrame/4.6.1/bin/reframe -C settings_files/settings.py -c production_tests --mode=production --system snellius:genoa -r -n Blender_CompileShaders -n Blender_RIOW'
  launched by:       paulm@int4
  working directory: '/gpfs/home4/paulm/reframe-surf'
  settings files:    '<builtin>', 'settings_files/settings.py'
  check search path: (R) '/gpfs/home4/paulm/reframe-surf/production_tests'
  stage directory:   '/scratch-shared/paulm/reframe_output/staging/2024-07-16_09-02-09'
  output directory:  '/home/paulm/.reframe/production/output/2024-07-16_09-02-09'
  log files:         '/gpfs/home4/paulm/reframe-surf/reframe.log', '/gpfs/home4/paulm/reframe-surf/reframe.out'

WARNING: could not resolve dependency: ('Blender_RIOW', 'snellius:genoa', 'eb-foss') -> 'Blender_CompileShaders'
WARNING: could not resolve dependency: ('Blender_HoleInTheRoof', 'snellius:genoa', 'eb-foss') -> 'Blender_CompileShaders'
WARNING: skipping all dependent test cases
  - ('Blender_RIOW', 'snellius:genoa', 'eb-foss')
  - ('Blender_HoleInTheRoof', 'snellius:genoa', 'eb-foss')

[==========] Running 0 check(s)
[==========] Started on Tue Jul 16 09:02:27 2024+0200

[----------] start processing checks
[----------] all spawned checks have finished

[  PASSED  ] Ran 0/0 test case(s) from 0 check(s) (0 failure(s), 0 skipped, 0 aborted)
[==========] Finished on Tue Jul 16 09:02:27 2024+0200

Log file(s) saved in '/gpfs/home4/paulm/reframe-surf/reframe.log', '/gpfs/home4/paulm/reframe-surf/reframe.out'

Now I can understand that Blender_CompileShaders gets filtered out due to its valid_systems not including the system I'm running the test on. But why would this cause the self.depends_on() in Blender_RIOW to not call dep_gpu_only() at all? Shouldn't it evaluate that function first, and only when the dependency is needed check if it can be found?

It is also interesting that the output lists a second test case, Blender_HoleInTheRoof. That test is indeed defined, but I did not request it with -n on the command line.

This is with ReFrame 4.6.1.

Edit: some wording

paulmelis commented 1 month ago

I just noticed something: if I run the tests on all defined node types (--system snellius in our case), which include both CPU and GPU nodes, then the dependency warning is not shown and all tests are run. But if I run only on the CPU nodes (--system snellius:genoa, as used above in the second run), then the tests are skipped because the dependency check fails.

teojgo commented 1 month ago

I tried to reproduce your setup above with some mock tests. dep_gpu_only is indeed not reached, because the Blender compilation test is not valid for the system you are trying to run on in the first place. Can you try setting valid_systems to be the same for both tests and see what happens? That should let the test pass ReFrame's filtering so that the hook runs.

paulmelis commented 1 month ago

I can confirm that changing valid_systems in Blender_CompileShaders to list the full range of node types indeed makes the non-GPU tests run without issue. It's slightly suboptimal, as it now makes that base test look like it is needed on the CPU nodes when it is not, but okay.
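
The ordering described in this exchange can be illustrated with a toy model. This is not ReFrame's implementation; it is a minimal sketch that merely reproduces the behaviour reported in this thread, assuming that test cases are first filtered by `valid_systems` and that the `how` predicate is only consulted when at least one case of the target test survived that filter. The `resolve()` and `make_tests()` helpers and the data layout are hypothetical.

```python
ENV = 'eb-foss'
GPU_PARTS = {'gpu_a100', 'gpu_h100'}
ALL_PARTS = GPU_PARTS | {'genoa'}


def dep_gpu_only(src, dst):
    return dst[0].startswith('gpu_')


def make_tests(compile_valid_systems):
    # Hypothetical stand-in for the two test classes in the issue.
    return {
        'Blender_CompileShaders': {'valid_systems': compile_valid_systems,
                                   'depends_on': None},
        'Blender_RIOW': {'valid_systems': ALL_PARTS,
                         'depends_on': 'Blender_CompileShaders'},
    }


def resolve(tests, selected_parts, how):
    # Step 1: drop every (test, partition) case that valid_systems excludes.
    cases = [(name, part)
             for name, spec in tests.items()
             for part in selected_parts
             if part in spec['valid_systems']]
    runnable, skipped, how_calls = [], [], 0
    for name, part in cases:
        target = tests[name]['depends_on']
        targets = [c for c in cases if c[0] == target]
        if target is not None and not targets:
            # Step 2a: the target was filtered out entirely, so the
            # dependent case is skipped and `how` is never consulted.
            skipped.append((name, part))
            continue
        # Step 2b: the target exists, so `how` decides on each edge.
        for _, target_part in targets:
            how_calls += 1
            how((part, ENV), (target_part, ENV))
        runnable.append((name, part))
    return {'runnable': runnable, 'skipped': skipped, 'how_calls': how_calls}


cpu_only = resolve(make_tests(GPU_PARTS), ['genoa'], dep_gpu_only)
widened = resolve(make_tests(ALL_PARTS), ['genoa'], dep_gpu_only)
mixed = resolve(make_tests(GPU_PARTS), ['genoa', 'gpu_a100'], dep_gpu_only)
print(cpu_only, widened, mixed, sep='\n')
```

In this model, `cpu_only` skips Blender_RIOW without ever calling the predicate, `widened` runs both tests on `genoa` (including the shader compilation that is not really needed there), and `mixed` runs everything because a GPU case of the target is present.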

teojgo commented 1 month ago

@paulmelis yeah, I don't know if there's a way to get around that, since for the hook to be reached the test must not be filtered out completely.

vkarak commented 1 month ago

This is generally a limitation of raw dependencies compared to fixtures. Fixtures "inherit" the valid_systems and valid_prog_environs of their parent based on their scope, so you don't run into the trouble of properly hand-crafting the valid_systems in your dependencies. My suggestion may require a bit more work, but I think it will pay off in the future:

  1. Avoid hard-coding the system names in valid_systems, and similarly the environment names in valid_prog_environs. Instead, for each system partition and environment in your config, define a set of features and extras (see the configuration reference), which you can then use as constraints in your valid_systems and valid_prog_environs.
  2. Make the Blender_CompileShaders a fixture of the Blender_RIOW using the environment scope.
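
As a rough illustration of the first point, the fragment below sketches how partitions might declare features in a settings file. It is an incomplete fragment: a real partition entry also needs keys such as the scheduler, launcher and environments, which are omitted here. The feature names `gpu` and `cpu` and the `partitions_with()` helper are assumptions made for this example and are not part of ReFrame.

```python
# Fragment of a ReFrame-style settings file; only the keys relevant to
# features are shown. The feature names are hypothetical.
site_configuration = {
    'systems': [
        {
            'name': 'snellius',
            'partitions': [
                {'name': 'genoa', 'features': ['cpu']},
                {'name': 'gpu_a100', 'features': ['gpu']},
                {'name': 'gpu_h100', 'features': ['gpu']},
            ],
        },
    ],
}


def partitions_with(feature, config=site_configuration):
    # Illustration only: list the 'system:partition' names that declare
    # the given feature. ReFrame does this matching internally.
    return [f"{system['name']}:{part['name']}"
            for system in config['systems']
            for part in system['partitions']
            if feature in part.get('features', [])]


print(partitions_with('gpu'))
```

With features like these in place, a test could select GPU partitions with a constraint such as `valid_systems = ['+gpu']` instead of listing partition names, and the second point would amount to declaring something like `fixture(Blender_CompileShaders, scope='environment')` in the body of Blender_RIOW. Both are sketches under the assumptions above rather than a tested configuration for this site.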