reframe-hpc / reframe

A powerful Python framework for writing and running portable regression tests and benchmarks for HPC systems.
https://reframe-hpc.readthedocs.org
BSD 3-Clause "New" or "Revised" License

Nodelist not properly logged in perflog #3018

Closed: stefaniereuter closed this issue 12 months ago

stefaniereuter commented 1 year ago

Hi, I found the closed issue #2567 and I think I'm running into the same problem. If I run a run-only regression test on multiple nodes, only the first startup node is logged in my ReFrame perflog. In the issue mentioned above the cause was said to be that 'sacct' wasn't called multiple times, but when running in debug mode ('-v -v -v') ReFrame does appear to call 'sacct' several times during the run, and if I run the same 'sacct' command in a second terminal I get the correct nodelist (see the sketch after the configuration below).
My setup: scheduler: slurm, launcher: mpirun, ReFrame-HPC 4.4.0

 'logging': [
        {
            'level': 'debug',
            'handlers': [
                {
                    'type': 'stream',
                    'name': 'stdout',
                    'level': 'info',
                    'format': '%(message)s'
                },
                {
                    'type': 'file',
                    'level': 'debug',
                    'format': '[%(asctime)s] %(levelname)s: %(check_info)s: %(message)s',   # noqa: E501
                    'append': False
                }
            ],
            'handlers_perflog': [
                {
                    'type': 'httpjson',
                    'url': <url>,
                    'level': 'info'
                },
                {
                    'type': 'filelog',
                    'prefix': '%(check_system)s/%(check_partition)s',
                    'level': 'info',
                    'format': (
                        '%(check_job_completion_time)s|'
                        'reframe %(version)s|'
                        '%(check_info)s|'
                        '%(check_jobid)s|'
                        '%(check_job_nodelist)s|'
                        '%(check_num_tasks)s|'
                        '%(check_num_cpus_per_task)s|'
                        '%(check_num_tasks_per_node)s|'
                        '%(check_num_gpus_per_node)s|'
                        '%(check_perfvalues)s|'
                        '%(check_spack_spec)s|'
                        '%(check_display_name)s|'
                        '%(check_system)s|'
                        '%(check_partition)s|'
                        '%(check_environ)s|'                       
                        '%(check_extra_resources)s|'
                        '%(check_env_vars)s|'
                        '%(check_tags)s'
                    ),
                    'format_perfvars': (
                        '%(check_perf_value)s|'
                        '%(check_perf_unit)s|'
                        '%(check_perf_ref)s|'
                        '%(check_perf_lower_thres)s|'
                        '%(check_perf_upper_thres)s|'
                    ),
                    'append': True
                }
            ]
        }
    ],
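
For reference, the second-terminal check looks roughly like this. It is a minimal sketch only: the exact 'sacct' flags and the small wrapper script are illustrative, not the command ReFrame itself runs internally.

import subprocess
import sys


def sacct_nodelist(jobid):
    # Ask Slurm's accounting database for the node list of the allocation.
    # '-X' restricts output to the allocation itself, '-n' drops the header
    # and '-P' makes the output parsable.
    result = subprocess.run(
        ['sacct', '-j', str(jobid), '-X', '-n', '-P', '-o', 'NodeList'],
        capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


if __name__ == '__main__':
    print(sacct_nodelist(sys.argv[1]))

Running this while the job is in flight returns the correct nodelist here.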
vkarak commented 1 year ago

Could you try printing the node list in a hook that runs right before the performance stage of your test?

@run_before('performance')
def print_job_nodelist(self):
    print(self.job.nodelist)

I'm not sure the problem is in the way we poll Slurm, since we get Slurm's reported nodespec every time we poll.
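
For example, a minimal run-only test with such a hook could look like the following. This is a sketch only; the system/environment selectors, executable, task counts and sanity check are placeholders, not taken from the reporter's test.

import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class NodelistDebugTest(rfm.RunOnlyRegressionTest):
    valid_systems = ['*']
    valid_prog_environs = ['*']
    executable = 'hostname'
    num_tasks = 2
    num_tasks_per_node = 1

    @run_before('performance')
    def print_job_nodelist(self):
        # Diagnostic only: show what the scheduler backend reports
        # just before the performance stage runs.
        print(self.job.nodelist)

    @sanity_function
    def validate(self):
        # Any non-empty output is enough for this diagnostic test.
        return sn.assert_found(r'\S+', self.stdout)

If self.job.nodelist already shows all nodes at this point, the scheduler polling is fine and the truncation would be happening on the logging side.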

vkarak commented 1 year ago

I think the problem is that logging triggers the evaluation of the nodelist too early, before the final node list is populated:

https://github.com/reframe-hpc/reframe/blob/b1c89701d37f9378cf010f9271c7be312cb30e17/reframe/core/schedulers/slurm.py#L83-L87
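
The sketch below is illustrative only (it is not the code behind the link above); it just shows how a lazily evaluated, cached nodelist property can pin a partial value if a log record reads it before the allocation has been fully reported:

class LazyNodelistJob:
    '''Illustrative stand-in for a job whose node list is expanded lazily.'''

    def __init__(self):
        self._nodespec = None    # latest raw spec reported by the scheduler
        self._nodelist = None    # cached, expanded node list

    def update_from_poll(self, nodespec):
        # Called on every poll; early polls may only know the startup node.
        self._nodespec = nodespec

    @property
    def nodelist(self):
        if self._nodelist is None and self._nodespec is not None:
            # Stand-in for real hostlist expansion, which would also have to
            # handle bracketed ranges such as 'node[001-004]'.
            self._nodelist = self._nodespec.split(',')

        return self._nodelist


job = LazyNodelistJob()
job.update_from_poll('node001')            # first poll: only the startup node
print(job.nodelist)                        # ['node001']  <- frozen into the log
job.update_from_poll('node001,node002')    # later poll: full allocation
print(job.nodelist)                        # still ['node001'] due to the caching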