reframe-hpc / reframe

A powerful Python framework for writing and running portable regression tests and benchmarks for HPC systems.
https://reframe-hpc.readthedocs.org
BSD 3-Clause "New" or "Revised" License

A failed attempt to do remote_detect doesn't properly clean up the temp workdir. #2754

Closed akesandgren closed 1 year ago

akesandgren commented 1 year ago

In the _remote_detect(part) code, the "with _copy_reframe(prefix) as dirname:" statement doesn't execute the exit part of _copy_reframe if there is an error, so the temp workdir is left behind.

This typically happens in an EasyBuild installation, where "os.path.join(rfm.INSTALL_PREFIX, p)" for p == "bin/" doesn't exist and thus osext.copytree fails.
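
A minimal sketch of how this failure mode can arise, using only the standard library. This is not ReFrame's actual code: the names copy_tree_to_tmp and leftover_dirs are hypothetical, shutil.copytree stands in for osext.copytree, and the layout (a generator-based context manager that creates the directory, copies, yields, then removes) is an assumption based on the description above.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def copy_tree_to_tmp(install_prefix, workdir_parent):
    # Hypothetical stand-in for the helper discussed in this issue.
    dirname = tempfile.mkdtemp(prefix='rfm.', dir=workdir_parent)
    # Raises FileNotFoundError if <install_prefix>/bin/ does not exist.
    shutil.copytree(os.path.join(install_prefix, 'bin/'),
                    os.path.join(dirname, 'bin'))
    yield dirname
    # Cleanup: only reached if the setup above succeeded.
    shutil.rmtree(dirname)


def leftover_dirs(install_prefix):
    """Return the 'rfm.*' directories that survive one use of the helper."""
    with tempfile.TemporaryDirectory() as parent:
        try:
            with copy_tree_to_tmp(install_prefix, parent):
                pass
        except FileNotFoundError:
            pass
        return [d for d in os.listdir(parent) if d.startswith('rfm.')]
```

Calling leftover_dirs() with a prefix that has no bin/ subdirectory should report one leftover 'rfm.*' directory, while a prefix that does contain bin/ should report none.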

vkarak commented 1 year ago

Could you copy-paste the error that you see, or provide a reproducer somehow? It's very strange that the temporary directory is left behind, because the __exit__() method is supposed to always execute, even in case of exceptions.

akesandgren commented 1 year ago

Do an EasyBuild install of ReFrame 4.0.1, turn on remote_detect, see it fail and see the workdir being left behind. Here is what I get:

[akesa@alvis1 hpc2n-reframe-tests]$ ls -lart
total 3
-rw-rw-r--  1 akesa akesa        1555 Aug 16 08:53 LICENSE
drwxrwxr-x  6 akesa akesa      180152 Nov 14 13:17 checks/
-rw-rw-r--  1 akesa akesa         330 Jan 23 08:09 README.md
drwxrwxr-x 10 akesa akesa 10204150484 Jan 23 13:22 ../
drwxrwxr-x  3 akesa akesa       49431 Jan 23 16:20 config/
-rw-rw-r--  1 akesa akesa          33 Jan 23 16:21 .gitignore
drwxrwxr-x  8 akesa akesa      416602 Jan 23 16:21 .git/
drwxrwxr-x  3 akesa akesa        1669 Jan 23 16:33 perflogs/
drwxrwxr-x  3 akesa akesa           0 Jan 23 16:51 stage/
drwxrwxr-x  8 akesa akesa      652852 Jan 23 16:51 ./
drwxrwxr-x  3 akesa akesa        3080 Jan 23 16:51 output/
[akesa@alvis1 hpc2n-reframe-tests]$ reframe -C config/hpc2n+c3se-settings.py -n new_gpu_burn_check -l --system alvis:8xT4
Detecting topology of remote partition 'alvis:8xT4': this may take some time...
WARNING: failed to retrieve remote processor info: [Errno 2] No such file or directory: '/apps/Common/software/ReFrame/4.0.1/lib/python3.6/site-packages/bin/'
[ReFrame Setup]
  version:           4.0.1
  command:           '/apps/Common/software/ReFrame/4.0.1/bin/reframe -C config/hpc2n+c3se-settings.py -n new_gpu_burn_check -l --system alvis:8xT4'
  launched by:       akesa@alvis1
  working directory: '/cephyr/NOBACKUP/users/akesa/reframe/test-runs/hpc2n-reframe-tests'
  settings files:    '<builtin>', 'config/hpc2n+c3se-settings.py'
  check search path: (R) '/cephyr/NOBACKUP/users/akesa/reframe/test-runs/hpc2n-reframe-tests/checks'
  stage directory:   '/cephyr/NOBACKUP/users/akesa/reframe/test-runs/hpc2n-reframe-tests/stage'
  output directory:  '/cephyr/NOBACKUP/users/akesa/reframe/test-runs/hpc2n-reframe-tests/output'
  log files:         '/tmp/rfm-91r1m8eb.log'

[List of matched checks]
- new_gpu_burn_check %precision=single /da553a21
    ^gpu_burn_build ~alvis:8xT4+foss_with_cuda /6eef7363
- new_gpu_burn_check %precision=double /fe992350
    ^gpu_burn_build ~alvis:8xT4+foss_with_cuda /6eef7363
Found 2 check(s)

Log file(s) saved in '/tmp/rfm-91r1m8eb.log'
[akesa@alvis1 hpc2n-reframe-tests]$ ls -l
total 3
drwxrwxr-x 6 akesa akesa 180152 Nov 14 13:17 checks/
drwxrwxr-x 3 akesa akesa  49431 Jan 23 16:20 config/
-rw-rw-r-- 1 akesa akesa   1555 Aug 16 08:53 LICENSE
drwxrwxr-x 3 akesa akesa   3080 Jan 23 16:51 output/
drwxrwxr-x 3 akesa akesa   1669 Jan 23 16:33 perflogs/
-rw-rw-r-- 1 akesa akesa    330 Jan 23 08:09 README.md
drwx------ 2 akesa akesa      0 Jan 24 07:02 rfm.8mdwzeac/
drwxrwxr-x 3 akesa akesa      0 Jan 23 16:51 stage/

Note the still-existing rfm.... workdir left behind by remote_detect.

jack-morrison commented 1 year ago

This typically happens in an EasyBuild installation, where "os.path.join(rfm.INSTALL_PREFIX, p)" for p == "bin/" doesn't exist and thus osext.copytree fails.

This seems to be reproducible for pip installations as well since the directory structure doesn't match.

vkarak commented 1 year ago

Maybe duplicate of #2914.

vkarak commented 1 year ago

Although not so relevant as of #2978, it can still happen if there is a permission error or any other error. The problem is that the exception happens during __enter__, i.e., before entering the with region, so the corresponding __exit__ method of the context manager is not being executed.
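
One way to guard against this, sketched under assumptions: put the fallible setup inside the same try/finally that surrounds the yield, so the cleanup runs whether the failure happens during setup or inside the with body. The names managed_workdir and setup are hypothetical, and this is not necessarily how ReFrame resolved the issue.

```python
import contextlib
import shutil
import tempfile


@contextlib.contextmanager
def managed_workdir(setup, parent=None):
    """Create a temp workdir, populate it with setup(dirname), always remove it.

    `setup` is a hypothetical callable standing in for the copy step;
    it may raise (missing source path, permission error, ...).
    """
    dirname = tempfile.mkdtemp(prefix='rfm.', dir=parent)
    try:
        # A failure here happens before the with body is entered, but the
        # finally clause below still runs because we are inside the try.
        setup(dirname)
        yield dirname
    finally:
        shutil.rmtree(dirname, ignore_errors=True)
```

With this layout, a setup() that raises still propagates its exception to the caller, but the 'rfm.*' directory is removed first.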