Closed akesandgren closed 1 year ago
Could you copy-paste the error that you see, or provide a reproducer somehow? It's very strange that the temporary directory is left behind, because the __exit__() method is supposed to always execute, even in case of exceptions.
Do an EasyBuild install of ReFrame 4.0.1, turn on remote_detect, see it fail and see the workdir being left behind. Here is what I get:
[akesa@alvis1 hpc2n-reframe-tests]$ ls -lart
total 3
-rw-rw-r-- 1 akesa akesa 1555 Aug 16 08:53 LICENSE
drwxrwxr-x 6 akesa akesa 180152 Nov 14 13:17 checks/
-rw-rw-r-- 1 akesa akesa 330 Jan 23 08:09 README.md
drwxrwxr-x 10 akesa akesa 10204150484 Jan 23 13:22 ../
drwxrwxr-x 3 akesa akesa 49431 Jan 23 16:20 config/
-rw-rw-r-- 1 akesa akesa 33 Jan 23 16:21 .gitignore
drwxrwxr-x 8 akesa akesa 416602 Jan 23 16:21 .git/
drwxrwxr-x 3 akesa akesa 1669 Jan 23 16:33 perflogs/
drwxrwxr-x 3 akesa akesa 0 Jan 23 16:51 stage/
drwxrwxr-x 8 akesa akesa 652852 Jan 23 16:51 ./
drwxrwxr-x 3 akesa akesa 3080 Jan 23 16:51 output/
[akesa@alvis1 hpc2n-reframe-tests]$ reframe -C config/hpc2n+c3se-settings.py -n new_gpu_burn_check -l --system alvis:8xT4
Detecting topology of remote partition 'alvis:8xT4': this may take some time...
WARNING: failed to retrieve remote processor info: [Errno 2] No such file or directory: '/apps/Common/software/ReFrame/4.0.1/lib/python3.6/site-packages/bin/'
[ReFrame Setup]
version: 4.0.1
command: '/apps/Common/software/ReFrame/4.0.1/bin/reframe -C config/hpc2n+c3se-settings.py -n new_gpu_burn_check -l --system alvis:8xT4'
launched by: akesa@alvis1
working directory: '/cephyr/NOBACKUP/users/akesa/reframe/test-runs/hpc2n-reframe-tests'
settings files: '<builtin>', 'config/hpc2n+c3se-settings.py'
check search path: (R) '/cephyr/NOBACKUP/users/akesa/reframe/test-runs/hpc2n-reframe-tests/checks'
stage directory: '/cephyr/NOBACKUP/users/akesa/reframe/test-runs/hpc2n-reframe-tests/stage'
output directory: '/cephyr/NOBACKUP/users/akesa/reframe/test-runs/hpc2n-reframe-tests/output'
log files: '/tmp/rfm-91r1m8eb.log'
[List of matched checks]
- new_gpu_burn_check %precision=single /da553a21
^gpu_burn_build ~alvis:8xT4+foss_with_cuda /6eef7363
- new_gpu_burn_check %precision=double /fe992350
^gpu_burn_build ~alvis:8xT4+foss_with_cuda /6eef7363
Found 2 check(s)
Log file(s) saved in '/tmp/rfm-91r1m8eb.log'
[akesa@alvis1 hpc2n-reframe-tests]$ ls -l
total 3
drwxrwxr-x 6 akesa akesa 180152 Nov 14 13:17 checks/
drwxrwxr-x 3 akesa akesa 49431 Jan 23 16:20 config/
-rw-rw-r-- 1 akesa akesa 1555 Aug 16 08:53 LICENSE
drwxrwxr-x 3 akesa akesa 3080 Jan 23 16:51 output/
drwxrwxr-x 3 akesa akesa 1669 Jan 23 16:33 perflogs/
-rw-rw-r-- 1 akesa akesa 330 Jan 23 08:09 README.md
drwx------ 2 akesa akesa 0 Jan 24 07:02 rfm.8mdwzeac/
drwxrwxr-x 3 akesa akesa 0 Jan 23 16:51 stage/
Note the rfm.... workdir left behind by remote_detect.
This seems to be reproducible for pip installations as well, since the directory structure doesn't match.
Maybe a duplicate of #2914.
Although this is less relevant as of #2978, it can still happen if there is a permission error or any other error. The problem is that the exception happens during __enter__, i.e., before entering the with region, so the corresponding __exit__ method of the context manager is never executed.
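To illustrate the point about __enter__, here is a minimal, self-contained sketch. It is not ReFrame code; the Resource class is hypothetical and only mimics a setup step that fails before the with body starts.

```python
class Resource:
    """Toy context manager whose setup step fails (hypothetical example)."""

    def __init__(self):
        self.exit_called = False

    def __enter__(self):
        # Simulates a failure during setup, e.g. copying a missing directory
        raise FileNotFoundError('setup failed')

    def __exit__(self, exc_type, exc_value, traceback):
        # Cleanup would normally go here
        self.exit_called = True
        return False


res = Resource()
try:
    with res:
        pass  # never reached, because __enter__ raised
except FileNotFoundError:
    pass

print(res.exit_called)  # False: __exit__ never ran
```

Anything that __enter__ created before it raised has to be cleaned up by __enter__ itself, since the with statement will not call __exit__ for it.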
In the _remote_detect(part) code, the "with _copy_reframe(prefix) as dirname:" statement doesn't execute the exit part of _copy_reframe if there is an error, and thus the temp workdir is left behind.
This typically happens in an EasyBuild installation where "os.path.join(rfm.INSTALL_PREFIX, p)" for p == "bin/" doesn't exist and thus osext.copytree fails.
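One possible way to avoid the leftover directory is to put the copy step inside the same try/finally that removes the temp directory. The sketch below is an assumption about how this could look, not the actual ReFrame implementation: copy_into_tempdir, src_dirs and base_dir are hypothetical names, and the standard library's shutil.copytree stands in for osext.copytree.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def copy_into_tempdir(src_dirs, base_dir='.'):
    """Copy src_dirs into a fresh temp directory and always remove it."""
    workdir = tempfile.mkdtemp(prefix='rfm.', dir=base_dir)
    try:
        # The copy is inside the try block, so a failure here is cleaned up too
        for src in src_dirs:
            name = os.path.basename(os.path.normpath(src))
            shutil.copytree(src, os.path.join(workdir, name))

        yield workdir
    finally:
        shutil.rmtree(workdir, ignore_errors=True)


# Usage: a missing source directory makes the copy fail during setup
base = tempfile.mkdtemp()
try:
    with copy_into_tempdir([os.path.join(base, 'missing-bin')], base_dir=base):
        pass  # never reached
except FileNotFoundError:
    pass

leftover = os.listdir(base)
print(leftover)  # []: the temp workdir was removed despite the failure
shutil.rmtree(base)
```

Because the generator raises before reaching yield, the exception still propagates to the caller, but the finally clause runs first and removes the temp directory.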