reframe-hpc / reframe

A powerful Python framework for writing and running portable regression tests and benchmarks for HPC systems.
https://reframe-hpc.readthedocs.org
BSD 3-Clause "New" or "Revised" License

Jobs fail on First time due to no output file #2518

Closed mtblondie closed 2 years ago

mtblondie commented 2 years ago

I have set up an environment to test for production, and I'm following the tutorial, except that I use my own environment variables. When I run jobs, I get this error:

Reason: sanity error: rfm_HelloTest_job.out: No such file or directory

Attached: rfm-e07l3pjm.log, run-report.txt

All the staging files are there and show no errors, but they haven't been moved to output. The output directories have been created, but they are empty.

[byrnka@lemhi2 reframe]$ ls -latr stage/lemhi2/mc/gnu/HelloTest/*
-rw-rw-r-- 1 byrnka byrnka   331 May 10 11:45 stage/lemhi2/mc/gnu/HelloTest/hello.c
-rw-rw-r-- 1 byrnka byrnka   285 May 10 11:45 stage/lemhi2/mc/gnu/HelloTest/hello.cpp
-rwxrw-r-- 1 byrnka byrnka   223 May 16 14:16 stage/lemhi2/mc/gnu/HelloTest/rfm_HelloTest_build.sh
-rw-rw-r-- 1 byrnka byrnka     0 May 16 14:16 stage/lemhi2/mc/gnu/HelloTest/rfm_HelloTest_build.out
-rw-rw-r-- 1 byrnka byrnka     0 May 16 14:16 stage/lemhi2/mc/gnu/HelloTest/rfm_HelloTest_build.err
-rwxrwxr-x 1 byrnka byrnka 17288 May 16 14:16 stage/lemhi2/mc/gnu/HelloTest/HelloTest
-rwxrw-r-- 1 byrnka byrnka   272 May 16 14:17 stage/lemhi2/mc/gnu/HelloTest/rfm_HelloTest_job.sh
-rw------- 1 byrnka byrnka     0 May 16 14:17 stage/lemhi2/mc/gnu/HelloTest/rfm_HelloTest_job.err
-rw------- 1 byrnka byrnka    14 May 16 14:17 stage/lemhi2/mc/gnu/HelloTest/rfm_HelloTest_job.out
[byrnka@lemhi2 reframe]$ ls -latr stage/lemhi2/mc/intel/HelloTest/*
-rw-rw-r-- 1 byrnka byrnka   331 May 10 11:45 stage/lemhi2/mc/intel/HelloTest/hello.c
-rw-rw-r-- 1 byrnka byrnka   285 May 10 11:45 stage/lemhi2/mc/intel/HelloTest/hello.cpp
-rwxrw-r-- 1 byrnka byrnka   277 May 16 14:16 stage/lemhi2/mc/intel/HelloTest/rfm_HelloTest_build.sh
-rw-rw-r-- 1 byrnka byrnka     0 May 16 14:16 stage/lemhi2/mc/intel/HelloTest/rfm_HelloTest_build.out
-rw-rw-r-- 1 byrnka byrnka     0 May 16 14:16 stage/lemhi2/mc/intel/HelloTest/rfm_HelloTest_build.err
-rwxrwxr-x 1 byrnka byrnka 40320 May 16 14:16 stage/lemhi2/mc/intel/HelloTest/HelloTest
-rwxrw-r-- 1 byrnka byrnka   328 May 16 14:17 stage/lemhi2/mc/intel/HelloTest/rfm_HelloTest_job.sh
-rw------- 1 byrnka byrnka     0 May 16 14:17 stage/lemhi2/mc/intel/HelloTest/rfm_HelloTest_job.err
-rw------- 1 byrnka byrnka    14 May 16 14:17 stage/lemhi2/mc/intel/HelloTest/rfm_HelloTest_job.out
[byrnka@lemhi2 reframe]$ ls -latr stage/lemhi2/mc/gnu_mpi/HelloTest/*
-rw-rw-r-- 1 byrnka byrnka   331 May 10 11:45 stage/lemhi2/mc/gnu_mpi/HelloTest/hello.c
-rw-rw-r-- 1 byrnka byrnka   285 May 10 11:45 stage/lemhi2/mc/gnu_mpi/HelloTest/hello.cpp
-rwxrw-r-- 1 byrnka byrnka   266 May 16 14:16 stage/lemhi2/mc/gnu_mpi/HelloTest/rfm_HelloTest_build.sh
-rw-rw-r-- 1 byrnka byrnka     0 May 16 14:16 stage/lemhi2/mc/gnu_mpi/HelloTest/rfm_HelloTest_build.out
-rw-rw-r-- 1 byrnka byrnka     0 May 16 14:16 stage/lemhi2/mc/gnu_mpi/HelloTest/rfm_HelloTest_build.err
-rwxrwxr-x 1 byrnka byrnka 17288 May 16 14:16 stage/lemhi2/mc/gnu_mpi/HelloTest/HelloTest
-rwxrw-r-- 1 byrnka byrnka   317 May 16 14:17 stage/lemhi2/mc/gnu_mpi/HelloTest/rfm_HelloTest_job.sh
-rw------- 1 byrnka byrnka     0 May 16 14:17 stage/lemhi2/mc/gnu_mpi/HelloTest/rfm_HelloTest_job.err
-rw------- 1 byrnka byrnka    14 May 16 14:17 stage/lemhi2/mc/gnu_mpi/HelloTest/rfm_HelloTest_job.out
[byrnka@lemhi2 reframe]$ ls -latr stage/lemhi2/mc/intel_mpi/HelloTest/*
-rw-rw-r-- 1 byrnka byrnka   331 May 10 11:45 stage/lemhi2/mc/intel_mpi/HelloTest/hello.c
-rw-rw-r-- 1 byrnka byrnka   285 May 10 11:45 stage/lemhi2/mc/intel_mpi/HelloTest/hello.cpp
-rwxrw-r-- 1 byrnka byrnka   336 May 16 14:17 stage/lemhi2/mc/intel_mpi/HelloTest/rfm_HelloTest_build.sh
-rw-rw-r-- 1 byrnka byrnka     0 May 16 14:17 stage/lemhi2/mc/intel_mpi/HelloTest/rfm_HelloTest_build.out
-rw-rw-r-- 1 byrnka byrnka     0 May 16 14:17 stage/lemhi2/mc/intel_mpi/HelloTest/rfm_HelloTest_build.err
-rwxrwxr-x 1 byrnka byrnka 40320 May 16 14:17 stage/lemhi2/mc/intel_mpi/HelloTest/HelloTest
-rwxrw-r-- 1 byrnka byrnka   388 May 16 14:17 stage/lemhi2/mc/intel_mpi/HelloTest/rfm_HelloTest_job.sh
-rw------- 1 byrnka byrnka     0 May 16 14:17 stage/lemhi2/mc/intel_mpi/HelloTest/rfm_HelloTest_job.err
-rw------- 1 byrnka byrnka    14 May 16 14:17 stage/lemhi2/mc/intel_mpi/HelloTest/rfm_HelloTest_job.out

If I change nothing and rerun the exact same job with --dont-restage, it completes successfully.

./bin/reframe -c tutorials/basics/hello/hello1.py -r --dont-restage
[ReFrame Setup]
  version:           3.11.0
  command:           './bin/reframe -c tutorials/basics/hello/hello1.py -r --dont-restage'
  launched by:       byrnka@lemhi2
  working directory: '/home/byrnka/reframe'
  settings file:     '/home/byrnka/reframe/tutorials/config/settings.py'
  check search path: '/home/byrnka/reframe/tutorials/basics/hello/hello1.py'
  stage directory:   '/home/byrnka/reframe/stage'
  output directory:  '/home/byrnka/reframe/output'

[==========] Running 1 check(s)
[==========] Started on Mon May 16 14:29:46 2022

[----------] start processing checks
[ RUN      ] HelloTest @lemhi2:login+builtin
[ RUN      ] HelloTest @lemhi2:login+gnu
[ RUN      ] HelloTest @lemhi2:mc+gnu
[ RUN      ] HelloTest @lemhi2:mc+intel
[ RUN      ] HelloTest @lemhi2:mc+gnu_mpi
[ RUN      ] HelloTest @lemhi2:mc+intel_mpi
[       OK ] (1/6) HelloTest @lemhi2:login+builtin
[       OK ] (2/6) HelloTest @lemhi2:login+gnu
[       OK ] (3/6) HelloTest @lemhi2:mc+gnu
[       OK ] (4/6) HelloTest @lemhi2:mc+intel
[       OK ] (5/6) HelloTest @lemhi2:mc+gnu_mpi
[       OK ] (6/6) HelloTest @lemhi2:mc+intel_mpi
[----------] all spawned checks have finished

[  PASSED  ] Ran 6/6 test case(s) from 1 check(s) (0 failure(s), 0 skipped)
[==========] Finished on Mon May 16 14:30:05 2022
Run report saved in '/home/byrnka/.reframe/reports/run-report.json'
Log file(s) saved in '/tmp/rfm-fhycc_7b.log'

run-report2.txt rfm-fhycc_7b.log

[byrnka@lemhi2 reframe]$ ls -latr stage/lemhi2/mc/gnu/
total 64
drwxrwxr-x 6 byrnka byrnka 96 May 16 14:16 ..
drwxrwxr-x 2 byrnka byrnka  0 May 16 14:30 .
[byrnka@lemhi2 reframe]$ ls -latr stage/lemhi2/mc/gnu_mpi/
total 64
drwxrwxr-x 6 byrnka byrnka 96 May 16 14:16 ..
drwxrwxr-x 2 byrnka byrnka  0 May 16 14:30 .
[byrnka@lemhi2 reframe]$ ls -latr stage/lemhi2/mc/intel
total 64
drwxrwxr-x 6 byrnka byrnka 96 May 16 14:16 ..
drwxrwxr-x 2 byrnka byrnka  0 May 16 14:30 .
[byrnka@lemhi2 reframe]$ ls -latr stage/lemhi2/mc/intel_mpi/
total 64
drwxrwxr-x 6 byrnka byrnka 96 May 16 14:16 ..
drwxrwxr-x 2 byrnka byrnka  0 May 16 14:30 .

[byrnka@lemhi2 reframe]$ ls -latr output/lemhi2/mc/gnu/HelloTest
total 280
drwxrwxr-x 3 byrnka byrnka  27 May 16 14:29 ..
-rw------- 1 byrnka byrnka  14 May 16 14:30 rfm_HelloTest_job.out
-rw------- 1 byrnka byrnka   0 May 16 14:30 rfm_HelloTest_job.err
-rwxrw-r-- 1 byrnka byrnka 272 May 16 14:30 rfm_HelloTest_job.sh
-rw-rw-r-- 1 byrnka byrnka   0 May 16 14:30 rfm_HelloTest_build.out
-rw-rw-r-- 1 byrnka byrnka   0 May 16 14:30 rfm_HelloTest_build.err
drwxrwxr-x 2 byrnka byrnka 238 May 16 14:30 .
-rwxrw-r-- 1 byrnka byrnka 223 May 16 14:30 rfm_HelloTest_build.sh
[byrnka@lemhi2 reframe]$ ls -latr output/lemhi2/mc/gnu_mpi/HelloTest
total 280
drwxrwxr-x 3 byrnka byrnka  27 May 16 14:29 ..
-rw------- 1 byrnka byrnka  14 May 16 14:30 rfm_HelloTest_job.out
-rw------- 1 byrnka byrnka   0 May 16 14:30 rfm_HelloTest_job.err
-rwxrw-r-- 1 byrnka byrnka 317 May 16 14:30 rfm_HelloTest_job.sh
-rw-rw-r-- 1 byrnka byrnka   0 May 16 14:30 rfm_HelloTest_build.out
-rw-rw-r-- 1 byrnka byrnka   0 May 16 14:30 rfm_HelloTest_build.err
drwxrwxr-x 2 byrnka byrnka 238 May 16 14:30 .
-rwxrw-r-- 1 byrnka byrnka 266 May 16 14:30 rfm_HelloTest_build.sh
[byrnka@lemhi2 reframe]$ ls -latr output/lemhi2/mc/intel/HelloTest
total 280
drwxrwxr-x 3 byrnka byrnka  27 May 16 14:29 ..
-rw------- 1 byrnka byrnka  14 May 16 14:30 rfm_HelloTest_job.out
-rw------- 1 byrnka byrnka   0 May 16 14:30 rfm_HelloTest_job.err
-rwxrw-r-- 1 byrnka byrnka 328 May 16 14:30 rfm_HelloTest_job.sh
-rw-rw-r-- 1 byrnka byrnka   0 May 16 14:30 rfm_HelloTest_build.out
-rw-rw-r-- 1 byrnka byrnka   0 May 16 14:30 rfm_HelloTest_build.err
drwxrwxr-x 2 byrnka byrnka 238 May 16 14:30 .
-rwxrw-r-- 1 byrnka byrnka 277 May 16 14:30 rfm_HelloTest_build.sh
[byrnka@lemhi2 reframe]$ ls -latr output/lemhi2/mc/intel_mpi/HelloTest
total 280
drwxrwxr-x 3 byrnka byrnka  27 May 16 14:29 ..
-rw------- 1 byrnka byrnka  14 May 16 14:30 rfm_HelloTest_job.out
-rw------- 1 byrnka byrnka   0 May 16 14:30 rfm_HelloTest_job.err
-rwxrw-r-- 1 byrnka byrnka 388 May 16 14:30 rfm_HelloTest_job.sh
-rw-rw-r-- 1 byrnka byrnka   0 May 16 14:30 rfm_HelloTest_build.out
-rw-rw-r-- 1 byrnka byrnka   0 May 16 14:30 rfm_HelloTest_build.err
drwxrwxr-x 2 byrnka byrnka 238 May 16 14:30 .
-rwxrw-r-- 1 byrnka byrnka 336 May 16 14:30 rfm_HelloTest_build.sh

This occurs on all tests.

ekouts commented 2 years ago

The local jobs finish successfully, so it looks like a bug with the PBS scheduler. ReFrame considers the jobs finished before the output file has been created. Could you please rerun with -vvv so that we can get more information?
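To illustrate the kind of race described here, the following is a minimal sketch of polling for a job's output file before inspecting it. It is not ReFrame's code and not the fix from the PR; the function name `wait_for_file` and its `timeout` and `interval` parameters are hypothetical and chosen only for illustration.

```python
import os
import time


def wait_for_file(path, timeout=10.0, interval=0.1):
    """Poll until `path` exists or `timeout` seconds have elapsed.

    Hypothetical helper: returns True if the file appeared in time,
    False otherwise. It only checks existence, not that the file is
    completely written.
    """
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller that has just been told by the scheduler that a job is done could call `wait_for_file('rfm_HelloTest_job.out')` before reading the file. On a shared filesystem the output can show up slightly later than the scheduler's status change, which is why a bounded wait is used here rather than a single existence check.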

mtblondie commented 2 years ago

Absolutely. Here are the tmp log and report output files: run-report3.txt, rfm-17f4z_7m.txt

ekouts commented 2 years ago

Thanks @mtblondie, it looks like it was indeed a bug in the scheduler. I made a PR that I think will fix it. Please give it a try and let us know if the issue persists.

mtblondie commented 2 years ago

I applied the patch and the issue is resolved. Everything ran as expected on the first run: the output was written to staging and then moved to output. The files exist in output and have been removed from staging. Thank you! Do you need anything else from me?
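The stage-then-move behaviour described above can be sketched roughly as follows. This is only an illustration of the observed behaviour, not ReFrame's implementation; the function name `finalize_test` and the `keep` parameter are hypothetical.

```python
import os
import shutil


def finalize_test(stagedir, outputdir, keep):
    """Copy the files named in `keep` from stagedir to outputdir,
    then remove the stage directory.

    Hypothetical helper: returns the list of file names actually copied.
    Names in `keep` that do not exist in stagedir are skipped.
    """
    os.makedirs(outputdir, exist_ok=True)
    copied = []
    for name in keep:
        src = os.path.join(stagedir, name)
        if os.path.isfile(src):
            shutil.copy2(src, os.path.join(outputdir, name))
            copied.append(name)

    # After copying, the stage directory is no longer needed
    shutil.rmtree(stagedir)
    return copied
```

With this model, a test whose job output is missing at copy time would end up with an empty output directory, which matches the symptom in the original report.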

vkarak commented 2 years ago

Perfect, thanks for testing! We don't need anything else from you. The issue will close automatically as soon as we merge the PR.

mtblondie commented 2 years ago

Great! Thanks so much.