pyushkevich / ashs

Automatic Segmentation of Hippocampal Subfields
http://www.nitrc.org/projects/ashs
GNU General Public License v3.0

Error only when using SLURM (-S) with ASHS 2.0.0: Validity check at end of stage 2 detected missing files #9

Open npavlovikj opened 1 year ago

npavlovikj commented 1 year ago

Hi,

I recently downloaded ASHS 2.0.0 (the release from March 2, 2022). I ran ashs_main.sh with the test data you provide and the UPENN PMC atlas (ashs_atlas_upennpmc_20170810) at our HPC center, which supports Slurm.

When I run ashs_main.sh either serially or with the parallel option (-P), i.e., ashs_main.sh -I sub07 -a $ASHS_ATLAS_UPENN_PMC -g sub07_mprage.nii.gz -f sub07_tse.nii.gz -w sub07_output_slurm or ashs_main.sh -I sub07 -P -a $ASHS_ATLAS_UPENN_PMC -g sub07_mprage.nii.gz -f sub07_tse.nii.gz -w sub07_output_slurm, the test run is successful (please see the generated log here: https://gist.github.com/npavlovikj/9b089f11283ed98dbe1cfddfa6d6a6b2).

However, when I run ashs_main.sh with the Slurm option (-S) on the same dataset, i.e., ashs_main.sh -I sub07 -S -q "--mem=60gb --time=168:00:00" -a $ASHS_ATLAS_UPENN_PMC -g sub07_mprage.nii.gz -f sub07_tse.nii.gz -w sub07_output_slurm_parallel, the run fails (please see the generated log here: https://gist.github.com/npavlovikj/4d22d26e713d7406961ee42540927515) with:

**************** !! ERROR !! *******************
Validity check at end of stage 2 detected 
missing files. 
(multiatlas/tseg_left_train000/greedy_atlas_to_s
ubj_affine.mat and 231 other files).
************************************************

When I submit the job, I can see many ashs_mul* jobs being submitted and running, but ultimately the run fails with the error from above.

Do you know why this error happens only when I use the Slurm option? Also, do you have any suggestions on how to fix it?

Please let me know if you need any additional information.

Thank you, Natasha

pyushkevich commented 1 year ago

Hi Natasha,

When you run on Slurm, there should be a bunch of output files generated in the dump folder of the work directory. Please check those files for errors; perhaps a library is missing on one of the Slurm nodes, or there is an error invoking Slurm in the first place.
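For example, something along these lines (just a rough sketch, not part of ASHS; adjust the work directory path and the filename pattern to whatever you see in dump) should quickly flag any obvious failures:

# Scan the Slurm job logs in the dump folder for common failure messages.
grep -liE "error|command not found|cannot open|segmentation fault" \
    sub07_output_slurm_parallel/dump/*.out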

Paul

npavlovikj commented 1 year ago

Hi Paul,

Thank you so much for your prompt reply!

There are indeed many ashs_stg2*.out files in the dump directory. However, I haven't been able to find any errors in them, and they all look similar (one example can be seen here: https://gist.github.com/npavlovikj/3ab46c75146e539c248b642aeb58797b). The Slurm state and exit code for all of those jobs are Completed and 0, respectively. I also checked the computational resources those jobs used, and they are much lower than the ones I requested.
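For reference, I checked the jobs roughly like this (the job-name filter below is just an example; the format fields are standard sacct options):

# List state, exit code, runtime, and peak memory for the submitted jobs.
sacct --name=ashs_stg2 --format=JobID,JobName%30,State,ExitCode,Elapsed,MaxRSS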

Do you have any other suggestions?

Thank you, Natasha

pyushkevich commented 1 year ago

Hi Natasha,

The nodes seem to be successfully generating the .mat files that the parent script complains about not finding. Can you confirm that they are present in the filesystem?

I've seen something like this on one of our clusters, where files written to NFS from the nodes did not immediately show up on the submission host. Can you try running ASHS by stages (using the -s option, i.e. -s 1, -s 2, etc.) with a sleep command after each stage, to let NFS refresh? Hopefully this will do the trick.
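A minimal sketch of what I have in mind (same arguments as your single invocation; the sleep duration is arbitrary):

# Run each ASHS stage separately, pausing between stages to let NFS refresh.
for stage in 1 2 3 4 5 6 7; do
  ashs_main.sh -I sub07 -s $stage -S -q "--mem=60gb --time=168:00:00" \
    -a $ASHS_ATLAS_UPENN_PMC -g sub07_mprage.nii.gz -f sub07_tse.nii.gz \
    -w sub07_output_slurm_parallel
  sleep 120
done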

Paul

npavlovikj commented 1 year ago

Hi Paul,

Thank you for the suggestion!

Yes, I can verify that the .mat files do exist on the cluster in the listed directories and have content in them. We have a shared file system. I also applied your suggestion and ran ASHS in stages with a sleep command in between, e.g.:

echo "Stage 1"
ashs_main.sh -I sub07 -s 1 -S -q "--mem=60gb --time=168:00:00" -a $ASHS_ATLAS_UPENN_PMC -g sub07_mprage.nii.gz -f sub07_tse.nii.gz -w sub07_output_slurm_parallel
sleep 130

echo "Stage 2"
ashs_main.sh -I sub07 -s 2 -S -q "--mem=60gb --time=168:00:00" -a $ASHS_ATLAS_UPENN_PMC -g sub07_mprage.nii.gz -f sub07_tse.nii.gz -w sub07_output_slurm_parallel
sleep 130

echo "Stage 3"
ashs_main.sh -I sub07 -s 3 -S -q "--mem=60gb --time=168:00:00" -a $ASHS_ATLAS_UPENN_PMC -g sub07_mprage.nii.gz -f sub07_tse.nii.gz -w sub07_output_slurm_parallel
sleep 130

echo "Stage 4"
ashs_main.sh -I sub07 -s 4 -S -q "--mem=60gb --time=168:00:00" -a $ASHS_ATLAS_UPENN_PMC -g sub07_mprage.nii.gz -f sub07_tse.nii.gz -w sub07_output_slurm_parallel
sleep 130

echo "Stage 5"
ashs_main.sh -I sub07 -s 5 -S -q "--mem=60gb --time=168:00:00" -a $ASHS_ATLAS_UPENN_PMC -g sub07_mprage.nii.gz -f sub07_tse.nii.gz -w sub07_output_slurm_parallel
sleep 130

echo "Stage 6"
ashs_main.sh -I sub07 -s 6 -S -q "--mem=60gb --time=168:00:00" -a $ASHS_ATLAS_UPENN_PMC -g sub07_mprage.nii.gz -f sub07_tse.nii.gz -w sub07_output_slurm_parallel
sleep 130

echo "Stage 7"
ashs_main.sh -I sub07 -s 7 -S -q "--mem=60gb --time=168:00:00" -a $ASHS_ATLAS_UPENN_PMC -g sub07_mprage.nii.gz -f sub07_tse.nii.gz -w sub07_output_slurm_parallel

and I keep getting the validity check error for Stage 2, as well as for the subsequent stages:

Stage 1
ashs_main execution log
  timestamp:   Tue Aug  8 15:33:31 CDT 2023
  invocation:  /util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/bin/ashs_main.sh -I sub07 -s 1 -S -q --mem=60gb --time=168:00:00 --partition=devel -a /work/HCC/DATA/mridata-1.0/ashs
/ashs_upennpmc -g /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz -f /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz -w sub0
7_output_slurm_parallel
  directory:   /work/project/npavlovikj/ashs/2.0.0
  environment:
    ASHS_ATLAS_UPENN_PMC=/work/HCC/DATA/mridata-1.0/ashs/ashs_upennpmc
    ASHS_ATLAS_UPENN_PMC_20170810=/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
    ASHS_BIN=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/ext/Linux/bin
    ASHS_MPRAGE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
    ASHS_QSUB_OPTS='--mem=60gb --time=168:00:00 --partition=devel'
    ASHS_ROOT=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0
    ASHS_SUBJID=sub07
    ASHS_TSE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
    ASHS_USE_SLURM=1
    ASHS_USE_SOME_BATCHENV=1
    ASHS_WORK=/lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Atlas    : /lustre/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
T1 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
T2 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
WorkDir  : /lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Using SLURM with options "--mem=60gb --time=168:00:00 --partition=devel"
****************************************
Starting stage 1: Normalization to T1 population template
****************************************
SLURM options for this stage: --mem=60gb --time=168:00:00 --partition=devel

-------------------  INFO  ---------------------
Started stage 1: Normalization to T1 population 
template
------------------------------------------------

Submitted batch job 3388168

-------------------  INFO  ---------------------
Validity check at end of stage 1 successful
------------------------------------------------

Stage 2
ashs_main execution log
  timestamp:   Tue Aug  8 15:38:36 CDT 2023
  invocation:  /util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/bin/ashs_main.sh -I sub07 -s 2 -S -q --mem=60gb --time=168:00:00 --partition=devel -a /work/HCC/DATA/mridata-1.0/ashs
/ashs_upennpmc -g /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz -f /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz -w sub0
7_output_slurm_parallel
  directory:   /work/project/npavlovikj/ashs/2.0.0
  environment:
    ASHS_ATLAS_UPENN_PMC=/work/HCC/DATA/mridata-1.0/ashs/ashs_upennpmc
    ASHS_ATLAS_UPENN_PMC_20170810=/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
    ASHS_BIN=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/ext/Linux/bin
    ASHS_MPRAGE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
    ASHS_QSUB_OPTS='--mem=60gb --time=168:00:00 --partition=devel'
    ASHS_ROOT=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0
    ASHS_SUBJID=sub07
    ASHS_TSE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
    ASHS_USE_SLURM=1
    ASHS_USE_SOME_BATCHENV=1
    ASHS_WORK=/lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Atlas    : /lustre/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
T1 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
T2 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
WorkDir  : /lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Using SLURM with options "--mem=60gb --time=168:00:00 --partition=devel"

-------------------  INFO  ---------------------
Validity check at end of stage 1 successful
------------------------------------------------

****************************************
Starting stage 2: Initial ROI registration to all T2 atlases
****************************************
SLURM options for this stage: --mem=60gb --time=168:00:00 --partition=devel

-------------------  INFO  ---------------------
Started stage 2: Initial ROI registration to 
all T2 atlases
------------------------------------------------

**************** !! ERROR !! *******************
Validity check at end of stage 2 detected 
missing files. 
(multiatlas/tseg_left_train000/greedy_atlas_to_s
ubj_affine.mat and 231 other files).
************************************************

Stage 3
ashs_main execution log
  timestamp:   Tue Aug  8 15:40:54 CDT 2023
  invocation:  /util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/bin/ashs_main.sh -I sub07 -s 3 -S -q --mem=60gb --time=168:00:00 --partition=devel -a /work/HCC/DATA/mridata-1.0/ashs
/ashs_upennpmc -g /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz -f /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz -w sub0
7_output_slurm_parallel
  directory:   /work/project/npavlovikj/ashs/2.0.0
  environment:
    ASHS_ATLAS_UPENN_PMC=/work/HCC/DATA/mridata-1.0/ashs/ashs_upennpmc
    ASHS_ATLAS_UPENN_PMC_20170810=/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
    ASHS_BIN=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/ext/Linux/bin
    ASHS_MPRAGE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
    ASHS_QSUB_OPTS='--mem=60gb --time=168:00:00 --partition=devel'
    ASHS_ROOT=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0
    ASHS_SUBJID=sub07
    ASHS_TSE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
    ASHS_USE_SLURM=1
    ASHS_USE_SOME_BATCHENV=1
    ASHS_WORK=/lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Atlas    : /lustre/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
T1 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
T2 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
WorkDir  : /lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Using SLURM with options "--mem=60gb --time=168:00:00 --partition=devel"

**************** !! ERROR !! *******************
Validity check at end of stage 2 detected 
missing files. 
(multiatlas/tseg_left_train022/greedy_atlas_to_s
ubj_warp.nii.gz and 129 other files).
************************************************

Stage 4
ashs_main execution log
  timestamp:   Tue Aug  8 15:43:08 CDT 2023
  invocation:  /util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/bin/ashs_main.sh -I sub07 -s 4 -S -q --mem=60gb --time=168:00:00 --partition=devel -a /work/HCC/DATA/mridata-1.0/ashs
/ashs_upennpmc -g /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz -f /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz -w sub0
7_output_slurm_parallel
  directory:   /work/project/npavlovikj/ashs/2.0.0
  environment:
    ASHS_ATLAS_UPENN_PMC=/work/HCC/DATA/mridata-1.0/ashs/ashs_upennpmc
    ASHS_ATLAS_UPENN_PMC_20170810=/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
    ASHS_BIN=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/ext/Linux/bin
    ASHS_MPRAGE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
    ASHS_QSUB_OPTS='--mem=60gb --time=168:00:00 --partition=devel'
    ASHS_ROOT=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0
    ASHS_SUBJID=sub07
    ASHS_TSE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
    ASHS_USE_SLURM=1
    ASHS_USE_SOME_BATCHENV=1
    ASHS_WORK=/lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Atlas    : /lustre/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
T1 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
T2 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
WorkDir  : /lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Using SLURM with options "--mem=60gb --time=168:00:00 --partition=devel"

**************** !! ERROR !! *******************
Validity check at end of stage 3 detected 
missing files. 
(multiatlas/tseg_right_train027/greedy_atlas_to_
subj_warp.nii.gz and 11 other files).
************************************************

Stage 5
ashs_main execution log
  timestamp:   Tue Aug  8 15:45:21 CDT 2023
  invocation:  /util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/bin/ashs_main.sh -I sub07 -s 5 -S -q --mem=60gb --time=168:00:00 --partition=devel -a /work/HCC/DATA/mridata-1.0/ashs
/ashs_upennpmc -g /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz -f /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz -w sub0
7_output_slurm_parallel
  directory:   /work/project/npavlovikj/ashs/2.0.0
  environment:
    ASHS_ATLAS_UPENN_PMC=/work/HCC/DATA/mridata-1.0/ashs/ashs_upennpmc
    ASHS_ATLAS_UPENN_PMC_20170810=/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
    ASHS_BIN=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/ext/Linux/bin
    ASHS_MPRAGE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
    ASHS_QSUB_OPTS='--mem=60gb --time=168:00:00 --partition=devel'
    ASHS_ROOT=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0
    ASHS_SUBJID=sub07
    ASHS_TSE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
    ASHS_USE_SLURM=1
    ASHS_USE_SOME_BATCHENV=1
    ASHS_WORK=/lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Atlas    : /lustre/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
T1 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
T2 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
WorkDir  : /lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Using SLURM with options "--mem=60gb --time=168:00:00 --partition=devel"

**************** !! ERROR !! *******************
Validity check at end of stage 4 detected 
missing files. 
(multiatlas/fusion/lfseg_corr_nogray_left.nii.gz
 and 295 other files).
************************************************

Stage 6
ashs_main execution log
  timestamp:   Tue Aug  8 15:47:34 CDT 2023
  invocation:  /util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/bin/ashs_main.sh -I sub07 -s 6 -S -q --mem=60gb --time=168:00:00 --partition=devel -a /work/HCC/DATA/mridata-1.0/ashs
/ashs_upennpmc -g /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz -f /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz -w sub0
7_output_slurm_parallel
  directory:   /work/project/npavlovikj/ashs/2.0.0
  environment:
    ASHS_ATLAS_UPENN_PMC=/work/HCC/DATA/mridata-1.0/ashs/ashs_upennpmc
    ASHS_ATLAS_UPENN_PMC_20170810=/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
    ASHS_BIN=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/ext/Linux/bin
    ASHS_MPRAGE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
    ASHS_QSUB_OPTS='--mem=60gb --time=168:00:00 --partition=devel'
    ASHS_ROOT=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0
    ASHS_SUBJID=sub07
    ASHS_TSE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
    ASHS_USE_SLURM=1
    ASHS_USE_SOME_BATCHENV=1
    ASHS_WORK=/lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Atlas    : /lustre/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
T1 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
T2 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
WorkDir  : /lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Using SLURM with options "--mem=60gb --time=168:00:00 --partition=devel"

**************** !! ERROR !! *******************
Validity check at end of stage 5 detected 
missing files. 
(multiatlas/fusion/lfseg_corr_nogray_left.nii.gz
 and 301 other files).
************************************************

Stage 7
ashs_main execution log
  timestamp:   Tue Aug  8 15:49:48 CDT 2023
  invocation:  /util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/bin/ashs_main.sh -I sub07 -s 7 -S -q --mem=60gb --time=168:00:00 --partition=devel -a /work/HCC/DATA/mridata-1.0/ashs
/ashs_upennpmc -g /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz -f /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz -w sub0
7_output_slurm_parallel
  directory:   /work/project/npavlovikj/ashs/2.0.0
  environment:
    ASHS_ATLAS_UPENN_PMC=/work/HCC/DATA/mridata-1.0/ashs/ashs_upennpmc
    ASHS_ATLAS_UPENN_PMC_20170810=/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
    ASHS_BIN=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/ext/Linux/bin
    ASHS_MPRAGE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
    ASHS_QSUB_OPTS='--mem=60gb --time=168:00:00 --partition=devel'
    ASHS_ROOT=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0
    ASHS_SUBJID=sub07
    ASHS_TSE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
    ASHS_USE_SLURM=1
    ASHS_USE_SOME_BATCHENV=1
    ASHS_WORK=/lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Atlas    : /lustre/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
T1 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
T2 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
WorkDir  : /lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Using SLURM with options "--mem=60gb --time=168:00:00 --partition=devel"

**************** !! ERROR !! *******************
Validity check at end of stage 6 detected 
missing files. 
(multiatlas/fusion/lfseg_corr_nogray_left.nii.gz
 and 307 other files).
************************************************

The files for Stage 2 and Stage 3 that give the Validity check error do exist in the output directory:

[npavlovikj@login2 sub07_output_slurm_parallel]$ ls -la multiatlas/tseg_left_train022/greedy_atlas_to_subj_warp.nii.gz 
-rw-r--r-- 1 npavlovikj project 4211389 Aug  8 15:41 multiatlas/tseg_left_train022/greedy_atlas_to_subj_warp.nii.gz
[npavlovikj@login2 sub07_output_slurm_parallel]$ ls -ls multiatlas/tseg_right_train027/greedy_atlas_to_subj_warp.nii.gz
4465 -rw-r--r-- 1 npavlovikj project 4511258 Aug  8 15:43 multiatlas/tseg_right_train027/greedy_atlas_to_subj_warp.nii.gz
[npavlovikj@login2 sub07_output_slurm_parallel]$ ls -la multiatlas/tseg_left_train000/greedy_atlas_to_subj_affine.mat 
-rw-r--r-- 1 npavlovikj project 127 Aug  8 15:38 multiatlas/tseg_left_train000/greedy_atlas_to_subj_affine.mat

Do you have any other suggestions for me to try?

Thank you, Natasha

pyushkevich commented 1 year ago

It's really strange... Can you try waiting longer after stage 2, maybe 10 minutes? Otherwise, could you add an echo command to the function in the ASHS bin/ashs_lib.sh script that does the validity check, to have it print out the full path of each of the missing files? Maybe there is some other kind of mismatch with the filesystem...
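For illustration only (the variable names inside the validity-check function in ashs_lib.sh may differ; EXPECTED_FILES here is just a placeholder), something like this dropped into the check would show exactly which paths the parent script cannot see:

# Hypothetical debugging snippet: print each expected file and whether the
# submission host can see it at check time.
for fn in $EXPECTED_FILES; do
  if [[ -f "$fn" ]]; then echo "OK      $fn"; else echo "MISSING $fn"; fi
done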

npavlovikj commented 1 year ago

Hi Paul,

I added sleep 900 after Stage 2, but I got the same error pretty soon after the first job for Stage 2 started:

**************** !! ERROR !! *******************
Validity check at end of stage 2 detected 
missing files. 
(multiatlas/tseg_left_train000/greedy_atlas_to_s
ubj_affine.mat and 231 other files).
************************************************

As for adding echo commands in ashs_lib.sh, do you know which variable that is?

I tried echoing MISSFILE, CHK_FILE, MADIR and TDIR, and these are the results I got:

$ echo $MISSFILE
/lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel/.missing
$ echo $CHK_FILE

$ echo $MADIR
/lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel/multiatlas

$ echo $TDIR
tseg_right_train028

All the printed paths and files are valid. The .missing file lists 232 files; I checked a few of them, and they all exist on the file system:

[npavlovikj@login2 sub07_output_slurm_parallel]$ head -n3 .missing 
/lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel/multiatlas/tseg_left_train000/greedy_atlas_to_subj_affine.mat
/lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel/multiatlas/tseg_left_train000/greedy_atlas_to_subj_warp.nii.gz
/lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel/multiatlas/tseg_left_train001/greedy_atlas_to_subj_affine.mat
[npavlovikj@login2 sub07_output_slurm_parallel]$ ls -la /lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel/multiatlas/tseg_left_train000/greedy_atlas_to_subj_affine.mat
-rw-r--r-- 1 npavlovikj project 126 Aug  8 17:02 /lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel/multiatlas/tseg_left_train000/greedy_atlas_to_subj_affine.mat
[npavlovikj@login2 sub07_output_slurm_parallel]$ ls -la /lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel/multiatlas/tseg_left_train000/greedy_atlas_to_subj_warp.nii.gz
-rw-r--r-- 1 npavlovikj project 4339419 Aug  8 17:03 /lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel/multiatlas/tseg_left_train000/greedy_atlas_to_subj_warp.nii.gz
[npavlovikj@login2 sub07_output_slurm_parallel]$ ls -la /lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel/multiatlas/tseg_left_train001/greedy_atlas_to_subj_affine.mat
-rw-r--r-- 1 npavlovikj project 133 Aug  8 17:02 /lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel/multiatlas/tseg_left_train001/greedy_atlas_to_subj_affine.mat

Thank you, Natasha

npavlovikj commented 1 year ago

Hi Paul,

I have been playing with the sleep command a bit, and I figured out that the issue is the timing of when the "missing" files are written: there is some delay before those files appear, possibly because they come from different SLURM jobs that have not finished yet.

I tried adding sleep 900 before https://github.com/pyushkevich/ashs/blob/8d0098aa096dd40e257ea8637a0ca462c8cef0ef/bin/ashs_lib.sh#L2088. While the number of reported missing files was reduced, some were still reported as missing for Stage 3, and the job failed.

Then I increased the sleep to 1800 seconds, and I was able to get a successful run when using SLURM with ASHS 2.0.0.

We do have Lustre on our /work shared file system. Waiting 30 minutes for all files to be written seems like a long time. Would it be possible to change the code to address this, perhaps by waiting for all SLURM jobs to finish before checking for missing files?
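For example, something like this (just a sketch on my end, not ASHS code; the ashs_stg2 job-name prefix is a guess based on the dump file names) could poll the queue until the submitted jobs are gone, instead of sleeping for a fixed interval:

# Hypothetical helper: wait until no queued or running jobs whose name starts
# with the given prefix remain for the current user, checking every 30 s.
wait_for_jobs() {
  local prefix=$1
  while squeue -u "$USER" --noheader --format="%j" | grep -q "^${prefix}"; do
    sleep 30
  done
}
wait_for_jobs ashs_stg2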

Thank you, Natasha