pegasus-isi / freesurfer-osg-workflow

A Pegasus workflow for running FreeSurfer on the Open Science Grid

STARTER at 192.168.4.2 failed to send file(s) to <192.170.227.166:9618> #7

Open soichih opened 4 years ago

soichih commented 4 years ago

I was able to run the test job successfully and obtained what appears to be valid FreeSurfer output.

However, when I ran another test job with the same T1 input, it failed with the error below.

$ pegasus-analyzer work

************************************Summary*************************************

 Submit Directory   : work
 Total jobs         :     14 (100.00%)
 # jobs succeeded   :      4 (28.57%)
 # jobs failed      :      1 (7.14%)
 # jobs held        :      1 (7.14%)
 # jobs unsubmitted :      9 (64.29%)

*******************************Held jobs' details*******************************

==========================autorecon1_sh_subject_00001===========================

submit file            : autorecon1_sh_subject_00001.sub
last_job_instance_id   : 7
reason                 :  Error from slot1_6@condor-worker-7c7d97844f-ht4ml@river-c065.ssl-hep.org: STARTER at 192.168.4.2 failed to send file(s) to <192.170.227.166:9618>: error reading from /var/lib/condor/execute/dir_1807/subject_recon1_output.tar.xz: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <192.170.236.165:60541>

******************************Failed jobs' details******************************

==========================autorecon1_sh_subject_00001===========================

 last state: POST_SCRIPT_FAILED
       site: condorpool
submit file: 00/00/autorecon1_sh_subject_00001.sub
output file: 00/00/autorecon1_sh_subject_00001.out.002
 error file: 00/00/autorecon1_sh_subject_00001.err.002

-------------------------------Task #1 - Summary--------------------------------

site        : condorpool
hostname    : condor-worker-7c7d97844f-ht4ml
executable  : /srv/autorecon1_sh
arguments   :   subject   subject-t1.nii.gz   4   -notal-check   -cw256  
exitcode    : 1
working dir : /srv

----------------Task #1 - autorecon1.sh - subject_00001 - stdout----------------

Will use SUBJECTS_DIR=/srv/tmp.1PLxkG0bOW
Subject Stamp: freesurfer-Linux-centos6_x86_64-stable-pub-v6.0.1-f53a55a
Current Stamp: freesurfer-Linux-centos6_x86_64-stable-pub-v6.0.1-f53a55a
INFO: SUBJECTS_DIR is /srv/tmp.1PLxkG0bOW
Actual FREESURFER_HOME /opt/freesurfer-6.0.1
Linux condor-worker-7c7d97844f-ht4ml 5.3.2-1.el7.elrepo.x86_64 #1 SMP Tue Oct 1 08:18:21 EDT 2019 x86_64 x86_64 x86_64 GNU/Linux
'/opt/freesurfer-6.0.1/bin/recon-all' -> '/srv/tmp.1PLxkG0bOW/subject/scripts/recon-all.local-copy'
-cw256 option is now persistent (remove with -clean-cw256)
/srv/tmp.1PLxkG0bOW/subject

 mri_convert /srv/subject-t1.nii.gz /srv/tmp.1PLxkG0bOW/subject/mri/orig/001.mgz 

mri_convert.bin /srv/subject-t1.nii.gz /srv/tmp.1PLxkG0bOW/subject/mri/orig/001.mgz 
$Id: mri_convert.c,v 1.226 2016/02/26 16:15:24 mreuter Exp $
reading from /srv/subject-t1.nii.gz...
TR=6.40, TE=0.00, TI=0.00, flip angle=0.00
i_ras = (1, 0, 0)
j_ras = (0, 1, 0)
k_ras = (0, 0, 1)
writing to /srv/tmp.1PLxkG0bOW/subject/mri/orig/001.mgz...
#--------------------------------------------

How should I handle this error?
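
For triage, it can help to pull the file path and errno out of the hold reason. The following is a minimal sketch, not part of Pegasus or HTCondor: `parse_hold_reason` and `PATTERN` are hypothetical names, and the regular expression assumes only the two message shapes that appear in this report ("error reading from ..." and "failed to write to file ...").

```python
import re

# Hold reason copied verbatim from the pegasus-analyzer output above.
REASON = (
    "Error from slot1_6@condor-worker-7c7d97844f-ht4ml@river-c065.ssl-hep.org: "
    "STARTER at 192.168.4.2 failed to send file(s) to <192.170.227.166:9618>: "
    "error reading from /var/lib/condor/execute/dir_1807/subject_recon1_output.tar.xz: "
    "(errno 2) No such file or directory; "
    "SHADOW failed to receive file(s) from <192.170.236.165:60541>"
)

# Hypothetical pattern: an assumption based on the messages in this thread,
# not an official HTCondor message format.
PATTERN = re.compile(
    r"(?P<side>STARTER|SHADOW)[^;]*?"
    r"(?P<action>error reading from|failed to write to file) "
    r"(?P<path>\S+): \(errno (?P<errno>\d+)\) (?P<message>[^;]+)"
)


def parse_hold_reason(reason):
    """Return side, action, path, errno and message, or None if no match."""
    match = PATTERN.search(reason)
    if match is None:
        return None
    info = match.groupdict()
    info["errno"] = int(info["errno"])
    info["message"] = info["message"].strip()
    return info


info = parse_hold_reason(REASON)
print(info["side"], info["path"], info["errno"], info["message"])
```

The `side` field shows which end could not find the file. The STARTER runs on the execute node, so a STARTER-side errno 2 suggests the job ended without producing the expected tarball. The SHADOW runs on the submit host, so a SHADOW-side error points at the destination path there.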

soichih commented 4 years ago

Another test failed.

*******************************Held jobs' details*******************************

==========================autorecon1_sh_subject_00001===========================

submit file            : autorecon1_sh_subject_00001.sub
last_job_instance_id   : 9
reason                 :  Error from slot1_2@condor-worker-7c7d97844f-m2hrq@river-c048.ssl-hep.org: STARTER at 192.168.8.30 failed to send file(s) to <192.170.227.166:9618>: error reading from /var/lib/condor/execute/dir_3291/subject_recon1_output.tar.xz: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <192.170.236.148:34378>

******************************Failed jobs' details******************************

==========================autorecon1_sh_subject_00001===========================

 last state: POST_SCRIPT_FAILED
       site: condorpool
submit file: 00/00/autorecon1_sh_subject_00001.sub
output file: 00/00/autorecon1_sh_subject_00001.out.002
 error file: 00/00/autorecon1_sh_subject_00001.err.002

-------------------------------Task #1 - Summary--------------------------------

site        : condorpool
hostname    : -
executable  : /public/hayashis/workdir/5ea9b8ff4623dab009eddf97/5eab20644623da3b9fee55eb/work/00/00/autorecon1_sh_subject_00001.sh
arguments   : -
exitcode    : -1
working dir : /public/hayashis/workdir/5ea9b8ff4623dab009eddf97/5eab20644623da3b9fee55eb/work

soichih commented 4 years ago

Here is another instance of this error.

*******************************Held jobs' details*******************************

===========================autorecon1_sh_output_00001===========================

submit file            : autorecon1_sh_output_00001.sub
last_job_instance_id   : 5
reason                 :  Error from slot1_5@glidein_19785_659649776@lnxfarm338.colorado.edu: STARTER at 192.168.4.138 failed to send file(s) to <192.170.227.166:9618>; SHADOW at 192.170.227.166 failed to write to file /public/hayashis/scratch/work/output_recon1_output.tar.xz: (errno 2) No such file or directory

******************************Failed jobs' details******************************

=========================autorecon2_sh_output-rh_00002==========================

 last state: POST_SCRIPT_FAILED
       site: condorpool
submit file: 00/00/autorecon2_sh_output-rh_00002.sub
output file: 00/00/autorecon2_sh_output-rh_00002.out
 error file: 00/00/autorecon2_sh_output-rh_00002.err

-------------------------------Task #1 - Summary--------------------------------

site        : condorpool
hostname    : -
executable  : /public/hayashis/workdir/5ea9b8ff4623dab009eddf97/5eadfd754623da2032ef1e58/work/00/00/autorecon2_sh_output-rh_00002.sh
arguments   : -
exitcode    : -1
working dir : /public/hayashis/workdir/5ea9b8ff4623dab009eddf97/5eadfd754623da2032ef1e58/work

-----------Job stderr file - 00/00/autorecon2_sh_output-rh_00002.err------------

Job submission failed because of HTCondor event SUBMIT_FAILED

=========================autorecon2_sh_output-lh_00003==========================

 last state: POST_SCRIPT_FAILED
       site: condorpool
submit file: 00/00/autorecon2_sh_output-lh_00003.sub
output file: 00/00/autorecon2_sh_output-lh_00003.out
 error file: 00/00/autorecon2_sh_output-lh_00003.err

-------------------------------Task #1 - Summary--------------------------------

site        : condorpool
hostname    : -
executable  : /public/hayashis/workdir/5ea9b8ff4623dab009eddf97/5eadfd754623da2032ef1e58/work/00/00/autorecon2_sh_output-lh_00003.sh
arguments   : -
exitcode    : -1
working dir : /public/hayashis/workdir/5ea9b8ff4623dab009eddf97/5eadfd754623da2032ef1e58/work

-----------Job stderr file - 00/00/autorecon2_sh_output-lh_00003.err------------

Job submission failed because of HTCondor event SUBMIT_FAILED