Open junyussh opened 1 week ago
It could just be a slow download. You could wait for the workflow to finish (this takes some time due to retries/backoffs) and then run pegasus-analyzer, which will tell you what the issue is.
In the meantime, you can check that access to the data and containers is working correctly. On the same host where you are running the workflow, try the following two commands:
wget https://data.isi.edu/montage/images/montage-workflow-v3.sif
wget 'http://archive.stsci.edu/cgi-bin/dss_search?v=poss2ukstu_red&r=56.589814&d=23.600940&e=J2000&w=42.60&h=42.60&f=fits&c=gz'
Are they getting stuck? Taking a long time?
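If wget is not handy, the same check can be sketched in Python. This is only a minimal sketch using the standard library: `time_download` is a hypothetical helper name, and the 300-second timeout and 1 MiB chunk size are assumptions, not Pegasus settings. It streams a URL, discards the data, and reports how many bytes arrived and how long that took.

```python
import time
import urllib.request


def time_download(url, timeout=300, chunk_size=1 << 20):
    """Stream `url`, discard the data, and return (bytes_read, seconds_elapsed).

    `timeout` is the per-operation socket timeout in seconds (an assumed value).
    """
    start = time.monotonic()
    total = 0
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:  # empty read means end of stream
                break
            total += len(chunk)
    return total, time.monotonic() - start
```

Calling `time_download` on each of the two URLs above and printing the result should show whether either transfer stalls or is simply slow.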
Yes, the analyzer says it was a download failure.
2024-11-07 05:14:20,919 ERROR: Command exited with non-zero exit code (4): /usr/bin/wget -nv --no-cookies --no-check-certificate --timeout=300 --tries=1 -O '//home/scitech/montage-workflow-v3/wf-scratch/LOCAL/scitech/pegasus/montage/run0001/./montage.sif' 'https://data.isi.edu/montage/images/montage-workflow-v3.sif'
2024-11-07 05:17:47,647 INFO: 2024-11-07 05:17:47 URL:https://archive.stsci.edu/cgi-bin/dss_search?v=poss2ukstu_ir&r=55.498737&d=23.596359&e=J2000&w=42.60&h=42.60&f=fits&c=gz [6787976/6787976] -> "//home/scitech/montage-workflow-v3/wf-scratch/LOCAL/scitech/pegasus/montage/run0001/./poss2ukstu_ir_004_001.fits" [1]
2024-11-07 05:17:47,647 INFO: /usr/bin/pegasus-integrity --generate-fullstat-yaml="poss2ukstu_ir_004_001.fits=//home/scitech/montage-workflow-v3/wf-scratch/LOCAL/scitech/pegasus/montage/run0001/./poss2ukstu_ir_004_001.fits"
2024-11-07 05:20:03,774 INFO: --------------------------------------------------------------------------------
2024-11-07 05:20:03,775 INFO: Starting transfers - attempt 2
2024-11-07 05:20:05,776 INFO: /usr/bin/wget -nv --no-cookies --no-check-certificate --timeout=300 --tries=1 -O '//home/scitech/montage-workflow-v3/wf-scratch/LOCAL/scitech/pegasus/montage/run0001/./montage.sif' 'https://data.isi.edu/montage/images/montage-workflow-v3.sif'
I am wondering whether it is possible to use the already-downloaded sif image when submitting the plan instead of downloading it from the web. I just want the workflow to use the local image (the sif file). It seems that the image value in the Container API cannot be a local file:// URL, or else planning throws an error. Note: The workflow runs on the local pool.
container = None
if tc_target == 'container':
    container = Container('montage',
                          Container.SINGULARITY,
                          'file:///home/scitech/montage-workflow-v3/montage-workflow-v3.sif'
                          ).add_env(MONTAGE_HOME='/opt/Montage')
    tc.add_containers(container)
2024.11.07 17:00:43.418 UTC: [FATAL ERROR]
[1] java.lang.RuntimeException: Site Selector could not map the job mDiffFit with id mDiffFit_ID0000280
to any of the execution sites [condorpool]
using the Transformation Mapper (All Mode - Handle both Installed and Stageable Executables on all sites)
This error is most likely due to an error in the transformation catalog.
Make sure that the mDiffFit transformation exists with matching system information for sites
[condorpool] you are trying to plan for {condorpool={arch=x86_64 os=linux}}
Candidate Entries found were [
Logical Namespace : null
Logical Name : mDiffFit
Version : null
Resource Id : insidecontainer
Physical Name : /opt/Montage/bin/mDiffFit
SysInfo : {arch=x86_64 os=linux}
TYPE : INSTALLED
BYPASS : false
Profiles : profile condor request_memory 1 GB
profile pegasus clusters.size 3
Notifications:
Container : cont montage.sif{
type singularity
image file:///home/scitech/montage-workflow-v3/montage-workflow-v3.sif
image_site null
bypass false
profile env MONTAGE_HOME /opt/Montage
}
Compound Tx : Transformation -> mDiffFit
executable ->
Logical Name :mFitplane
Type :executable
Size :-1.0
Transient Flags (transfer,optional,dontRegister,cleanup,integrity,bypass,plannerUse): ( 0,false,false,true,true,false,false)metadata
executable ->
Logical Name :mDiff
Type :executable
Size :-1.0
Transient Flags (transfer,optional,dontRegister,cleanup,integrity,bypass,plannerUse): ( 0,false,false,true,true,false,false)metadata
Notifications ->
] at edu.isi.pegasus.planner.refiner.InterPoolEngine.complainForFailedSiteMapping(InterPoolEngine.java:860)
It seems Pegasus is a little bit sensitive to the site in this case - we will improve this in the next version of Pegasus. In the meantime, here is a workaround: set image_site="local" on the container, and site='condorpool' on the transformation. A diff that works for me:
diff --git a/montage-workflow.py b/montage-workflow.py
index 14d6474..31fa996 100755
--- a/montage-workflow.py
+++ b/montage-workflow.py
@@ -69,7 +69,8 @@ def build_transformation_catalog(tc_target, wf):
     if tc_target == 'container':
         container = Container('montage',
                               Container.SINGULARITY,
-                              'https://data.isi.edu/montage/images/montage-workflow-v3.sif'
+                              'file:///local-scratch/rynge/montage-workflow-v3/montage-workflow-v3.sif',
+                              image_site="local"
                               ).add_env(MONTAGE_HOME='/opt/Montage')
         tc.add_containers(container)
@@ -87,7 +88,7 @@ def build_transformation_catalog(tc_target, wf):
         else:
             # container
             transformation = Transformation(fname,
-                                            site='insidecontainer',
+                                            site='condorpool',
                                             pfn=os.path.join(base_dir, fname),
                                             container=container,
                                             is_stageable=False)
I've followed your workaround. The sif file is located at /home/scitech/montage-workflow-v3/scratch/montage-workflow-v3.sif.
diff --git a/montage-workflow.py b/montage-workflow.py
index 14d6474..2c4a209 100755
--- a/montage-workflow.py
+++ b/montage-workflow.py
@@ -69,7 +69,8 @@ def build_transformation_catalog(tc_target, wf):
     if tc_target == 'container':
         container = Container('montage',
                               Container.SINGULARITY,
-                              'https://data.isi.edu/montage/images/montage-workflow-v3.sif'
+                              'file:///local-scratch/scitech/montage-workflow-v3/montage-workflow-v3.sif',
+                              image_site="local"
                               ).add_env(MONTAGE_HOME='/opt/Montage')
         tc.add_containers(container)
@@ -87,7 +88,7 @@ def build_transformation_catalog(tc_target, wf):
         else:
             # container
             transformation = Transformation(fname,
-                                            site='insidecontainer',
+                                            site='condorpool',
                                             pfn=os.path.join(base_dir, fname),
                                             container=container,
                                             is_stageable=False)
Then I got this result.
===========================stage_in_remote_local_0_1============================
last state: POST_SCRIPT_FAILED
site: local
submit file: 00/01/stage_in_remote_local_0_1.sub
output file: 00/01/stage_in_remote_local_0_1.out.000
error file: 00/01/stage_in_remote_local_0_1.err.000
-------------------------------Task #1 - Summary--------------------------------
site : local
hostname : 1dfde75eea2f
executable : /usr/bin/pegasus-transfer
arguments : -n pegasus::transfer -N null -i - -R local -L montage -T 2024-11-08T12:18:06+00:00 /usr/bin/pegasus-transfer --threads 2
exitcode : 1
working dir : /home/scitech/montage-workflow-v3/work/scitech/pegasus/montage/run0013
------------------Task #1 - pegasus::transfer - None - stdout-------------------
2024-11-08 04:20:44,919 INFO: Reading URL pairs from stdin
2024-11-08 04:20:44,920 INFO: 1 transfers loaded
2024-11-08 04:20:44,920 INFO: PATH=/home/scitech/.pyenv/bin:/home/scitech//.local/bin:/home/scitech//bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/lib64/mpich/bin
2024-11-08 04:20:44,920 INFO: LD_LIBRARY_PATH=
2024-11-08 04:20:44,932 INFO: --------------------------------------------------------------------------------
2024-11-08 04:20:44,932 INFO: Starting transfers - attempt 1
2024-11-08 04:20:46,934 ERROR: Expected local file does not exist: /local-scratch/scitech/montage-workflow-v3/montage-workflow-v3.sif
2024-11-08 04:23:05,008 INFO: --------------------------------------------------------------------------------
2024-11-08 04:23:05,008 INFO: Starting transfers - attempt 2
2024-11-08 04:23:07,010 ERROR: Expected local file does not exist: /local-scratch/scitech/montage-workflow-v3/montage-workflow-v3.sif
2024-11-08 04:28:07,092 INFO: --------------------------------------------------------------------------------
2024-11-08 04:28:07,092 INFO: Starting transfers - attempt 3
2024-11-08 04:28:09,094 ERROR: Expected local file does not exist: /local-scratch/scitech/montage-workflow-v3/montage-workflow-v3.sif
2024-11-08 04:28:09,095 INFO: --------------------------------------------------------------------------------
2024-11-08 04:28:09,095 INFO: Stats: Total 3 transfers, 0.0 B transferred in 444 seconds. Rate: 0.0 B/s (0.0 b/s)
2024-11-08 04:28:09,095 INFO: Between sites local->local : 3 transfers, 0.0 B transferred in 444 seconds. Rate: 0.0 B/s (0.0 b/s)
2024-11-08 04:28:09,095 CRITICAL: Some transfers failed! See above, and possibly stderr.
**************************************Done**************************************
Do I also need to write a sites.yml? If so, could you share yours? I've tried several versions of sites.yml, and each of them failed during planning.
pegasus: '5.0'
sites:
- name: local
  directories:
  - type: localScratch
    path: /tmp/wf/scratch
    fileServers:
    - url: file:///home/scitech/montage-workflow-v3/scratch
      operation: all
pegasus: '5.0'
sites:
- name: condorpool
  directories:
  - type: localScratch
    path: /tmp/wf/scratch
    fileServers:
    - url: file:///home/scitech/montage-workflow-v3/scratch
      operation: all
$ pegasus-plan \
> --dir work \
> --output-site local \
> --cluster horizontal \
> data/montage-workflow.yml
2024.11.08 12:53:25.105 UTC: [FATAL ERROR]
[1] java.lang.RuntimeException: [DeployWorkerPackage] Unable to determine URL Prefix for the FileServer for operation put for shared scratch file system on site: local at edu.isi.pegasus.planner.refiner.Engine.complainForHeadNodeURLPrefix(Engine.java:125)
Did you change that file:// location to your location (/home/scitech/montage-workflow-v3/scratch/montage-workflow-v3.sif)?
You should not need a site catalog - the default here is using HTCondor's builtin file transfers.
What version of Pegasus are you using?
Yes, I changed the image path in montage-workflow.py as shown in my previous comment. The image path is set to file:///local-scratch/scitech/montage-workflow-v3/montage-workflow-v3.sif. Or do you mean I shouldn't include local-scratch in the location?
$ realpath scratch/montage-workflow-v3.sif
/home/scitech/montage-workflow-v3/scratch/montage-workflow-v3.sif
I also tried changing the path to file:///local-scratch/montage-workflow-v3.sif, because I thought local-scratch might refer to /home/scitech/montage-workflow-v3/scratch/, but I still got a transfer error.
container = None
if tc_target == 'container':
    container = Container('montage',
                          Container.SINGULARITY,
                          'file:///local-scratch/montage-workflow-v3.sif',
                          image_site="local"
                          ).add_env(MONTAGE_HOME='/opt/Montage')
    tc.add_containers(container)
My Pegasus version is 5.0.8.
$ pegasus-version
5.0.8
I mean that /local-scratch/scitech/montage-workflow-v3/montage-workflow-v3.sif was the path on my system. You have to replace it with the path on your system (/home/scitech/montage-workflow-v3/montage-workflow-v3.sif).
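One way to avoid the "Expected local file does not exist" failure seen earlier is to build the file:// URL from a path that is verified to exist on the submit host. The following is only a sketch: `local_image_url` is a hypothetical helper, not part of the Pegasus API, and it assumes the image lives on the same machine that runs the planner.

```python
from pathlib import Path


def local_image_url(image_path):
    """Return a file:// URL for `image_path`, or raise if the file is missing.

    Hypothetical helper (not a Pegasus function): it resolves the path to an
    absolute one so the resulting URL does not depend on the working directory.
    """
    path = Path(image_path).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"Container image not found: {path}")
    return path.as_uri()
```

The returned string could then be passed as the image argument of Container, together with image_site="local", so a wrong path fails immediately when the workflow is generated rather than after several transfer retries.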
You can reproduce this in the Pegasus tutorial container. I downloaded montage-workflow-v3.sif and modified the apptainer command in example-dss-containers.sh to make it use the local image. Ten minutes after submitting the plan, only stage_in_local_local_0_0 is still running. It has now been running for over an hour, while the workflow gallery says the workflow should take only 5 minutes. In the run dir, nothing is written to the stdout or stderr files. The contents of stdin are shown below.