pegasus-isi / montage-workflow-v3

A new Python DAX generator version of the classic Montage workflow. This workflow uses the Montage toolkit to re-project, background correct and add astronomical images into custom mosaics.
Apache License 2.0

Workflow always stuck on stage_in_local_local_0_0 when using container #1

Open junyussh opened 1 week ago

junyussh commented 1 week ago

You can reproduce this in the Pegasus tutorial container. I downloaded montage-workflow-v3.sif and modified the apptainer command in example-dss-containers.sh so that it uses the local image.

#!/bin/bash

set -e

export PYTHONPATH=`pegasus-config --python`:$PYTHONPATH

if [ ! -e montage-workflow.py ]; then
    echo "Error: You have to run this script from the top level workflow checkout" 1>&2
    exit 1
fi
rm -rf data

apptainer exec \
            --bind $PWD \
            montage-workflow-v3.sif \
            $PWD/montage-workflow.py \
                --work-dir $PWD \
                --tc-target container \
                --center "56.7 24.00" \
                --degrees 1.0 \
                --band dss:DSS2B:blue \
                --band dss:DSS2R:green \
                --band dss:DSS2IR:red

pegasus-plan \
        --dir work \
        --output-site local \
        --cluster horizontal \
        data/montage-workflow.yml

About 10 minutes after submitting the plan, stage_in_local_local_0_0 is the only job still running. It has now been running for over an hour, although the workflow gallery says the workflow should take only 5 minutes.

[scitech@1dfde75eea2f 01]$ pegasus-status -l /home/scitech/montage-workflow-v3/work/scitech/pegasus/montage/run0001
STAT  IN_STATE  JOB                                                                                 
Run   01:36:18  montage-0 ( /home/scitech/montage-workflow-v3/work/scitech/pegasus/montage/run0001 )
Run   01:35:48   ┗━stage_in_local_local_0_0                                                         
Summary: 2 Condor jobs total (R:2)

UNRDY READY   PRE  IN_Q  POST  DONE  FAIL %DONE STATE   DAGNAME                                 
  248     0     0     1     0    12     0   4.6 Running *montage-0.dag     

In the run directory, nothing is written to the stdout or stderr files. The contents of stdin are shown below.

[
 { "type": "transfer",
   "linkage": "input",
   "lfn": "montage.sif",
   "id": 1,
   "generate_checksum": true,
   "src_urls": [
     { "site_label": "CONTAINER_SITE", "url": "https://data.isi.edu/montage/images/montage-workflow-v3.sif", "priority": 10 }
   ],
   "dest_urls": [
     { "site_label": "local", "url": "file:////home/scitech/montage-workflow-v3/wf-scratch/LOCAL/scitech/pegasus/montage/run0001/./montage.sif", "type": "singularity" }
   ] }
 ,
 { "type": "transfer",
   "linkage": "input",
   "lfn": "poss2ukstu_red_002_001.fits",
   "id": 2,
   "generate_checksum": true,
   "src_urls": [
     { "site_label": "ipac", "url": "http://archive.stsci.edu/cgi-bin/dss_search?v=poss2ukstu_red&r=56.589814&d=23.600940&e=J2000&w=42.60&h=42.60&f=fits&c=gz", "priority": 10 }
   ],
   "dest_urls": [
     { "site_label": "local", "url": "file:////home/scitech/montage-workflow-v3/wf-scratch/LOCAL/scitech/pegasus/montage/run0001/./poss2ukstu_red_002_001.fits" }
   ] }
 ,
 { "type": "transfer",
   "linkage": "input",
   "lfn": "poss2ukstu_red_003_002.fits",
   "id": 3,
   "generate_checksum": true,
   "src_urls": [
     { "site_label": "ipac", "url": "http://archive.stsci.edu/cgi-bin/dss_search?v=poss2ukstu_red&r=56.041666&d=24.099562&e=J2000&w=42.60&h=42.60&f=fits&c=gz", "priority": 10 }
   ],
   "dest_urls": [
     { "site_label": "local", "url": "file:////home/scitech/montage-workflow-v3/wf-scratch/LOCAL/scitech/pegasus/montage/run0001/./poss2ukstu_red_003_002.fits" }
   ] }
 ,
 { "type": "transfer",
   "linkage": "input",
   "lfn": "poss2ukstu_red_004_003.fits",
   "id": 4,
   "generate_checksum": true,
   "src_urls": [
     { "site_label": "ipac", "url": "http://archive.stsci.edu/cgi-bin/dss_search?v=poss2ukstu_red&r=55.489361&d=24.596109&e=J2000&w=42.60&h=42.60&f=fits&c=gz", "priority": 10 }
   ],
   "dest_urls": [
     { "site_label": "local", "url": "file:////home/scitech/montage-workflow-v3/wf-scratch/LOCAL/scitech/pegasus/montage/run0001/./poss2ukstu_red_004_003.fits" }
   ] }
 ,
 { "type": "transfer",
   "linkage": "input",
   "lfn": "poss2ukstu_blue_003_001.fits",
   "id": 5,
   "generate_checksum": true,
   "src_urls": [
     { "site_label": "ipac", "url": "http://archive.stsci.edu/cgi-bin/dss_search?v=poss2ukstu_blue&r=56.044216&d=23.599602&e=J2000&w=42.60&h=42.60&f=fits&c=gz", "priority": 10 }
   ],
   "dest_urls": [
     { "site_label": "local", "url": "file:////home/scitech/montage-workflow-v3/wf-scratch/LOCAL/scitech/pegasus/montage/run0001/./poss2ukstu_blue_003_001.fits" }
   ] }
 ,
 { "type": "transfer",
   "linkage": "input",
   "lfn": "poss2ukstu_blue_002_001.fits",
   "id": 6,
   "generate_checksum": true,
   "src_urls": [
     { "site_label": "ipac", "url": "http://archive.stsci.edu/cgi-bin/dss_search?v=poss2ukstu_blue&r=56.589814&d=23.600940&e=J2000&w=42.60&h=42.60&f=fits&c=gz", "priority": 10 }
   ],
   "dest_urls": [
     { "site_label": "local", "url": "file:////home/scitech/montage-workflow-v3/wf-scratch/LOCAL/scitech/pegasus/montage/run0001/./poss2ukstu_blue_002_001.fits" }
   ] }
 ,
 { "type": "transfer",
   "linkage": "input",
   "lfn": "poss2ukstu_blue_004_002.fits",
   "id": 7,
   "generate_checksum": true,
   "src_urls": [
     { "site_label": "ipac", "url": "http://archive.stsci.edu/cgi-bin/dss_search?v=poss2ukstu_blue&r=55.494068&d=24.096242&e=J2000&w=42.60&h=42.60&f=fits&c=gz", "priority": 10 }
   ],
   "dest_urls": [
     { "site_label": "local", "url": "file:////home/scitech/montage-workflow-v3/wf-scratch/LOCAL/scitech/pegasus/montage/run0001/./poss2ukstu_blue_004_002.fits" }
   ] }
 ,
 { "type": "transfer",
   "linkage": "input",
   "lfn": "poss2ukstu_ir_001_003.fits",
   "id": 8,
   "generate_checksum": true,
   "src_urls": [
     { "site_label": "ipac", "url": "http://archive.stsci.edu/cgi-bin/dss_search?v=poss2ukstu_ir&r=57.138831&d=24.600314&e=J2000&w=42.60&h=42.60&f=fits&c=gz", "priority": 10 }
   ],
   "dest_urls": [
     { "site_label": "local", "url": "file:////home/scitech/montage-workflow-v3/wf-scratch/LOCAL/scitech/pegasus/montage/run0001/./poss2ukstu_ir_001_003.fits" }
   ] }
 ,
 { "type": "transfer",
   "linkage": "input",
   "lfn": "poss2ukstu_ir_004_001.fits",
   "id": 9,
   "generate_checksum": true,
   "src_urls": [
     { "site_label": "ipac", "url": "http://archive.stsci.edu/cgi-bin/dss_search?v=poss2ukstu_ir&r=55.498737&d=23.596359&e=J2000&w=42.60&h=42.60&f=fits&c=gz", "priority": 10 }
   ],
   "dest_urls": [
     { "site_label": "local", "url": "file:////home/scitech/montage-workflow-v3/wf-scratch/LOCAL/scitech/pegasus/montage/run0001/./poss2ukstu_ir_004_001.fits" }
   ] }
 ,
 { "type": "transfer",
   "linkage": "input",
   "lfn": "poss2ukstu_ir_002_004.fits",
   "id": 10,
   "generate_checksum": true,
   "src_urls": [
     { "site_label": "ipac", "url": "http://archive.stsci.edu/cgi-bin/dss_search?v=poss2ukstu_ir&r=56.588519&d=25.100795&e=J2000&w=42.60&h=42.60&f=fits&c=gz", "priority": 10 }
   ],
   "dest_urls": [
     { "site_label": "local", "url": "file:////home/scitech/montage-workflow-v3/wf-scratch/LOCAL/scitech/pegasus/montage/run0001/./poss2ukstu_ir_002_004.fits" }
   ] }
]

rynge commented 1 week ago

It could just be a slow download. You could wait for the workflow to finish (takes some time due to retries/backoffs), and then run pegasus-analyzer which will tell you what the issue is.

In the meantime, you can check that access to the data and containers is working correctly. On the same host where you are running the workflow, try the following two commands:

wget https://data.isi.edu/montage/images/montage-workflow-v3.sif

wget 'http://archive.stsci.edu/cgi-bin/dss_search?v=poss2ukstu_red&r=56.589814&d=23.600940&e=J2000&w=42.60&h=42.60&f=fits&c=gz'

Are they getting stuck? Taking a long time?
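
The same check can be scripted so that a slow or stalled source shows up as a number rather than a feeling. The following is a minimal sketch using only the Python standard library; the names `probe` and `report` are hypothetical and not part of Pegasus. Note that the `timeout` argument of `urlopen` applies to each blocking socket operation, not to the download as a whole.

```python
import time
import urllib.request


def probe(url, timeout=30.0, max_bytes=1024 * 1024):
    """Read at most max_bytes from url; return (bytes_read, seconds_elapsed)."""
    start = time.monotonic()
    read = 0
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        while read < max_bytes:
            chunk = resp.read(min(65536, max_bytes - read))
            if not chunk:  # end of the response body
                break
            read += len(chunk)
    return read, time.monotonic() - start


def report(urls, **kwargs):
    """Print one OK/FAIL line per URL instead of stopping at the first error."""
    for url in urls:
        try:
            n, secs = probe(url, **kwargs)
            print(f"OK   {n} bytes in {secs:.1f}s  {url}")
        except OSError as exc:  # URLError and socket timeouts derive from OSError
            print(f"FAIL {exc}  {url}")


# Example call, using the two URLs from the wget commands above:
# report([
#     "https://data.isi.edu/montage/images/montage-workflow-v3.sif",
#     "http://archive.stsci.edu/cgi-bin/dss_search?v=poss2ukstu_red&r=56.589814"
#     "&d=23.600940&e=J2000&w=42.60&h=42.60&f=fits&c=gz",
# ])
```

Only the first megabyte is read by default, which is enough to tell a reachable server from a stalled one without pulling the whole image.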

junyussh commented 1 week ago

Yes, the analyzer says it was a download failure.

2024-11-07 05:14:20,919   ERROR:  Command exited with non-zero exit code (4): /usr/bin/wget -nv --no-cookies --no-check-certificate --timeout=300 --tries=1 -O '//home/scitech/montage-workflow-v3/wf-scratch/LOCAL/scitech/pegasus/montage/run0001/./montage.sif' 'https://data.isi.edu/montage/images/montage-workflow-v3.sif'
2024-11-07 05:17:47,647    INFO:  2024-11-07 05:17:47 URL:https://archive.stsci.edu/cgi-bin/dss_search?v=poss2ukstu_ir&r=55.498737&d=23.596359&e=J2000&w=42.60&h=42.60&f=fits&c=gz [6787976/6787976] -> "//home/scitech/montage-workflow-v3/wf-scratch/LOCAL/scitech/pegasus/montage/run0001/./poss2ukstu_ir_004_001.fits" [1]
2024-11-07 05:17:47,647    INFO:  /usr/bin/pegasus-integrity --generate-fullstat-yaml="poss2ukstu_ir_004_001.fits=//home/scitech/montage-workflow-v3/wf-scratch/LOCAL/scitech/pegasus/montage/run0001/./poss2ukstu_ir_004_001.fits"
2024-11-07 05:20:03,774    INFO:  --------------------------------------------------------------------------------
2024-11-07 05:20:03,775    INFO:  Starting transfers - attempt 2
2024-11-07 05:20:05,776    INFO:  /usr/bin/wget -nv --no-cookies --no-check-certificate --timeout=300 --tries=1 -O '//home/scitech/montage-workflow-v3/wf-scratch/LOCAL/scitech/pegasus/montage/run0001/./montage.sif' 'https://data.isi.edu/montage/images/montage-workflow-v3.sif'
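
For reading the log above: according to the Exit Status section of the GNU Wget manual, exit code 4 means a network failure. The sketch below pulls the failed wget commands out of log text in the format shown and attaches that meaning; the function name `explain_failures` is hypothetical, and the regular expression assumes the exact log wording seen in this thread.

```python
import re

# Exit codes as documented in the GNU Wget manual ("Exit Status").
WGET_EXIT_CODES = {
    0: "no problems occurred",
    1: "generic error",
    2: "parse error",
    3: "file I/O error",
    4: "network failure",
    5: "SSL verification failure",
    6: "username/password authentication failure",
    7: "protocol error",
    8: "server issued an error response",
}

_FAILED = re.compile(r"Command exited with non-zero exit code \((\d+)\): (\S+)")


def explain_failures(log_text):
    """Return (exit_code, meaning, url) for each failed wget command in the log."""
    results = []
    for line in log_text.splitlines():
        m = _FAILED.search(line)
        if not m or not m.group(2).endswith("wget"):
            continue
        code = int(m.group(1))
        # In the log format above, the last single-quoted token is the source URL.
        quoted = re.findall(r"'([^']*)'", line)
        url = quoted[-1] if quoted else None
        results.append((code, WGET_EXIT_CODES.get(code, "unknown"), url))
    return results
```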

Is it possible to use the already downloaded sif image when planning, instead of downloading it from the web? I just want the workflow to use the local image (the sif file). It seems that the image value passed to the Container API cannot be a local file URL; with one, planning fails with the error below. Note: the workflow runs on the local pool.

container = None
if tc_target == 'container':
    container = Container('montage',
        Container.SINGULARITY,
        'file:///home/scitech/montage-workflow-v3/montage-workflow-v3.sif'
        ).add_env(MONTAGE_HOME='/opt/Montage')
    tc.add_containers(container)

2024.11.07 17:00:43.418 UTC: [FATAL ERROR]  
 [1] java.lang.RuntimeException: Site Selector could not map the job mDiffFit with id mDiffFit_ID0000280
to any of the execution sites [condorpool]
using the Transformation Mapper (All Mode - Handle both Installed and Stageable Executables on all sites)
This error is most likely due to an error in the transformation catalog.
Make sure that the mDiffFit transformation exists with matching system information for sites 
[condorpool] you are trying to plan for {condorpool={arch=x86_64 os=linux}}
Candidate Entries found were [

 Logical Namespace : null
 Logical Name      : mDiffFit
 Version           : null
 Resource Id       : insidecontainer
 Physical Name     : /opt/Montage/bin/mDiffFit
 SysInfo           : {arch=x86_64 os=linux}
 TYPE              : INSTALLED
 BYPASS              : false
 Profiles : profile condor request_memory 1 GB
profile pegasus clusters.size 3

 Notifications: 
 Container    : cont montage.sif{
        type            singularity
        image           file:///home/scitech/montage-workflow-v3/montage-workflow-v3.sif
        image_site      null
        bypass  false
        profile         env     MONTAGE_HOME /opt/Montage
}

 Compound Tx  : Transformation -> mDiffFit
         executable -> 
 Logical Name :mFitplane
 Type         :executable
 Size         :-1.0
 Transient Flags (transfer,optional,dontRegister,cleanup,integrity,bypass,plannerUse): ( 0,false,false,true,true,false,false)metadata
         executable -> 
 Logical Name :mDiff
 Type         :executable
 Size         :-1.0
 Transient Flags (transfer,optional,dontRegister,cleanup,integrity,bypass,plannerUse): ( 0,false,false,true,true,false,false)metadata
Notifications -> 
] at edu.isi.pegasus.planner.refiner.InterPoolEngine.complainForFailedSiteMapping(InterPoolEngine.java:860) 

rynge commented 1 week ago

It seems Pegasus is a little bit sensitive to the site in this case; we will improve this in the next version of Pegasus. In the meantime, here is a workaround: set image_site="local" on the container, and site='condorpool' on the transformation. A diff that works for me:

diff --git a/montage-workflow.py b/montage-workflow.py
index 14d6474..31fa996 100755
--- a/montage-workflow.py
+++ b/montage-workflow.py
@@ -69,7 +69,8 @@ def build_transformation_catalog(tc_target, wf):
     if tc_target == 'container':
         container = Container('montage',
             Container.SINGULARITY,
-            'https://data.isi.edu/montage/images/montage-workflow-v3.sif'
+            'file:///local-scratch/rynge/montage-workflow-v3/montage-workflow-v3.sif',
+            image_site="local"
             ).add_env(MONTAGE_HOME='/opt/Montage')
         tc.add_containers(container)

@@ -87,7 +88,7 @@ def build_transformation_catalog(tc_target, wf):
         else:
             # container
             transformation = Transformation(fname,
-                                            site='insidecontainer',
+                                            site='condorpool',
                                             pfn=os.path.join(base_dir, fname),
                                             container=container,
                                             is_stageable=False)

junyussh commented 1 week ago

I've followed your workaround. The sif file is located at /home/scitech/montage-workflow-v3/scratch/montage-workflow-v3.sif.

diff --git a/montage-workflow.py b/montage-workflow.py
index 14d6474..2c4a209 100755
--- a/montage-workflow.py
+++ b/montage-workflow.py
@@ -69,7 +69,8 @@ def build_transformation_catalog(tc_target, wf):
     if tc_target == 'container':
         container = Container('montage',
             Container.SINGULARITY,
-            'https://data.isi.edu/montage/images/montage-workflow-v3.sif'
+            'file:///local-scratch/scitech/montage-workflow-v3/montage-workflow-v3.sif',
+            image_site="local"
             ).add_env(MONTAGE_HOME='/opt/Montage')
         tc.add_containers(container)

@@ -87,7 +88,7 @@ def build_transformation_catalog(tc_target, wf):
         else:
             # container
             transformation = Transformation(fname,
-                                            site='insidecontainer',
+                                            site='condorpool',
                                             pfn=os.path.join(base_dir, fname),
                                             container=container,
                                             is_stageable=False)

Then I got this result.

===========================stage_in_remote_local_0_1============================

 last state: POST_SCRIPT_FAILED
       site: local
submit file: 00/01/stage_in_remote_local_0_1.sub
output file: 00/01/stage_in_remote_local_0_1.out.000
 error file: 00/01/stage_in_remote_local_0_1.err.000

-------------------------------Task #1 - Summary--------------------------------

site        : local
hostname    : 1dfde75eea2f
executable  : /usr/bin/pegasus-transfer
arguments   :  -n pegasus::transfer -N null -i - -R local  -L montage -T 2024-11-08T12:18:06+00:00 /usr/bin/pegasus-transfer  --threads 2 
exitcode    : 1
working dir : /home/scitech/montage-workflow-v3/work/scitech/pegasus/montage/run0013

------------------Task #1 - pegasus::transfer - None - stdout-------------------

2024-11-08 04:20:44,919    INFO:  Reading URL pairs from stdin
2024-11-08 04:20:44,920    INFO:  1 transfers loaded
2024-11-08 04:20:44,920    INFO:  PATH=/home/scitech/.pyenv/bin:/home/scitech//.local/bin:/home/scitech//bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/lib64/mpich/bin
2024-11-08 04:20:44,920    INFO:  LD_LIBRARY_PATH=
2024-11-08 04:20:44,932    INFO:  --------------------------------------------------------------------------------
2024-11-08 04:20:44,932    INFO:  Starting transfers - attempt 1
2024-11-08 04:20:46,934   ERROR:  Expected local file does not exist: /local-scratch/scitech/montage-workflow-v3/montage-workflow-v3.sif
2024-11-08 04:23:05,008    INFO:  --------------------------------------------------------------------------------
2024-11-08 04:23:05,008    INFO:  Starting transfers - attempt 2
2024-11-08 04:23:07,010   ERROR:  Expected local file does not exist: /local-scratch/scitech/montage-workflow-v3/montage-workflow-v3.sif
2024-11-08 04:28:07,092    INFO:  --------------------------------------------------------------------------------
2024-11-08 04:28:07,092    INFO:  Starting transfers - attempt 3
2024-11-08 04:28:09,094   ERROR:  Expected local file does not exist: /local-scratch/scitech/montage-workflow-v3/montage-workflow-v3.sif
2024-11-08 04:28:09,095    INFO:  --------------------------------------------------------------------------------
2024-11-08 04:28:09,095    INFO:  Stats: Total 3 transfers, 0.0 B transferred in 444 seconds. Rate: 0.0 B/s (0.0 b/s)
2024-11-08 04:28:09,095    INFO:         Between sites local->local : 3 transfers, 0.0 B transferred in 444 seconds. Rate: 0.0 B/s (0.0 b/s)
2024-11-08 04:28:09,095 CRITICAL:  Some transfers failed! See above, and possibly stderr.

**************************************Done**************************************
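
The three attempts above all fail on the same path. A small sketch for pulling the distinct missing paths out of such output follows, so they can be compared against where the file really is (for example with realpath). The function name `missing_local_files` is hypothetical, and the pattern assumes the exact error wording in the log above.

```python
import re

_MISSING = re.compile(r"ERROR:\s+Expected local file does not exist: (.+)$")


def missing_local_files(log_text):
    """Return the distinct paths reported as missing, in first-seen order."""
    seen = []
    for line in log_text.splitlines():
        m = _MISSING.search(line.rstrip())
        if m and m.group(1) not in seen:
            seen.append(m.group(1))
    return seen
```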

Do I also have to write a sites.yml? If so, can you share yours? I've tried several versions of sites.yml, and each one failed during planning.

pegasus: '5.0'
sites:
- name: local
  directories:
  - type: localScratch
    path: /tmp/wf/scratch
    fileServers:
    - url: file:///home/scitech/montage-workflow-v3/scratch
      operation: all

pegasus: '5.0'
sites:
- name: condorpool
  directories:
  - type: localScratch
    path: /tmp/wf/scratch
    fileServers:
    - url: file:///home/scitech/montage-workflow-v3/scratch
      operation: all

$ pegasus-plan \
>         --dir work \
>         --output-site local \
>         --cluster horizontal \
>         data/montage-workflow.yml
2024.11.08 12:53:25.105 UTC: [FATAL ERROR]  
 [1] java.lang.RuntimeException: [DeployWorkerPackage] Unable to determine URL Prefix for the FileServer  for operation put for shared scratch file system on site: local at edu.isi.pegasus.planner.refiner.Engine.complainForHeadNodeURLPrefix(Engine.java:125)

rynge commented 1 week ago

Did you change that file:// location to your location (/home/scitech/montage-workflow-v3/scratch/montage-workflow-v3.sif)?

You should not need a site catalog - the default here is using HTCondor's builtin file transfers.

What version of Pegasus are you using?

junyussh commented 1 week ago

Yes, I changed the image path in montage-workflow.py as shown in my previous comment. The image path is set to file:///local-scratch/scitech/montage-workflow-v3/montage-workflow-v3.sif. Or do you mean I shouldn't include local-scratch in the location?

$ realpath scratch/montage-workflow-v3.sif 
/home/scitech/montage-workflow-v3/scratch/montage-workflow-v3.sif

I also tried changing the path to file:///local-scratch/montage-workflow-v3.sif, because I thought local-scratch might refer to /home/scitech/montage-workflow-v3/scratch/, but I still got a transfer error.

container = None
if tc_target == 'container':
    container = Container('montage',
        Container.SINGULARITY,
        'file:///local-scratch/montage-workflow-v3.sif',
        image_site="local"
        ).add_env(MONTAGE_HOME='/opt/Montage')
    tc.add_containers(container)

My Pegasus version is 5.0.8.

$ pegasus-version 
5.0.8

rynge commented 1 week ago

I mean that /local-scratch/scitech/montage-workflow-v3/montage-workflow-v3.sif was the path on my system. You have to replace it with the path on your system (/home/scitech/montage-workflow-v3/montage-workflow-v3.sif).
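
One way to avoid copying a path from another machine is to derive the file:// URL from the file on disk, and to fail at generation time if the file is missing. This is a minimal sketch using only the standard library; the helper name `local_image_url` is hypothetical and not part of the Pegasus API, and the commented Container call simply mirrors the diff posted earlier in this thread.

```python
from pathlib import Path


def local_image_url(path):
    """Return a file:// URL for an existing local image, or raise FileNotFoundError."""
    resolved = Path(path).expanduser().resolve()
    if not resolved.is_file():
        raise FileNotFoundError(f"container image not found: {resolved}")
    return resolved.as_uri()


# Hypothetical use inside build_transformation_catalog() in montage-workflow.py:
#
#   container = Container('montage',
#       Container.SINGULARITY,
#       local_image_url('montage-workflow-v3.sif'),
#       image_site="local"
#       ).add_env(MONTAGE_HOME='/opt/Montage')
```

With this, a wrong location is reported when the workflow is generated, rather than after several pegasus-transfer retries.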