riga / law

Build large-scale task workflows: luigi + job submission + remote targets + environment sandboxing using Docker/Singularity
http://law.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Chained HTCondor tasks #193

Open TheRealLoliges486 opened 4 days ago

TheRealLoliges486 commented 4 days ago

Question

Hello,

What is the recommended way to run tasks as an HTCondor workflow when they depend on other tasks that are themselves HTCondor workflows?

Concretely, I have a task called FTest which has subtasks FTestCategory. The latter must run with HTCondor. FTestCategory has a requirement called Trees2WS which again consists of subtasks Trees2WSSingleProcess which should run on HTCondor as well.

Now, when I execute law run FTest --workers 4, law creates the Condor submission for FTestCategory and, on the respective node, the Condor submission for Trees2WSSingleProcess. This ultimately fails, since on LXPLUS the Condor nodes themselves cannot access the schedd.

The resulting error is this one, which I guess is due to the inaccessibility of the schedd on Condor nodes:

Traceback (most recent call last):
  File "/afs/cern.ch/user/n/niharrin/cernbox/PhD/Higgs/CMSSW_14_1_0_pre4/src/flashggFinalFit/law/install_dir/lib/python3.9/site-packages/luigi/worker.py", line 210, in run
    new_deps = self._run_get_new_deps()
  File "/afs/cern.ch/user/n/niharrin/cernbox/PhD/Higgs/CMSSW_14_1_0_pre4/src/flashggFinalFit/law/install_dir/lib/python3.9/site-packages/luigi/worker.py", line 138, in _run_get_new_deps
    task_gen = self.task.run()
  File "/afs/cern.ch/user/n/niharrin/cernbox/PhD/Higgs/CMSSW_14_1_0_pre4/src/flashggFinalFit/law/install_dir/lib/python3.9/site-packages/law/workflow/remote.py", line 628, in run
    return self._run_impl()
  File "/afs/cern.ch/user/n/niharrin/cernbox/PhD/Higgs/CMSSW_14_1_0_pre4/src/flashggFinalFit/law/install_dir/lib/python3.9/site-packages/law/workflow/remote.py", line 700, in _run_impl
    self.submit()
  File "/afs/cern.ch/user/n/niharrin/cernbox/PhD/Higgs/CMSSW_14_1_0_pre4/src/flashggFinalFit/law/install_dir/lib/python3.9/site-packages/law/workflow/remote.py", line 882, in submit
    job_ids, submission_data = self._submit_group(submit_jobs)
  File "/afs/cern.ch/user/n/niharrin/cernbox/PhD/Higgs/CMSSW_14_1_0_pre4/src/flashggFinalFit/law/install_dir/lib/python3.9/site-packages/law/contrib/htcondor/workflow.py", line 190, in _submit_group
    c, p = job_id.split(".")
AttributeError: 'Exception' object has no attribute 'split'

How do you handle such chained HTCondor workflows?

Thanks a lot!!

riga commented 4 days ago

Hi,

Two things before going into the details of the workflow -> task -> workflow pattern.

  1. The error you are seeing is a bug that we also stumbled upon recently. I will hopefully have time late next week to debug this further. It's quite elusive and seems to appear only in edge cases (at least on our end).

  2. To make sure I understand, is this the situation you want to achieve? (workflows have a purple border)

flowchart TD
    %% aliases
    ftest(FTest)
    ftestcat1[FTestCategory]
    ftestcat2[FTestCategory]
    t2ws1(Trees2WS)
    t2ws2(Trees2WS)
    t2wss11[Trees2WSSingleProcess]
    t2wss12[Trees2WSSingleProcess]
    t2wss21[Trees2WSSingleProcess]
    t2wss22[Trees2WSSingleProcess]

    %% styles
    classDef WF stroke: #83b, stroke-width: 3px

    %% assign styles
    class ftestcat1 WF
    class ftestcat2 WF
    class t2wss11 WF
    class t2wss12 WF
    class t2wss21 WF
    class t2wss22 WF

    %% actual graph
    ftest --> ftestcat1
    ftest --> ftestcat2
    ftestcat1 --> t2ws1
    ftestcat2 --> t2ws2
    t2ws1 --> t2wss11
    t2ws1 --> t2wss12
    t2ws2 --> t2wss21
    t2ws2 --> t2wss22

If not, feel free to change the graph and paste it here in GH in a ```mermaid code box.
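
On the first point, the last frame of the traceback shows job_id.split(".") being called on an Exception instance. Below is a minimal sketch of how that can arise and how it could be guarded against. It assumes the submission step returns, per job, either an ID string of the form cluster.process or an exception object for a failed submission. The helper split_job_ids and the sample values are hypothetical and are not law's actual code.

```python
def split_job_ids(job_ids):
    """Split "cluster.process" strings and collect failed submissions separately.

    Assumption: each entry is either an ID string or an Exception instance.
    """
    parsed, errors = [], []
    for job_id in job_ids:
        if isinstance(job_id, Exception):
            # calling .split() here would raise the AttributeError from the traceback
            errors.append(job_id)
            continue
        cluster, process = job_id.split(".")
        parsed.append((int(cluster), int(process)))
    return parsed, errors


# hypothetical submission result with one failed job
results = ["1234.0", RuntimeError("schedd not reachable"), "1234.1"]
parsed, errors = split_job_ids(results)
print(parsed)
print([str(e) for e in errors])
```

Separating failed entries before parsing keeps the original submission error visible instead of masking it with an AttributeError.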

TheRealLoliges486 commented 4 days ago

Hi,

Yes, for now this is the situation I want to achieve. Ideally, Trees2WS should run only once per execution of law (as it produces all the ingredients for FTestCategory).
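
The "only once" requirement can be pictured as a dependency graph in which both FTestCategory branches point at the same Trees2WS node. Below is a minimal sketch in plain Python, not law or luigi code. The deps dictionary, the branch suffixes _0 and _1, and the run_order helper are hypothetical. It visits every task exactly once, dependencies first.

```python
# hypothetical dependency map: task name -> list of required task names
deps = {
    "FTest": ["FTestCategory_0", "FTestCategory_1"],
    "FTestCategory_0": ["Trees2WS"],
    "FTestCategory_1": ["Trees2WS"],
    "Trees2WS": ["Trees2WSSingleProcess_0", "Trees2WSSingleProcess_1"],
    "Trees2WSSingleProcess_0": [],
    "Trees2WSSingleProcess_1": [],
}


def run_order(root, deps):
    """Return tasks in execution order, each listed once, dependencies first."""
    order, seen = [], set()

    def visit(task):
        if task in seen:
            return  # shared requirement, already scheduled
        seen.add(task)
        for dep in deps[task]:
            visit(dep)
        order.append(task)

    visit(root)
    return order


order = run_order("FTest", deps)
print(order)
```

Because both FTestCategory entries refer to the same Trees2WS key, the traversal schedules it a single time, which is the behaviour described above.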