riga / law

Build large-scale task workflows: luigi + job submission + remote targets + environment sandboxing using Docker/Singularity
http://law.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Why is workflow_requires needed? #182

Closed: solo-driven closed this issue 1 month ago

solo-driven commented 1 month ago

Question

ALL EXAMPLES ARE RUN LOCALLY

In the example for HTCondor (https://github.com/riga/law/blob/master/examples/sequential_htcondor_at_cern/analysis/tasks.py), workflow_requires is used, which results in the following graph: [image]

Scheduled 45 tasks of which:

But when I simply comment it out, I get the following, clearer graph: [image]

Scheduled 39 tasks of which:

Does this change anything, other than the number of scheduled tasks decreasing from 45 to 39 when workflow_requires is not provided?

In addition, it is also possible, by changing requires and run to:

def requires(self):
    # require CreateChars for each index referred to by the branch_data of _this_ instance

    return CreateChars.req(self, branches=self.branch_data, branch=-1)

def run(self):
    # gather characters and save them
    alphabet = ""
    for inp in self.input()['collection'].targets.values():
        alphabet += inp.load()["char"]
  ....

to obtain the following graph: [image]

And finally, the result that I was expecting to see: [image]

can be obtained by changing CreateFullAlphabet:

def requires(self):
    return CreatePartialAlphabet.req(self)

def run(self):
    # loop over all targets holding partial alphabet fractions and concat them
    inputs = self.input()["collection"].targets
    parts = [
        inp.load().strip()
        for inp in inputs.values()
    ]
    alphabet = "-".join(parts)

I would really appreciate it if you could help me with this. I struggled a lot trying to find the reason for workflow_requires. Thank you for reading.

solo-driven commented 1 month ago

*Updated the URL to the example

riga commented 1 month ago

Hi @solo-driven ,

in general, workflow_requires() is meant to define the requirements of a workflow itself. These requirements are resolved before any of the actual (branch) tasks run.

To understand this concept, one should distinguish between local and remote workflows (those that can submit jobs to (e.g.) batch systems), that work slightly differently in the way they initiate their branch tasks. For this, it is imperative to differentiate between the run() method you define on task level (belonging to the branch task), and the run() method of the workflow (encapsulated by the so-called workflow_proxy in the background).

Remote workflows have a run() implementation that sends jobs to batch systems. Each job then executes one or more law tasks with the exact command you used to start the workflow, with the addition of the corresponding --branch N parameter(s).
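As a rough illustration of the per-job command described above, the following plain-Python sketch appends a --branch N parameter to the original command line. The helper build_job_commands and the example command (including --version v1) are hypothetical and not part of law; law's actual job creation is more involved and may place several branches in one job.

```python
import shlex


def build_job_commands(original_cmd, branches):
    """Return one command string per branch (hypothetical helper, not law API)."""
    base = shlex.split(original_cmd)
    # each job re-runs the original command, pinned to a single branch
    return [shlex.join(base + ["--branch", str(b)]) for b in branches]


# assumed example command; the task name is taken from the example discussed here
cmds = build_job_commands("law run CreateChars --version v1", [0, 1, 2])
for cmd in cmds:
    print(cmd)
```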

Usually, before jobs can be submitted, one needs to make sure that certain conditions are met, e.g., that certain software is pre-bundled and provided to the batch system (for those that need that). This is exactly where the workflow_requires() method is important. These conditions can be modeled with tasks (for the software case just mentioned, it could be a task UploadSoftware), which one would typically want to declare as a dependency. However, it is a dependency of the workflow, not of each individual branch task.
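The distinction can be sketched with a toy model in plain Python. This is not law's implementation: ToyWorkflow and its log format are made up for illustration, and only the names UploadSoftware and CreateChars come from this thread. It merely shows a workflow-level requirement being resolved once, before any branch task and its own requirements are handled.

```python
class ToyWorkflow:
    """Toy stand-in for a workflow; an assumption for illustration, not law code."""

    def __init__(self, n_branches):
        self.n_branches = n_branches
        self.log = []

    def workflow_requires(self):
        # needed once by the workflow as a whole
        return ["UploadSoftware"]

    def requires(self, branch):
        # needed by each individual branch task
        return [f"CreateChars(branch={branch})"]

    def run(self):
        # 1. workflow-level requirements are resolved first, exactly once
        for req in self.workflow_requires():
            self.log.append(f"resolve {req}")
        # 2. only then are the branch tasks and their requirements processed
        for branch in range(self.n_branches):
            for req in self.requires(branch):
                self.log.append(f"resolve {req}")
            self.log.append(f"run branch {branch}")
        return self.log


log = ToyWorkflow(n_branches=3).run()
print("\n".join(log))
```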

Local workflows often don't need these extra dependencies that ensure that branch tasks can be run, since you're already in the correct environment. However, you are free to declare them regardless if it fits your use case. There is even a parameter predefined on all workflows, --pilot, whose value you can use in your implementation of workflow_requires() to dynamically add or remove certain workflow requirements. But again, it's fully up to you if you make use of that.
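A minimal sketch of that idea, again in plain Python rather than law itself: the function below stands in for a workflow_requires() implementation and takes the pilot value as an argument. Which requirement is dropped when pilot is set is purely an assumption made for this example.

```python
def workflow_requires(pilot):
    """Toy version of a workflow_requires() that reacts to a pilot flag."""
    # always needed before any branch can run (name taken from the thread)
    reqs = {"software": "UploadSoftware"}
    if not pilot:
        # assumed optional requirement, skipped when the pilot flag is set
        reqs["chars"] = "CreateChars"
    return reqs


full_reqs = workflow_requires(pilot=False)
pilot_reqs = workflow_requires(pilot=True)
print(sorted(full_reqs), sorted(pilot_reqs))
```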

Side note: have a look at how local workflows trigger their branch tasks. There are two options: declare as dependency, or yield as dynamic dependencies (which is a luigi pattern).


That being said, all your example cases are valid and the actual decision of what you declare as a workflow requirement is a design choice you are free to make.

solo-driven commented 1 month ago

But why did you use workflow_requires for branch manipulation in that example? As you said, it is for controlling the dependencies of the whole workflow, like setting up an environment. (I read your last comment, so probably it is not the best example for it?)

Also, I noticed that controlling the branches parameter of any dependent workflow is only possible in workflow_requires and not in requires. Can you explain why?

riga commented 1 month ago

> (I read your last comment, so probably it is not the best example for it?)

Yeah, it probably is not a good example. The linked task is the proxy that lives underneath the workflow and that implements the actual run(), requires() and output() methods that take effect in case a task is a workflow (branch == -1).

> Also, I noticed that controlling the branches parameter of any dependent workflow is only possible in workflow_requires and not in requires. Can you explain why?

The branches (plural) parameter is only a feature of the workflow itself. For specific branch tasks, setting this value has no meaning (since a branch does not have branches of its own).

solo-driven commented 1 month ago

One last question: are there any performance differences depending on the way I "build a dependency tree", as in the examples above? In the end we get ~30 tasks, which will be distributed across workers, right?

And when I specify branches, for instance 1:5, will that workflow count as a single task, or will all the branches be distributed among workers? If the former is true, then there should really be no difference at all in the way we build the tree.

riga commented 1 month ago

The workflow itself will count as a single yet separate task in the tree whose only "payload" is to trigger its branch tasks (either via static or dynamic requirements). All branch tasks will be distributed across --workers in any case, so there shouldn't be any performance difference (except for a very small one during tree building at the very beginning).
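To make the counting concrete, here is a toy sketch in plain Python. A thread pool stands in for the --workers mentioned above, and run_branch and run_workflow are hypothetical names; this is not how luigi or law schedule tasks internally. It only shows that the workflow adds one extra node while the branch payloads are spread over the available workers.

```python
from concurrent.futures import ThreadPoolExecutor


def run_branch(branch):
    # placeholder payload of a single branch task
    return f"branch {branch} done"


def run_workflow(branches, workers):
    # the workflow is one extra node whose only job is to trigger its branches
    n_tasks = 1 + len(branches)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the order of the input branches
        results = list(pool.map(run_branch, branches))
    return n_tasks, results


n_tasks, results = run_workflow(branches=[1, 2, 3, 4], workers=2)
print(n_tasks, results)
```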