riga / law

Build large-scale task workflows: luigi + job submission + remote targets + environment sandboxing using Docker/Singularity
http://law.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Slurm workflows don't work when using multiple workers #126

Closed lmoureaux closed 1 year ago

lmoureaux commented 2 years ago

I'm running a task chain that involves a Slurm workflow. I would like to run it with multiple workers to parallelize some parts of the chain that benefit from parallelization but are too short to benefit from a Slurm job.

Unfortunately, when running with multiple workers I have the following issues:

riga commented 2 years ago

Hey @lmoureaux ,

This should in general work, of course, so it sounds like a minor misconfiguration of your tasks / parameters.

> Two instances of the same workflow (with different parameters) are not considered separate

Are the parameters significant, and do they lead to different output target paths? It sounds like different parameters are mapped to the same task instance or the same output files.

lmoureaux commented 1 year ago

Indeed, both instances seem to work fine if I modify CreateAlphabet to support Greek and Latin. Must be on my side, sorry

lmoureaux commented 1 year ago

Reopening because I now have a minimal working example (MWE) demonstrating the bug with CreateAlphabet. You can find my modifications to the example here.

The steps to reproduce are as follows:

  1. Run law run CreateAlphabet --version 1
  2. Wait one minute until law polls the jobs for status
  3. From another terminal, run law run CreateAlphabet --version 1 --uppercase

The first (lowercase) alphabet works. The second (uppercase) alphabet doesn't submit jobs and eventually fails. The underlying reason is that data/CreateChars/1/slurm_submission_0To26.json is shared between the tasks. This can be worked around using a dynamic slurm_output_directory(), but having to do this manually would be annoying.

I think it would make sense to avoid such conflicts in the core.
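The collision can be sketched in plain Python (illustrative names only, not law's actual implementation): when the submission file path is derived solely from the version and the branch range, both parameterizations of the workflow resolve to the same file.

```python
import os

def submission_file(task_family, version, branch_min, branch_max):
    # path derived only from the version and the branch range, as in
    # data/CreateChars/1/slurm_submission_0To26.json -- the task's other
    # parameters (e.g. --uppercase) do not enter the path at all
    return os.path.join(
        "data", task_family, version,
        "slurm_submission_{}To{}.json".format(branch_min, branch_max))

lower = submission_file("CreateChars", "1", 0, 26)  # lowercase run
upper = submission_file("CreateChars", "1", 0, 26)  # uppercase run
print(lower == upper)  # True: both runs share one submission file
```

Since the second run reads the first run's submission data, it never submits its own jobs and eventually fails.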

msommerh commented 1 year ago

We worked around this issue on our end by overriding the slurm_output_directory method of law.slurm.SlurmWorkflow, using the luigi task ID to create a dedicated subdirectory. The submission JSON files then no longer interfere. It might still be worth considering whether something like this should become the default upstream. E.g.:

import os
import luigi
import law

class SlurmWorkflow(law.slurm.SlurmWorkflow):

    [...]

    def slurm_output_directory(self):
        # the directory where submission metadata is stored; include the
        # luigi task id so that differently parameterized workflows write
        # to separate subdirectories
        task_family = self.get_task_family()
        task_params = self.to_str_params(only_significant=True)
        task_id = luigi.task.task_id_str(task_family, task_params)
        return law.LocalDirectoryTarget(
            os.path.join(self.local_path(), "slurm", task_id))

riga commented 1 year ago

In order to prevent clashes between differently parameterized workflows and subsequent interference of outputs, there is a slurm_output_postfix() method,

https://github.com/riga/law/blob/221306af1a0522f457f4b71b6da6a2aa5e1e568f/law/contrib/slurm/workflow.py#L198-L199

that can be overridden. It is used here,

https://github.com/riga/law/blob/221306af1a0522f457f4b71b6da6a2aa5e1e568f/law/workflow/remote.py#L480-L489

and defines the default postfix that is appended to workflow outputs.
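A pure-Python sketch (not law's actual API; names are illustrative) of the direction discussed here: derive the postfix from the task family and its stringified parameters, similar in spirit to luigi.task.task_id_str, so differently parameterized workflows get distinct submission files.

```python
import hashlib

def param_postfix(task_family, str_params):
    # hash the task family plus its sorted, stringified parameters so
    # that differently parameterized workflows end up with distinct
    # submission files (similar in spirit to luigi.task.task_id_str)
    canonical = task_family + "|" + "|".join(
        "{}={}".format(key, str_params[key]) for key in sorted(str_params))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()[:10]

lower = param_postfix("CreateChars", {"version": "1", "uppercase": "False"})
upper = param_postfix("CreateChars", {"version": "1", "uppercase": "True"})
print(lower != upper)  # True: the two runs no longer collide
```

The digest is deterministic, so resubmissions of the same parameterization still find their existing submission data.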

Closing now, but of course feel free to re-open if you have follow-up questions.

lmoureaux commented 1 year ago

I would suggest changing the default to include task_id_str(), because the current one leads to collisions, as demonstrated in my example; get_branches_repr() isn't enough.