iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0

repro: add scheduler for parallelising execution jobs #755

Open yukw777 opened 6 years ago

yukw777 commented 6 years ago

When I try to run multiple dvc run commands, I get the following error:

$ dvc run ...
Failed to lock before running a command: Cannot perform the cmd since DVC is busy and locked. Please retry the cmd later.

This is inconvenient because I'd love to run multiple experiments together using dvc. Is there any way we can be smarter about locking?

Eisbrenner commented 2 years ago

Hi, I may add another use case here. I partly work on an HPC cluster that uses SLURM. With SLURM, I can and should define task dependencies directly when queuing jobs.

Assuming a graph like this:

A --> B --> C
  \-> D

I can schedule jobs of the form

#!/bin/bash
#slurm settings

dvc repro dvc.yaml:stageA

with

#!/bin/bash

# queue job A with no dependencies
stdout=$(sbatch <kwargs> job_A.sh)
id_A=${stdout##* } # the job ID is the last word of sbatch's output ("Submitted batch job <id>")

# queue job B with id_A as dependency
stdout=$(sbatch <kwargs> --dependency=afterok:$id_A job_B.sh)
id_B=${stdout##* }

# queue job C with id_B as dependency
sbatch <kwargs> --dependency=afterok:$id_B job_C.sh

# queue job D with id_A as dependency
sbatch <kwargs> --dependency=afterok:$id_A job_D.sh

So here I can run the chain B --> C and job D in parallel on different nodes or otherwise separate units.

The only thing missing is a way to disable the dvc-repo lock.
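The chaining above can be wrapped in a small helper so the job-ID parsing isn't repeated per job. A sketch, assuming sbatch's default output format "Submitted batch job <id>" (the job scripts are illustrative):

```shell
#!/bin/bash
# Hypothetical helper around sbatch: submit a job and print only its job ID,
# so stage dependencies can be chained programmatically.
submit() {
  local out
  out=$(sbatch "$@") || return 1
  echo "${out##* }"  # the last word of sbatch's output is the job ID
}

# Usage sketch for the DAG above:
# id_A=$(submit job_A.sh)
# id_B=$(submit --dependency="afterok:$id_A" job_B.sh)
# submit --dependency="afterok:$id_B" job_C.sh >/dev/null
# submit --dependency="afterok:$id_A" job_D.sh >/dev/null
```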

osma commented 2 years ago

@tweddielin said above:

I appreciate what you guys have already done and hope to see more new functionalities, but this would be the only thing unfulfilled for me to call dvc a real "git + make for data and machine learning project".

I second this 100%! DVC is amazing, but the lack of parallel execution of pipeline stages is disappointing. I'm working on a machine with many cores and it would be much more efficient to be able to use something like make -j8.

I've tried some workarounds. It seems to be possible to run individual stages in parallel, using dvc repro -s <stagename>, as long as you don't start them at the exact same moment, because then you will hit the lock contention problem. I even tried automating parallel execution of pending stages outside DVC using the jq tool and GNU Parallel, like this:

dvc status --json | jq -r 'keys | join("\n")' | parallel dvc repro -s

but this fails because all the parallel dvc repro -s commands try to acquire the lock at nearly the same time and usually only one of them will succeed. Since the lock appears to be held only for a short while, adding a retry loop with a timeout could help, as mentioned above in several comments (there's also a closed issue #2031 where this was suggested).
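The retry-loop idea can be sketched as a small wrapper; this is not part of DVC, and the timeout and stage-selection command are illustrative:

```shell
#!/bin/bash
# Minimal retry-with-timeout wrapper: retry <timeout-seconds> <command...>
# re-runs the command until it exits 0 or the timeout expires.
retry() {
  local timeout=$1; shift
  local start
  start=$(date +%s)
  until "$@"; do
    if (( $(date +%s) - start >= timeout )); then
      echo "retry: timed out after ${timeout}s: $*" >&2
      return 1
    fi
    sleep 1  # brief back-off before the next attempt
  done
}

# Usage sketch: retry each pending stage for up to 5 minutes, 8 at a time
# (with GNU Parallel the function must be exported first: export -f retry)
# dvc status --json | jq -r 'keys | join("\n")' | parallel -j8 retry 300 dvc repro -s
```

The wrapper only papers over the lock contention; the stages still serialize on the repo lock, so it recovers the failed invocations rather than adding real parallelism.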

I've also used a lot of foreach statements and at least for the use cases I can think of, all the iterations are independent from each other. So if it's difficult to schedule parallel execution of the whole pipeline/DAG, at least stages defined using foreach could be executed in parallel.
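For reference, a foreach stage of the kind described might look like this (the stage name, items, and script are illustrative); each expanded instance, e.g. preprocess@a, has no dependency on the others, which is what makes them candidates for parallel execution:

```yaml
stages:
  preprocess:
    foreach:
      - a
      - b
      - c
    do:
      cmd: python preprocess.py ${item}
      outs:
        - data/${item}.out
```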

itcarroll commented 1 year ago

I am wondering whether the awesome dev team plans for the --jobs feature associated with dvc exp run --queue to close this issue? You'd be awesome either way, but maybe more awesome if this issue is still on the table 😉.

Adding a jobs parameter to foreach blocks would be killer.

dberenbaum commented 1 year ago

@itcarroll It's not in our short-term plans for the rest of this year, but it's a highly requested feature, so it's still on the table, and there's no intent to close this issue without addressing it more directly.