Open — yukw777 opened 6 years ago
Hi, I may add another use case here. I work partly on an HPC cluster running SLURM. With SLURM, I can and should define task dependencies directly when queuing jobs.
Assuming a graph like this:

```
A --> B --> C
 \--> D
```
I can schedule jobs of the form

```bash
#!/bin/bash
# SLURM settings
dvc repro dvc.yaml:stageA
```
with

```bash
#!/bin/bash
# queue job A with no dependencies
stdout=$(sbatch <kwargs> job_A.sh)
id_A=${stdout##* }  # the job ID is the last word of sbatch's output sentence
# queue job B with id_A as dependency
stdout=$(sbatch <kwargs> --dependency=afterok:$id_A job_B.sh)
id_B=${stdout##* }
# queue job C with id_B as dependency
sbatch <kwargs> --dependency=afterok:$id_B job_C.sh
# queue job D with id_A as dependency
sbatch <kwargs> --dependency=afterok:$id_A job_D.sh
```
So here I can run B --> C and D in parallel on different nodes or otherwise separated units.
The only thing missing is a way to disable the DVC repo lock.
@tweddielin said above:

> I appreciate what you guys have already done and hope to see more new functionalities, but this would be the only thing unfulfilled for me to call dvc a real "git + make for data and machine learning project".
I second this 100%! DVC is amazing, but the lack of parallel execution of pipeline stages is disappointing. I'm working on a machine with many cores, and it would be much more efficient to be able to use something like `make -j8`.
I've tried some workarounds. It seems to be possible to run individual stages in parallel using `dvc repro -s <stagename>`, as long as you don't start them at the exact same moment, because then you will hit the lock contention problem. I even tried automating parallel execution of pending stages outside DVC using the `jq` tool and GNU Parallel, like this:

```bash
dvc status --json | jq -r 'keys | join("\n")' | parallel dvc repro -s
```

but this fails because all the parallel `dvc repro -s` commands try to acquire the lock at nearly the same time, and usually only one of them will succeed. Since the lock appears to be held only for a short while, adding a retry loop with a timeout could help, as mentioned above in several comments (there's also a closed issue #2031 where this was suggested).
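A minimal sketch of such a retry loop, assuming the lock failure surfaces as a nonzero exit code (the function name and the timeout value are my own):

```shell
#!/bin/bash
# retry_with_timeout TIMEOUT CMD...: rerun CMD until it succeeds or
# TIMEOUT seconds have elapsed, sleeping briefly between attempts.
retry_with_timeout() {
  local timeout=$1; shift
  local start
  start=$(date +%s)
  until "$@"; do
    if [ $(( $(date +%s) - start )) -ge "$timeout" ]; then
      echo "timed out waiting for: $*" >&2
      return 1
    fi
    sleep 1
  done
}
```

With GNU Parallel the function would have to be exported first (`export -f retry_with_timeout`) before something like `parallel retry_with_timeout 300 dvc repro -s` could work.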
I've also used a lot of `foreach` statements, and at least for the use cases I can think of, all the iterations are independent of each other. So if it's difficult to schedule parallel execution of the whole pipeline/DAG, at least stages defined using `foreach` could be executed in parallel.
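Picking those out is easy, since foreach-generated stages share the `stage@item` naming convention; the filtered list could then be fed to GNU Parallel as above. A sketch with made-up stage names (in practice the list would come from `dvc status --json | jq -r 'keys[]'`):

```shell
# Filter a stage list down to foreach-generated stages ("stage@item")
# before fanning them out with GNU Parallel or xargs -P.
stages='prepare
train@small
train@large
evaluate'
printf '%s\n' "$stages" | grep '@'
```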
I am wondering whether the awesome dev team has decided if the `--jobs` feature associated with `dvc exp run --queue` is planned to close this issue? You'd be awesome either way, but maybe more awesome if this issue is still on the table 😉. Adding a `jobs` parameter to `foreach` blocks would be killer.
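A hypothetical sketch of what that could look like in dvc.yaml (this `jobs` key does not exist today; stage and file names are made up):

```yaml
stages:
  train:
    foreach: [small, medium, large]
    jobs: 4          # hypothetical: run up to 4 iterations concurrently
    do:
      cmd: python train.py --size ${item}
      outs:
        - models/${item}
```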
@itcarroll It's not in our short-term plans for the rest of this year, but it's a highly requested feature, so it's still on the table, and there's no intent to close this issue without addressing it more directly.
When I try to run multiple `dvc run` commands, I get the following error:

This is inconvenient b/c I'd love to run multiple experiments together using dvc. Any way we can be smarter about locking?