yukw777 opened 6 years ago
Hi @yukw777 !
I agree, it is inconvenient. For now dvc can only run a single instance per repo, because it has to make sure that your dvc files/deps/outputs don't collide with each other and you don't get race conditions in your project. For the `dvc run` case, all we need to do to enable "multiprocessing" is to make sure that dvcfiles, outputs and dependencies don't collide. That should be pretty easy to implement by creating per-file lock files that dvc will take into account when executing commands. I will rename this issue and add it to our TODO list, but I'm not sure whether or not it is going to make it into the 0.9.8 release. Btw, do you only need multiprocessing for `dvc run`, or for `dvc repro` as well?
If you really need to run a few heavy experiments simultaneously right now, I would suggest a rather ugly but simple workaround of just copying your project's directory and running one `dvc run` per copy. Would that be feasible for you?
Thanks, Ruslan
Hi Ruslan,
I think it makes sense to add multiprocessing for `dvc repro` as well. This will allow me to reproduce multiple experiments at once.
Yeah, I was kind of doing what you suggested already. Thanks again for your speedy response! Very excited to use 0.9.8.
Moving this to 0.9.9 TODO.
Hi guys, is this still on the table? ;)
Hi @prihoda !
It is, we just didn't have time to tackle it yet. With this much demand, we will try to move it up the priority list. :slightly_smiling_face:
Thank you.
Hi @efiop
> For the `dvc run` case, all we need to do to enable "multiprocessing" is to make sure that dvcfiles, outputs and dependencies don't collide. That should be pretty easy to implement by creating per-file lock files that dvc will take into account when executing commands.

So all that is needed for parallel execution is to make sure no running stage's output path is the same as another running stage's input path, right? Two running stages having the same input path should not be a problem. So a lock file would only be created for dependencies and only checked for outputs. Can that be solved using `zc.lockfile`, just like with the global lock file?
@prihoda Yes, I think so :)
Looks like Portalocker provides shared/exclusive locks which could be used as read/write locks: https://portalocker.readthedocs.io/en/latest/portalocker.html
Once the multiprocessing is added, hopefully it can extend to `dvc repro` as well, perhaps controlled via the `--jobs` parameter.
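For illustration, here is a minimal sketch of how shared/exclusive locks could map onto stage dependencies and outputs (readers on deps, a single writer per output). This is not DVC code; the per-path `.lock` files are made up for the example:

```python
# Sketch only: take a shared (read) lock per dependency and an exclusive
# (write) lock per output before running a stage. The per-path ".lock"
# files are hypothetical, not something DVC creates today.
import contextlib
import portalocker

@contextlib.contextmanager
def path_lock(path, flags):
    with open(path + ".lock", "w") as fh:
        portalocker.lock(fh, flags)  # blocks until the lock is granted
        try:
            yield
        finally:
            portalocker.unlock(fh)

@contextlib.contextmanager
def stage_locks(deps, outs):
    with contextlib.ExitStack() as stack:
        for dep in deps:
            # many stages may read the same dependency concurrently
            stack.enter_context(path_lock(dep, portalocker.LOCK_SH))
        for out in outs:
            # only one stage may write a given output at a time
            stack.enter_context(path_lock(out, portalocker.LOCK_EX))
        yield

# Usage:
# with stage_locks(deps=["data/raw.csv"], outs=["data/features.csv"]):
#     run_stage_command()
```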
I followed the steps verbatim on https://dvc.org/doc/tutorial/define-ml-pipeline and in the previous step. This is the official tutorial as far as I can see. And I'm getting a very, very long wait to run `dvc add data/Posts.xml.zip`. I've also tried desperately deleting `.dvc/lock` and `.dvc/updater.lock`, which show up every single time I look, just to see if it does something.
What's going on?
> This is the official tutorial as far as I can see. And I'm getting a very, very long wait to run `dvc add data/Posts.xml.zip`

https://discordapp.com/channels/485586884165107732/485596304961962003/575535709490905101 - NFS locking issue, as discussed on Discord
Related ^^ #1918
I would find this useful, particularly for `dvc repro` as well.
I run my experiments on an HPC cluster, and would like to be able to kick off multiple large, non-overlapping subcomponents of the experiment in parallel. Right now the locking prevents that; also, locking keeps me from performing other operations (such as uploading the results of a previous computation to a remote) while a job is working.
My workaround right now is to create Git worktrees. I have a script that creates a worktree, configures DVC to share a cache with the original repository, checks out the DVC outputs, and then starts the real job. When it is finished, I have to manually merge the results back into my project.
It works, but is quite ugly and tedious.
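Roughly, such a script might look like the sketch below (an approximation of the workaround described above, not the actual script; the paths, branch handling, and the `dvc cache dir` step are assumptions):

```python
# Sketch of the git-worktree workaround: create a worktree, point its DVC
# cache at the main repo's cache, restore outputs, then start the job.
# Paths and arguments below are illustrative assumptions.
import subprocess
from pathlib import Path

def run(*cmd, cwd=None):
    subprocess.run(cmd, cwd=cwd, check=True)

def start_experiment(worktree, branch, job_cmd, main_repo="."):
    run("git", "worktree", "add", worktree, "-b", branch)
    cache = Path(main_repo, ".dvc", "cache").resolve()
    run("dvc", "cache", "dir", str(cache), cwd=worktree)  # share the cache
    run("dvc", "checkout", cwd=worktree)                  # materialize outputs
    run("sh", "-c", job_cmd, cwd=worktree)                # kick off the real job

# start_experiment("../exp-lr-001", "exp-lr-001", "dvc repro train.dvc")
```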
We've had a discussion about some quick-ish ways to help users with this and came up with two possible partial solutions:
1) As a quick general solution, we could implement a `--no-lock` option for `run` and `repro` that would shift the responsibility for making sure that stages don't overlap to the user. This solution seems trivial, except that we would need to revisit the way we acquire the lock for our state db, because currently we take it really early and sit there with it waiting for the command to finish. Seems like it would require a slight adjustment to take the state db lock only when we really need it (self.state.get/save/etc in RemoteBASE).
2) As @shcheklein suggested, another option might be to implement locking per-pipeline, so you could at least run stages in unrelated pipelines. This one is simple, but not as trivial as 1) and would still require solving the state db lock problem. Also, this approach would only work in specific scenarios.
And the final option is to come up with a general solution that would still require solving 1) and would probably look like per-output lock files.
What do you guys think?
I would also appreciate being able to run the pipeline in parallel on HPC.
The `--no-lock` option would definitely be useful as a quick hack. But what is the worst-case scenario if the user runs multiple related stages at the same time?
For the long term, I think it makes sense to implement a "full" solution that allows running any stages in parallel based on the DAG dependencies - with waiting for dependent stages instead of failing.
What if we don't allow running multiple instances of DVC at the same time, but actually implement reproduction of the pipeline in parallel where applicable, based on the DAG dependencies (just like `make --jobs`), within one DVC repro? Then, when running `dvc repro A.dvc B.dvc`, it would resolve the DAG and run both of the stages (and possibly their dependencies) in parallel using an execution plan. This would not support running multiple `dvc run` instances, but that could be resolved by `dvc run --no-exec` and then `dvc repro`.
If that is not possible, could we somehow use SQLite for per-output locking?
From a high-level perspective, the execution plan implementation could actually be really simple, right? Get the set of stages that can be run in parallel from the DAG, run them in a pool, wait for all of them to finish, then repeat for the next set of stages.
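Something along these lines, as a toy sketch of that loop (nothing DVC-specific; `run_stage` and the stage names are placeholders):

```python
# Toy version of the execution plan described above: repeatedly collect
# every stage whose dependencies are all done, run that batch in a pool,
# wait for the batch to finish, and move on. `run_stage` is a placeholder.
from concurrent.futures import ThreadPoolExecutor

def run_stage(name):
    print(f"reproducing {name}")

def repro_parallel(dag, jobs=4):
    """`dag` maps a stage name to the set of stage names it depends on."""
    done = set()
    while len(done) < len(dag):
        ready = [s for s, deps in dag.items() if s not in done and deps <= done]
        if not ready:
            raise RuntimeError("cycle or unsatisfied dependency in the DAG")
        with ThreadPoolExecutor(max_workers=jobs) as pool:
            list(pool.map(run_stage, ready))  # wait for the whole batch
        done.update(ready)

repro_parallel({
    "prepare.dvc": set(),
    "featurize.dvc": {"prepare.dvc"},
    "train.dvc": {"featurize.dvc"},
    "evaluate.dvc": {"train.dvc", "featurize.dvc"},
}, jobs=2)
```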
@prihoda Indeed, a repro-based implementation for the DAG would be pretty simple, though it would still require solving the locking problems, since it would be equivalent to `--no-lock` but with dvc itself ensuring that stages don't overlap. Also, sometimes you might want to run two `dvc repro`s for two branches in your DAG, but not the whole DAG, which we might also be able to support with this approach, but only if the user starts them at the same time with `dvc repro a.dvc b.dvc`, because otherwise they would still run into locks. We might start with implementing `--no-lock` and then make our way to proper full per-output locking, after which we could deprecate `--no-lock`. Or we could aim for the full implementation right now, but we'd need to better estimate its complexity to decide whether we need `--no-lock` in the meantime.
I guess running multiple stages in parallel within the same DVC repro (using a DAG execution plan) and running multiple instances of DVC at the same time (per-file locking) are both relevant for specific use cases and they complement each other, so they could both be implemented. You want to be able to run or reproduce a stage when an unrelated stage is already running, but you also want to be able to reproduce a big pipeline with thousands of stages that could run in parallel on multiple threads or using an HPC grid scheduler.
It seems like some of the lock checking could be relaxed. Currently, if I'm doing a `dvc repro`, we can't run other DVC commands like `dvc status`, or even `dvc status file.dvc` where the file doesn't relate to the running command in any way. Might be a good start.
Is it possible in the meantime to avoid locking when executing `dvc run`? According to the thread it looks like it is, but the docs do not mention any flag for this.
@amjadsaadeh Currently, there is no way to do that. What you can do is run your command without `dvc run`, then create a dvc file with `dvc run --no-exec` and then `dvc commit my.dvc`, so it will save everything without re-running your command.
Some thoughts after reading this thread 28 more times and talking with @shcheklein.
There are 3 possible types of parallelism in DVC (in a single repo):
1. The user runs several independent DVC commands in parallel, e.g. `dvc run some_cmd.sh`, `dvc repro mystage.dvc` and `dvc repro data/pre-process.dvc`. The user is responsible for the level of parallelism (how many commands to run in parallel and which commands). Problem: today the global DVC lock prevents this scenario. A fix is coming in #2584.
2. A single `dvc run`/`repro` can run some stages in parallel. DVC should decide (not the user) which stages to run in parallel. Problem: the user needs to specify a level of parallelism (or an amount of resources) for each stage (100% resource utilization by default), and DVC needs a special, internal scheduler.
3. A single stage describes many "small" commands, and the user specifies a level of parallelism for them (e.g. `--jobs 8`) while DVC runs all the commands within that limitation.

Please let me know if you see any other type or a good example.
For better communication, we can create 3 separate issues, especially once #2584 is merged.
@dmpetrov That sounds like a good breakdown (although I don't fully understand 3).
It would be really nice for 2 to be designed with some abstraction and future extensibility over parallelism interfaces. It doesn't need to do that in V1, but long-term I could see it being very useful for DVC to be able to use an external job scheduler such as SLURM for running the individual stages. A lot of US research academics have access to classical HPC clusters managed with SLURM (or PBS or something like that).
(2) built on multiprocessing or loky would already be super useful, as individual stages often don't need an entire node in my workflows.
@mdekstrand it's a good point about SLURM. I can add Dask here, and let's pretend I forgot to add Hadoop/Yarn to this list :) since it is not fully supported. All of these should definitely be discussed when we start implementing (2).
(3) includes some handy special cases - you can find more in the examples in #331 and #1018. The major difference: (3) is not only about parallel execution, it also kind of generates/extends the DAG. So, a single stage can describe many "small" commands (by some pattern) and a level of parallelism for those commands.
@mdekstrand are you using SLURM in an academic environment? I just met an engineer at the PyData LA conf - they use SLURM in an industry project. I'm wondering how common this is.
@dmpetrov Yes, Boise State's research computing cluster runs on SLURM. My understanding is that it is the current standard for academic clusters (including the big NSF-funded clusters and national labs).
Hi all, @mdekstrand and @dmpetrov. I do not have any strong opinions about DVC and parallelisation; I can just give my point of view. I'm not sure which of the three cases Python Dask falls within, but that is what we use for parallelisation. Each DVC stage contains the code that defines and executes a delayed Dask pipeline on a remote machine. In regards to SLURM, I have used it in academia as well for submitting compute jobs to clusters, during my master's at Aalborg University, Denmark.
@PeterFogh thank you for sharing your opinion!
Could you please share more details about your Dask case? Do you need parallelization to speed up a single job that cannot be done on a single machine? Or do you use Dask to run several different jobs?
@dmpetrov, yes, of course. We benefit from Dask in multiple ways.
A usual use case where both parallelisation questions are in focus is when we featurize our data. Here we can process each feature separately in parallel, but we can also process a single feature in parallel, using chunking of the feature column.
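To make that concrete, the per-stage code has roughly this shape (simplified; the function names and file paths are just examples, not our actual code):

```python
# Simplified shape of one such stage: build delayed tasks per feature
# column and compute them in parallel. Names and paths are illustrative.
import pandas as pd
from dask import delayed, compute

def featurize(df, column):
    return df[column].rank(pct=True)  # stand-in for a real feature transform

def combine(features):
    return pd.concat(features, axis=1)

df = pd.read_csv("data/input.csv")
tasks = [delayed(featurize)(df, col) for col in df.columns]
result, = compute(delayed(combine)(tasks))  # executes the graph in parallel
result.to_csv("data/features.csv")
```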
@efiop should we reopen this issue? As we discussed - only one parallelization scenario out of 3 was implemented.
@dmpetrov Sorry, closed accidentally. I was about to comment here that the first part is merged and released now (0.77.3), so everyone is welcome to give it a shot 🙂
@dmpetrov , maybe changing the title?
At this point, you can have multiple instances of `dvc run` :)
@mroutis renamed. Thanks for the heads up 🙂
Can only 2 `dvc run`/`repro` commands be executed simultaneously? Anything more than 2 hits the dvc lock.
> Can only 2 `dvc run`/`repro` commands be executed simultaneously? Anything more than 2 hits the dvc lock.

@elegz, I couldn't replicate this:

```
dvc run -f 1.dvc 'sleep 1000' &
dvc run -f 2.dvc 'sleep 2000' &
dvc run -f 3.dvc 'sleep 3000' &
```
First we need to make our locks wait instead of erroring out, so that users can start scheduling things themselves. Then we could proceed with in-house repro parallelization.
> > Can only 2 `dvc run`/`repro` commands be executed simultaneously? Anything more than 2 hits the dvc lock.
>
> @elegz, I couldn't replicate this:
> `dvc run -f 1.dvc 'sleep 1000' & dvc run -f 2.dvc 'sleep 2000' & dvc run -f 3.dvc 'sleep 3000' &`

Sometimes I was able to run more than 2 commands, but not any desired number. It seems I've found a workaround: I insert an increasing delay before each dvc command execution (every execution is shifted by 1 sec relative to the previous one, so the N-th command starts N seconds from the beginning). No problems so far.
@elegz The reason is that we don't release the locks right away, and then we take them back after the command is done running. We are discussing making dvc commands wait in those cases, instead of erroring out, as described in my comment above.
> @elegz The reason is that we don't release the locks right away, and then we take them back after the command is done running. We are discussing making dvc commands wait in those cases, instead of erroring out, as described in my comment above.

Yes, that would be great. This scenario is very relevant to my team. Sometimes we run tens of experiments. In the meantime, everyone can use the solution I proposed above.
Nothing particular to add here other than a +1 from me! My workflow usually involves lots and lots of quite small steps, executing on an HPC cluster, so parallelism would be a great help.
I would also benefit very much from parallel execution.
Cross-post from https://github.com/iterative/dvc/issues/3633#issuecomment-736508018
We could go one step further, based on option 1 in https://github.com/iterative/dvc/issues/755#issuecomment-561031849, by letting users specify how jobs/stages are run:
```yaml
# jobs.yaml
jobs:
  - name: extract_data@*
    limit: 1  # 1 means no concurrency; this can be set as the default value
  - name: prepare@*
    limit: 4  # at most 4 concurrent jobs
    env:
      JULIA_NUM_THREADS: 8
  - name: train@*
    limit: 4
    env:
      # if it's a scalar, apply to all jobs
      JULIA_NUM_THREADS: 8
      # if it's a list, iterate one by one
      CUDA_VISIBLE_DEVICES:
        - 0
        - 1
        - 2
        - 3
  - name: evaluate@*
    limit: 3
    env:
      JULIA_NUM_THREADS: 8
      CUDA_VISIBLE_DEVICES:
        - 0
        - 1
        - 2
  - name: summary
```

`extract_data@*` follows the same glob syntax as in #4976.
I have never used SLURM, so I don't know if this is applicable to cluster cases. I feel like dvc should delegate distributed computation tasks to other existing tools.
Do I understand correctly that if one step of the pipeline (e.g. retrieving webpages) requires 100 micro-tasks running in parallel, then this step is a bad fit for doing within dvc?
@dmpetrov describes 3 methods of parallelization in https://github.com/iterative/dvc/issues/755#issuecomment-561031849, and I assume you'd need (2) for this. Let me know if there is a workaround.
@turian The workaround is to simply parallelize it in your code (1 stage in the dvc pipeline), and not create 100 micro-stages in the dvc pipeline.
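For example, the 100 download micro-tasks could live inside a single stage's command, roughly like this (a sketch; `urls.txt`, the output directory, and the fetching logic are made up for illustration):

```python
# Sketch of keeping the fan-out inside one dvc stage: the stage runs this
# script, which fetches all pages concurrently. File names are illustrative.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.request import urlopen

OUT_DIR = Path("data/pages")

def fetch_page(url):
    OUT_DIR.mkdir(parents=True, exist_ok=True)
    name = url.rstrip("/").rsplit("/", 1)[-1] or "index"
    with urlopen(url) as resp:
        (OUT_DIR / f"{name}.html").write_bytes(resp.read())

urls = Path("urls.txt").read_text().split()
with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(fetch_page, urls))
```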
@efiop The issue is that each of the little jobs has a pipeline. Download webpage => extract text without boilerplate => further NLP.
I haven't read the complete conversation, but my 2c here is that while it sounds like an amazing feature, we should probably think about this several times.
We've already had some ongoing performance issues (mainly around file system manipulation and network connectivity) which required pretty sophisticated implementations. These are expensive to develop and maintain. Do we want to also worry about multithreading in this repo?
We can still provide a solution for this, just not necessarily in the DVC product (cc @dberenbaum). There's CML, for example, so you could probably set up a distributed CI system that runs DVC pipelines in parallel (cc @DavidGOrtega?) and adds a bunch of other benefits. If needed, DVC could have small modifications to support this usage, or other alternative solutions (e.g. a plugin/extension independent from this repo). And we can document/blog about setting up systems for parallel execution of DVC pipelines.
That said, if a multithreading refactor could speed up regular operations automatically (more general than just executing commands), that would definitely be a welcome enhancement.
See user insight in https://groups.google.com/u/1/a/iterative.ai/g/support/c/zvJ4WnfTGKM/m/ZOckCIwIEwAJ
Just want to check whether parallel execution is still on the roadmap? By parallel execution, I mean executing stages (jobs) in the defined pipeline (DAG) of a repo in parallel, instead of parallelizing multiple pipelines, just like what you can do with `make -j4` or `luigi --workers=4` for Luigi.
I appreciate what you guys have already done and hope to see more new functionality, but this is the only thing left unfulfilled for me to call dvc a real "git + make for data and machine learning projects".
@tweddielin Thanks for the kind words and your support! It's still something we'd like to do, but a lot of effort right now is focused on data optimization rather than pipeline optimization. Hopefully, we will have a chance to return to pipelines once we complete that work.
@dmpetrov @efiop Let's discuss the priority of this one as we start to think about planning for next quarter. I think we should probably move it to p2 for now as we are unlikely to work on it for at least the rest of this quarter, but we should keep it in mind as a priority for next quarter.
When I try to run multiple `dvc run` commands, I get the following error:

This is inconvenient b/c I'd love to run multiple experiments together using dvc. Is there any way we can be smarter about locking?