iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0

repro: add scheduler for parallelising execution jobs #755

Open yukw777 opened 6 years ago

yukw777 commented 6 years ago

When I try to run multiple dvc run commands, I get the following error:

$ dvc run ...
Failed to lock before running a command: Cannot perform the cmd since DVC is busy and locked. Please retry the cmd later.

This is inconvenient b/c I'd love to run multiple experiments together using dvc. Any way we can be smarter about locking?

efiop commented 6 years ago

Hi @yukw777 !

I agree, it is inconvenient. For now dvc can only run a single instance per repo, because it has to make sure that your dvc files/deps/outputs don't collide with each other and that you don't get race conditions in your project. For the dvc run case, all we need to do to enable "multiprocessing" is to make sure that dvcfiles, outputs and dependencies don't collide. That should be pretty easy to implement by creating per-file lock files that dvc will take into account when executing commands. I will rename this issue and add it to our TODO list, but I'm not sure whether or not it is going to make it into the 0.9.8 release. Btw, do you only need multiprocessing for dvc run or for dvc repro as well?

If you really need to run a few heavy experiments simultaneously right now, I would suggest a rather ugly but simple workaround of just copying your project's directory and running one dvc run per copy. Would that be feasible for you?

Thanks, Ruslan

yukw777 commented 6 years ago

Hi Ruslan,

I think it makes sense to add multiprocessing for dvc repro as well. This will allow me to reproduce multiple experiments at once.

Yeah I was kind of doing what you suggested already. Thanks again for your speedy response! Very excited to use 0.9.8.

efiop commented 6 years ago

Moving this to 0.9.9 TODO.

prihoda commented 6 years ago

Hi guys, is this still on the table? ;)

efiop commented 6 years ago

Hi @prihoda !

It is, we just didn't have time to tackle it yet. With this much demand, we will try to move it up the priority list. 🙂

Thank you.

prihoda commented 5 years ago

Hi @efiop

For the dvc run case, all we need to do to enable "multiprocessing" is to make sure that dvcfiles, outputs and dependencies don't collide. That should be pretty easy to implement by creating per-file lock files that dvc will take into account when executing commands.

So all that is needed for parallel execution is to make sure no running stage's output path is the same as another running stage's input path, right? Two running stages having the same input path should not be a problem. So a lock file would only be created for dependencies and only checked for outputs. Can that be solved using zc.lockfile, just like the global lock file?

efiop commented 5 years ago

@prihoda Yes, I think so :)

prihoda commented 5 years ago

Looks like Portalocker provides shared/exclusive locks which could be used as read/write locks https://portalocker.readthedocs.io/en/latest/portalocker.html
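
If that works, a per-path read/write scheme might look something like this rough sketch (not DVC code; the lock directory and helper names are made up): stages take shared locks on paths they only read and exclusive locks on paths they write, so readers can overlap while writers wait.

# Rough sketch of per-path read/write locking with portalocker.
# The ".locks/" directory and helper names are hypothetical, not DVC's layout.
import os
import portalocker

LOCK_DIR = ".locks"

def _lock_file(path):
    os.makedirs(LOCK_DIR, exist_ok=True)
    return os.path.join(LOCK_DIR, path.replace(os.sep, "_") + ".lock")

def read_lock(path):
    # Shared lock: many stages may depend on (read) the same path at once.
    f = open(_lock_file(path), "a")
    portalocker.lock(f, portalocker.LOCK_SH)
    return f

def write_lock(path):
    # Exclusive lock: only one stage may produce (write) a path at a time,
    # and it has to wait until readers are done.
    f = open(_lock_file(path), "a")
    portalocker.lock(f, portalocker.LOCK_EX)
    return f

# A stage would take read locks on its dependencies and write locks on its
# outputs before running its command, then release them afterwards.
dep = read_lock("data/raw.csv")
out = write_lock("data/features.csv")
try:
    pass  # run the stage command here
finally:
    for f in (out, dep):
        portalocker.unlock(f)
        f.close()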

AlJohri commented 5 years ago

Once the multiprocessing is added, hopefully it can extend to dvc repro as well, perhaps controlled via the --jobs parameter.

mathemaphysics commented 5 years ago

I followed the steps verbatim on https://dvc.org/doc/tutorial/define-ml-pipeline and in the previous step. This is the official tutorial as far as I can see. And I'm getting a very, very long wait to run dvc add data/Posts.xml.zip. I've also tried desperately deleting .dvc/lock and .dvc/updater.lock, which show up every single time I look, just to see if it does something.

What's going on?

shcheklein commented 5 years ago

This is the official tutorial as far as I can see. And I'm getting a very, very long wait to run dvc add data/Posts.xml.zip

https://discordapp.com/channels/485586884165107732/485596304961962003/575535709490905101 - NFS locking issue as was discussed on Discord

ghost commented 5 years ago

Related ^^ #1918

mdekstrand commented 5 years ago

I would find this useful, particularly for dvc repro as well.

I run my experiments on an HPC cluster, and would like to be able to kick off multiple large, non-overlapping subcomponents of the experiment in parallel. Right now the locking prevents that; also, locking keeps me from performing other operations (such as uploading the results of a previous computation to a remote) while a job is working.

My workaround right now is to create Git worktrees. I have a script that creates a worktree, configures DVC to share a cache with the original repository, checks out the DVC outputs, and then starts the real job. When it is finished, I have to manually merge the results back into my project.

It works, but is quite ugly and tedious.
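
Roughly, the script does something like the following (a simplified sketch; the repo paths, branch name, and stage file are placeholders):

# Simplified sketch of the worktree workaround; paths and names are placeholders.
import subprocess

def sh(*cmd, cwd=None):
    # Run a command and fail loudly if it errors.
    subprocess.run(cmd, cwd=cwd, check=True)

main_repo = "/path/to/project"        # original repository
worktree = "/path/to/project-exp1"    # isolated copy for one job
branch = "exp1"
stage = "train.dvc"                   # stage to reproduce in the worktree

# 1. Create a git worktree on its own branch.
sh("git", "worktree", "add", worktree, "-b", branch, cwd=main_repo)

# 2. Point the worktree's DVC at the original cache so outputs are shared.
sh("dvc", "cache", "dir", "--local", main_repo + "/.dvc/cache", cwd=worktree)

# 3. Materialize the existing outputs, then start the real job.
sh("dvc", "checkout", cwd=worktree)
sh("dvc", "repro", stage, cwd=worktree)

# When the job finishes, the results still have to be merged back by hand.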

efiop commented 5 years ago

We've had a discussion about some quick-ish ways to help users with this and came up with two possible partial solutions:

1) As a quick general solution, we could implement a --no-lock option for run and repro that would shift the responsibility for making sure that stages don't overlap to the user. This solution seems trivial, except that we would need to revisit the way we acquire the lock for our state db, because currently we take it really early and we sit there with it waiting for the command to finish. Seems like it would require a slight adjustment to take the state db lock only when we really need it (self.state.get/save/etc in RemoteBASE).

2) As @shcheklein suggested, another option might be to implement locking per-pipeline, so you could at least run stages in unrelated pipelines. This one is simple, but not as trivial as 1) and would still require solving the state db lock problem. Also, this approach would only work in specific scenarios.

And the final option is to come up with a general solution that would still require solving 1) and would probably look like per-output lock files.

What do you guys think?

prihoda commented 5 years ago

I would also appreciate being able to run the pipeline in parallel on HPC.

The --no-lock option would definitely be useful as a quick hack. But what is the worst case scenario if the user runs multiple related stages at the same time?

For the long term, I think it makes sense to implement a "full" solution that allows running any stages in parallel based on the DAG dependencies - with waiting for dependent stages instead of failing.

What if we don't allow running multiple instances of DVC at the same time, but actually implement reproduction of the pipeline in parallel where applicable based on the DAG dependencies (just like make --jobs) within one DVC repro? Then, when running dvc repro A.dvc B.dvc it would resolve the DAG and run both of the stages (and possibly their dependencies) in parallel using an execution plan. This would not support running multiple dvc run instances but that could be resolved by dvc run --no-exec and then dvc repro.

If that is not possible, could we somehow use SQLite for per-output locking?

prihoda commented 5 years ago

From a high-level perspective, the execution plan implementation could actually be really simple, right? Get the set of stages that can be run in parallel from the DAG, run them in a pool, wait for all to finish, repeat on the next set of stages.
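
Something along these lines (a toy sketch with a hand-written DAG and a process pool, not DVC internals; the stage names and run_stage are made up):

# Toy sketch of level-by-level parallel execution over a stage DAG.
from concurrent.futures import ProcessPoolExecutor

# stage -> set of stages it depends on
dag = {
    "prepare.dvc": set(),
    "featurize.dvc": {"prepare.dvc"},
    "train_a.dvc": {"featurize.dvc"},
    "train_b.dvc": {"featurize.dvc"},
    "evaluate.dvc": {"train_a.dvc", "train_b.dvc"},
}

def run_stage(stage):
    # stand-in for actually reproducing the stage
    print(f"reproducing {stage}")
    return stage

def repro_parallel(dag, jobs=4):
    done = set()
    while len(done) < len(dag):
        # every stage whose dependencies are all satisfied can run now
        ready = [s for s, deps in dag.items() if s not in done and deps <= done]
        with ProcessPoolExecutor(max_workers=jobs) as pool:
            done.update(pool.map(run_stage, ready))

if __name__ == "__main__":
    repro_parallel(dag)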

efiop commented 5 years ago

@prihoda Indeed, a repro-based implementation for the DAG would be pretty simple, though it would still require solving the locking problems, since it would be equivalent to --no-lock but where dvc itself is ensuring that stages don't overlap. Also, sometimes you might want to run two dvc repros for two branches in your dag, but not the whole dag, which we might also be able to support with this approach, but only if the user starts them at the same time with dvc repro a.dvc b.dvc, because otherwise they would still run into locks. We might start with implementing --no-lock and then make our way to proper full per-output based locking, after which we could deprecate --no-lock. Or we could aim for the full implementation right now, but we need to better estimate its complexity to decide whether we need --no-lock in the meantime.

prihoda commented 5 years ago

I guess running multiple stages in parallel within the same DVC repro (using a DAG execution plan) and running multiple instances of DVC at the same time (per-file locking) are both relevant for some specific use cases and they complement each other, so they could both be implemented. You want to be able to run or reproduce a stage when an unrelated stage is already running, but you also want to be able to reproduce a big pipeline with thousands of stages that could run in parallel on multiple threads or using an HPC grid scheduler.

kenahoo commented 5 years ago

It seems like some of the lock checking could be relaxed. Currently if I'm doing a dvc repro, we can't run other DVC commands like dvc status, or even dvc status file.dvc where the file doesn't relate to the running command in any way. Might be a good start.

amjadsaadeh commented 5 years ago

Is it possible in the meantime to avoid locking when executing dvc run? According to the thread it looks like it, but the docs do not mention any flag for this.

efiop commented 5 years ago

@amjadsaadeh Currently, there is no way to do that. What you can do is run your command without dvc run, then create a dvc file with dvc run --no-exec and then dvc commit my.dvc, so it will save everything without re-running your command.

dmpetrov commented 4 years ago

Some thoughts after reading this thread 28 more times and talking with @shcheklein.

There are 3 possible types of parallelism in DVC (in a single repo):

  1. Out-of-process parallelism. Run multiple commands at the same time: dvc run some_cmd.sh, dvc repro mystage.dvc and dvc repro data/pre-process.dvc. The user is responsible for the level of parallelism (how many commands to run in parallel and which commands). Problem: today the global DVC lock prevents this scenario. A fix is coming in #2584.
  2. In-process parallelism. dvc run/repro can run some stages in parallel. DVC should decide (not the user) which stages to run in parallel. Problem: the user needs to specify a level of parallelism (or amount of resources) for each stage (100% resource utilization by default) and DVC needs a special, internal scheduler.
  3. Let's call it Semantic parallelism. This is a special case combining (1) and (2). A user specifies a command pattern (a command for each file in a dir #331 or a grid of input parameters #1018) and a level of parallelism (--jobs 8), while DVC runs all the commands within that limit.

Please let me know if you see any other type or a good example.

For better communication, we can create 3 separate issues, especially once #2584 is merged.

mdekstrand commented 4 years ago

@dmpetrov That sounds like a good breakdown (although I don't fully understand 3).

It would be really nice for 2 to be designed with some abstraction for future extensibility over parallelism interfaces. It doesn't need to do that in V1, but long-term I could see it being very useful for DVC to be able to use an external job scheduler such as SLURM for running the individual stages. A lot of US research academics have access to classical HPC clusters managed with SLURM (or PBS or something like that).

(2) built on multiprocessing or loky will already be super useful, as individual stages often don't need an entire node in my workflows.

dmpetrov commented 4 years ago

@mdekstrand it's a good point about SLURM. I can add Dask here, and let's pretend I forgot to add Hadoop/Yarn to this list :) since it is not fully supported. All of these should definitely be discussed when we start implementing (2).

(3) includes some handy special cases - you can find more in the examples #331 and #1018. The major difference is that (3) is not only about parallel execution, it also kind of generates/extends the DAG. So, a single stage can describe many "small" commands (by some pattern) and a level of parallelism for those commands.

dmpetrov commented 4 years ago

@mdekstrand are you using SLURM in an academic environment? I just met an engineer at the PyData LA conf - they use SLURM in an industry project. I'm wondering how common this is.

mdekstrand commented 4 years ago

@dmpetrov Yes, Boise State's research computing cluster runs on SLURM. My understanding is that it is the current standard for academic clusters (including the big NSF-funded clusters and national labs).

PeterFogh commented 4 years ago

Hi all, @mdekstrand and @dmpetrov. I do not have any strong opinions about DVC and parallelisation, I can just give my point of view. I'm not sure which of the three cases Python Dask falls within, but that is what we use for parallelisation. Meaning, each DVC stage contains the code that defines and executes a delayed Dask pipeline on a remote machine. As for SLURM, I have used it in academia as well for submitting compute jobs to clusters, during my master's at Aalborg University, Denmark.

dmpetrov commented 4 years ago

@PeterFogh thank you for sharing your opinion!

Could you please share more details about your Dask case? Do you need parallelization to speed up a single job that cannot be done on a single machine? Or do you use Dask to run several different jobs?

PeterFogh commented 4 years ago

@dmpetrov, yes, of course. We benefit from Dask in multiple ways.

  1. We can send computation to a remote machine or cluster, all nicely hidden away behind a common abstract interface. I think this lies within your question of "you use Dask to run several different jobs". Additionally, we can also test local parallelisation just by changing one line of code.
  2. In regards to parallelisation, we often have too much data to process it sequentially or fit it into local memory. In such cases parallelisation, e.g. using a Dask dataframe, is a big benefit, or we offload the computation to a remote machine with much more memory. I think this lies within your question of "Do you need parallelization to speed up a single job that cannot be done on a single machine".

A usual use case where both parallelisation questions are in focus is when we featurize our data. Here we can process each feature separately in parallel, but we can also process a single feature in parallel, using chunking of the feature column.
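
To give a concrete flavour, a stage body can build one delayed task per feature and compute them all in parallel; switching between local workers and a remote cluster is the one-line Client change mentioned above. This is only a toy sketch with made-up functions and data, not our production code:

# Toy sketch of a DVC stage body that featurizes columns with Dask.
# Functions, data and paths are placeholders.
from dask import compute, delayed
from dask.distributed import Client

@delayed
def featurize(name, column):
    # stand-in for a real per-feature computation
    return name, sum(column) / len(column)

if __name__ == "__main__":
    # Local workers; point this at e.g. Client("tcp://scheduler:8786")
    # to run the same graph on a remote cluster instead.
    client = Client()

    data = {"age": [31, 45, 27], "income": [40000, 52000, 38000]}
    tasks = [featurize(name, col) for name, col in data.items()]
    features = dict(compute(*tasks))  # each feature is computed in parallel
    print(features)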

dmpetrov commented 4 years ago

@efiop should we reopen this issue? As we discussed - only one parallelization scenario out of 3 was implemented.

efiop commented 4 years ago

@dmpetrov Sorry, closed accidentally. Was about to comment here that the first part is merged and released now (0.77.3), so everyone is welcome to give it a shot 🙂

ghost commented 4 years ago

@dmpetrov , maybe changing the title?

At this point, you can have multiple instances of dvc run :)

efiop commented 4 years ago

@mroutis renamed. Thanks for the heads up 🙂

elegz commented 4 years ago

Can only 2 dvc run / repro commands be executed simultaneously? Everything >2 hits the dvc lock.

ghost commented 4 years ago

Can only 2 dvc run / repro commands be executed simultaneously? Everything >2 hits the dvc lock.

@elegz , I couldn't replicate this:

dvc run -f 1.dvc 'sleep 1000' &
dvc run -f 2.dvc 'sleep 2000' &
dvc run -f 3.dvc 'sleep 3000' &

efiop commented 4 years ago

First we need to make our locks wait instead of erroring out, so that users could start scheduling things themselves. Then we could proceed with in-house repro parallelization.

elegz commented 4 years ago

Can only 2 dvc run / repro commands be executed simultaneously? Everything >2 hits the dvc lock.

@elegz , I couldn't replicate this:

dvc run -f 1.dvc 'sleep 1000' &
dvc run -f 2.dvc 'sleep 2000' &
dvc run -f 3.dvc 'sleep 3000' &

Sometimes I was able to run more than 2 commands, but not any desired number. It seems I've found a workaround: I inserted an increasing delay before each dvc command execution (every next execution is shifted by 1 sec relative to the previous one, so the N-th command starts N secs from the beginning). No problems so far.

efiop commented 4 years ago

@elegz The reason is that we don't release the locks right away, and then we take them back after the command is done running. We are discussing making dvc commands wait in those cases, instead of erroring out. As described in my comment above.

elegz commented 4 years ago

@elegz The reason is that we don't release the locks right away, and then we take them back after the command is done running. We are discussing making dvc commands wait in those cases, instead of erroring out. As described in my comment above.

Yes, that would be great. This scenario is very relevant to my team. Sometimes we run tens of experiments. In the meantime, everyone can use the solution I proposed above.

charlesbaynham commented 4 years ago

Nothing particular to add here other than a +1 from me! My workflow usually involves lots and lots of quite small steps, executing on an HPC cluster, so parallelism would be a great help.

janvainer commented 3 years ago

I would also benefit very much from parallel executions

johnnychen94 commented 3 years ago

Cross-post from https://github.com/iterative/dvc/issues/3633#issuecomment-736508018


We could go one step further based on option 1 in https://github.com/iterative/dvc/issues/755#issuecomment-561031849 by letting users specify how jobs/stages are run:

# jobs.yaml
jobs:
  - name: extract_data@*
    limit: 1 # 1 means no concurrency, this can be set as default value
  - name: prepare@*
    limit: 4 # at most 4 concurrent jobs
    env:
      JULIA_NUM_THREADS: 8
  - name: train@*
    limit: 4
    env:
      # if it's a scalar, apply to all jobs
      JULIA_NUM_THREADS: 8
      # if it's a list, iterate one by one
      CUDA_VISIBLE_DEVICES:
        - 0
        - 1
        - 2
        - 3
  - name: evaluate@*
    limit: 3
    env:
      JULIA_NUM_THREADS: 8
      CUDA_VISIBLE_DEVICES:
        - 0
        - 1
        - 2
  - name: summary

extract_data@* follows the same glob syntax as in #4976

I've never used SLURM so I don't know if this is applicable to cluster cases. I feel like dvc should delegate distributed computation tasks to other existing tools.

turian commented 3 years ago

Do I understand correctly that if one step of the pipeline (e.g. retrieving webpages) requires 100 micro-tasks running in parallel, then this step is a bad fit for doing within dvc?

@dmpetrov in https://github.com/iterative/dvc/issues/755#issuecomment-561031849 describes 3 methods of parallelization and I assume you need 2 for this. Let me know if there is a workaround.

efiop commented 3 years ago

@turian The workaround is to simply parallelize it in your code (1 stage in the dvc pipeline), and not create 100 micro-stages in the dvc pipeline.
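
For example, something like the following sketch, where one stage script handles all the micro-tasks itself (the URLs, process_page() and the output path are placeholders, not a prescribed layout):

# Sketch of a single-stage script that handles the 100 micro-tasks itself.
import json
from concurrent.futures import ThreadPoolExecutor

urls = [f"https://example.com/page/{i}" for i in range(100)]

def process_page(url):
    # download => extract text => further processing would go here
    return {"url": url, "length": len(url)}

with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(process_page, urls))

# One dvc stage, one tracked output.
with open("pages.json", "w") as f:
    json.dump(results, f)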

turian commented 3 years ago

@efiop The issue is that each of the little jobs has a pipeline. Download webpage => extract text without boilerplate => further NLP.

jorgeorpinel commented 3 years ago

I haven't read the complete conversation, but my 2c here is that while it sounds like an amazing feature, we should probably think about this several times.

We've already had some ongoing performance issues (mainly around file system manipulation and network connectivity) which required pretty sophisticated implementations. These are expensive to develop and maintain. Do we want to also worry about multithreading in this repo?

We can still provide a solution for this, just not necessarily in the DVC product (cc @dberenbaum). There's CML, for example, so you could probably set up a distributed CI system that runs DVC pipelines in parallel (cc @DavidGOrtega?) plus adds a bunch of other benefits. If needed, DVC could have small modifications to support this usage, or there could be other alternative solutions (e.g. a plugin/extension independent from this repo). And we can document/blog about setting up systems for parallel execution of DVC pipelines.

jorgeorpinel commented 3 years ago

That said, if a multithreading refactor could speed up regular operations automatically (more general than just executing commands), that would definitely be a welcome enhancement.

See user insight in https://groups.google.com/u/1/a/iterative.ai/g/support/c/zvJ4WnfTGKM/m/ZOckCIwIEwAJ

tweddielin commented 2 years ago

Just want to see if parallel execution is still on the roadmap. By parallel execution, I mean executing stages (jobs) in the defined pipeline (DAG) of a repo in parallel, instead of parallelizing multiple pipelines. Just like what you can do with make -j4 or luigi --workers=4.

I appreciate what you guys have already done and hope to see more new functionality, but this is the only thing missing for me to call dvc a real "git + make for data and machine learning projects".

dberenbaum commented 2 years ago

@tweddielin Thanks for the kind words and your support! It's still something we'd like to do, but a lot of effort right now is focused on data optimization rather than pipeline optimization. Hopefully, we will have a chance to return to pipelines once we complete that work.

dberenbaum commented 2 years ago

@dmpetrov @efiop Let's discuss the priority of this one as we start to think about planning for next quarter. I think we should probably move it to p2 for now as we are unlikely to work on it for at least the rest of this quarter, but we should keep it in mind as a priority for next quarter.