iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0

repro: parallel stage execution results in error #10465

Closed AnnKatrinBecker closed 1 week ago

AnnKatrinBecker commented 1 week ago

According to the docs (https://dvc.org/doc/command-reference/repro#parallel-stage-execution), it should be possible to run different arms of the DVC DAG in parallel via individual "dvc repro [arm-target]" calls.

However, if I do this in the same workspace, with 3 parallel arms of the DVC DAG running at the same time, one of them regularly terminates with the following error:

ERROR: Unable to acquire lock. Most likely another DVC process is running or was terminated abruptly.

I guess this is an issue of bad timing, where two of the processes try to acquire the lock at the same time.

Yet, I would expect one process to wait for the lock for a little while before terminating with an error.

Output of dvc doctor:

DVC version: 3.50.1 (pip)
-------------------------
Platform: Python 3.10.12 on Linux-4.18.0-372.103.1.el8_6.x86_64-x86_64-with-glibc2.35
Subprojects:
        dvc_data = 3.15.1
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.4.0
        scmrepo = 3.3.2
Supports:
        http (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.5, aiohttp-retry = 2.8.3)
Config:
        Global: /home/jovyan/.config/dvc
        System: /etc/xdg/dvc
Cache types: symlink
Cache directory: nfs4 on 10.16.232.4:/px_50..
Caches: local
Remotes: local
Workspace directory: nfs4 on 10.16.232.4:/px_50..
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/f4..

dberenbaum commented 1 week ago

Yes, it is an unfortunate limitation. There is some discussion about it in https://github.com/iterative/dvc/issues/755, so I'm going to close this one as a duplicate and suggest you comment there.
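
In the meantime, the "wait a little while" behaviour requested above can be approximated outside DVC. The following is only a minimal sketch of a workaround, not part of DVC: `run_with_lock_retry` is a hypothetical helper that re-runs a command when its output contains the lock error quoted in this issue, and the retry count and delay are arbitrary assumptions. It also assumes the error text appears on stdout or stderr of the `dvc` process.

```python
import subprocess
import time
from typing import Callable, Sequence

# Substring of the error message quoted in this issue.
LOCK_ERROR = "Unable to acquire lock"


def run_with_lock_retry(
    cmd: Sequence[str],
    attempts: int = 5,      # assumption: arbitrary default
    delay: float = 2.0,     # assumption: base wait in seconds
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
    sleep: Callable[[float], None] = time.sleep,
) -> subprocess.CompletedProcess:
    """Run cmd, retrying only when it fails with the lock error."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        result = run(list(cmd), capture_output=True, text=True)
        output = (result.stdout or "") + (result.stderr or "")
        lock_failed = result.returncode != 0 and LOCK_ERROR in output
        # Return on success, on any unrelated failure, or on the last attempt.
        if not lock_failed or attempt == attempts:
            return result
        # Linear backoff: wait a bit longer after each failed attempt.
        sleep(delay * attempt)
    raise AssertionError("unreachable")


# Example (hypothetical stage name "train"):
# outcome = run_with_lock_retry(["dvc", "repro", "train"])
# raise SystemExit(outcome.returncode)
```

Each parallel arm would be launched through such a wrapper instead of calling "dvc repro" directly. Note that a retry starts "dvc repro" for that target again from the beginning, so some work may be repeated; failures unrelated to the lock are returned immediately without retrying.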