iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

repro: stage deps don't support executables in search path #9200

Open sisp opened 1 year ago

sisp commented 1 year ago

Bug Report

Description

DVC doesn't seem to support specifying an executable found in the search path ($PATH) as a stage dependency, which means the stage won't rerun even when the executable has changed.

Stage commands are not always scripts; they can also be other kinds of executables, either locally developed or installed via a third-party package. For instance, I may want to train a YOLO model in one of my DVC stages using the yolo executable of the ultralytics package, so my stage command would be a call to that executable (found in the search path within the virtual environment into which I've installed ultralytics) and not a local script. When the yolo executable changes (or rather, the relevant code behind it, see #9195) because I've updated the ultralytics package, I'd like the stage to rerun.

Reproduce

  1. Run dvc init.
  2. Add a toy executable ./bin/hello with the following content:
    #!/bin/bash
    echo "world"

    Then, make it executable:

    chmod +x ./bin/hello
  3. Add the path ./bin to the search path:
    export PATH="$PWD/bin:$PATH"
  4. Add dvc.yaml with the following content:
    stages:
      test:
        cmd: hello > out.txt
        deps:
          - hello
        outs:
          - out.txt
  5. Run dvc repro and observe the following error:
    ERROR: failed to reproduce 'test': [Errno 2] No such file or directory: '$PWD/hello'

When I omit the deps block, then unsurprisingly the stage won't be rerun after I change the executable:

 #!/bin/bash
-echo "world"
+echo "world v2"

Expected

It should be possible to declare an executable as a dependency via the deps field.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.47.1 (pip)
-------------------------
Platform: Python 3.9.13 on Linux-5.13.0-48-generic-x86_64-with-glibc2.31
Subprojects:
    dvc_data = 0.42.3
    dvc_objects = 0.21.1
    dvc_render = 0.2.0
    dvc_task = 0.2.0
    scmrepo = 0.1.15
Supports:
    http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
    https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/vgubuntu-root
Caches: local
Remotes: https
Workspace directory: ext4 on /dev/mapper/vgubuntu-root
Repo: dvc, git

Additional Information (if any):

A possible solution to the problem might be an extension of the deps syntax like this:

 deps:
   - ./path/to/file
+  - exe: hello

efiop commented 1 year ago

@sisp The problem of depending on executables is fairly complex overall and we don't have a perfect solution for it. Even detecting changes is challenging as it is not clear where to stop (e.g. main script or all libraries and how deep?). That's why we just suggest specifying your script as a dependency if you think it is suitable.

sisp commented 1 year ago

[...] it is not clear where to stop (e.g. main script or all libraries and how deep?).

See #9195 on that topic. In short, DVC should rather err on the side of too much computation than on the side of false cache hits. It's a tradeoff between efficiency and correctness, but correctness certainly outweighs efficiency. But efficiency could be improved in the future.

That's why we just suggest specifying your script as a dependency if you think it is suitable.

I think that's not sufficient because, especially in CI, I cannot force-run a stage ad hoc, and false cache hits will lead to incorrect results. If DVC supported executables in the search path via deps, then the caching behavior would be the same as for script paths with the current cache key computation (the cache key is the content hash of the executable). So there would be no disadvantage in adding support for executables in the search path. And with #9195 implemented, the cache key for Python-based executables could be extended by taking into account the import tree in the same way as it would be done for Python scripts.
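To make the idea concrete, here is a minimal sketch of what "content hash of the executable in the search path" could look like. It is not DVC's implementation: the function name `executable_cache_key` is hypothetical, and SHA-256 is an arbitrary choice of hash for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def executable_cache_key(name: str) -> str:
    """Return a content hash of the executable that `name` resolves to on PATH.

    Hypothetical helper for illustration only; not part of DVC.
    """
    # Look the bare command name up in the directories listed in PATH.
    resolved = shutil.which(name)
    if resolved is None:
        raise FileNotFoundError(f"{name!r} not found in the search path")
    # Hash the file content, so the key changes whenever the executable changes.
    return hashlib.sha256(Path(resolved).read_bytes()).hexdigest()
```

With the toy hello executable from the reproduction steps on the search path, calling `executable_cache_key("hello")` before and after the "world v2" edit should give two different digests, which is the signal a stage would need in order to rerun.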

shcheklein commented 1 year ago

@sisp I think the same workaround is possible as I described in the second ticket. You can try to introduce a stage that runs a custom function / script with a single purpose: calculate the hashes in whatever way you want and write them to a file that your main stage then depends on.
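A rough sketch of such a helper script, under some assumptions: it records installed package versions rather than file hashes, which only catches package upgrades and not local edits to the installed code. The names `fingerprint.py`, `deps.json` and `write_fingerprint` are hypothetical.

```python
import json
import sys
from importlib import metadata


def write_fingerprint(packages, out_path):
    """Write the installed version of each named package to a JSON file."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # record absence instead of failing
    with open(out_path, "w") as fh:
        # Sorted keys keep the file byte-identical when nothing has changed.
        json.dump(versions, fh, indent=2, sort_keys=True)
        fh.write("\n")
    return versions


if __name__ == "__main__" and len(sys.argv) > 2:
    # Hypothetical usage: python fingerprint.py deps.json ultralytics
    write_fingerprint(sys.argv[2:], sys.argv[1])
```

The helper stage would list deps.json under outs and would have to run on every dvc repro (for example by marking it always_changed: true in dvc.yaml), while the main stage lists deps.json under deps and so reruns only when the recorded versions differ.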