iterative / dvc

πŸ¦‰ Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0
13.87k stars 1.19k forks source link

repro: stage cache hit although imported module has changed #9195

Open sisp opened 1 year ago

sisp commented 1 year ago

Bug Report

Description

When defining a stage in dvc.yaml, it is important to declare dependencies via the deps field, so that DVC can leverage its cache to avoid rerunning the stage unnecessarily. Examples in the docs often show stages in which commands use local Python scripts and those scripts are also declared as a dependencies such that changes in those scripts will lead to reruns of the stages. See e.g.:

I think DVC's view on pipeline code organization and dependencies declaration is too narrow and may lead to incorrect stage caching. Let me elaborate.

A Python script is typically not self-contained but imports symbols from other local or third-party Python modules or packages. Especially local imports are subject to (breaking) change and must be declared as stage dependencies, too. In fact, the entire import tree of local imports must be considered for determining whether a stage needs to be rerun. But also third-party dependencies may change in ways that the output of a stage changes without any change in local code, e.g. when the version of a third-party dependency is bumped to a new major version. So ultimately, the entire import tree of a Python script must be considered. Of course, a change anywhere the import tree does not always mean the stage cache must be invalidated because the change may not affect the stage output, so there is a chance of rerunning a stage unnecessarily. But it's better to rerun too often than to miss a relevant change.

Reproduce

  1. Run dvc init.
  2. Add the following dvc.yaml file:
    stages:
      test:
        cmd: python a.py > out.txt
        deps:
          - a.py
        outs:
          - out.txt
  3. Add a.py:

    from b import do
    
    do()
  4. Add b.py:
    def do():
        print("hello")
  5. Run dvc repro and observe that out.txt contains "hello".
  6. Edit b.py to print "hello world" instead.
  7. Run dvc repro and observe that out.txt hasn't changed and DVC prints:
    Stage 'test' didn't change, skipping                               
    Data and pipelines are up to date.

Expected

After editing b.py, dvc repro should have rerun the stage and out.txt should contain "hello world".

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.47.1 (snap)
--------------------------
Platform: Python 3.8.10 on Linux-5.13.0-48-generic-x86_64-with-glibc2.29
Subprojects:
    dvc_data = 0.42.3
    dvc_objects = 0.21.1
    dvc_render = 0.2.0
    dvc_task = 0.2.0
    scmrepo = 0.1.15
Supports:
    azure (adlfs = 2023.1.0, knack = 0.10.1, azure-identity = 1.12.0),
    gdrive (pydrive2 = 1.15.1),
    gs (gcsfs = 2023.3.0),
    hdfs (fsspec = 2023.3.0, pyarrow = 11.0.0),
    http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
    https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
    oss (ossfs = 2021.8.0),
    s3 (s3fs = 2023.3.0, boto3 = 1.24.59),
    ssh (sshfs = 2023.1.0),
    webdav (webdav4 = 0.9.8),
    webdavs (webdav4 = 0.9.8),
    webhdfs (fsspec = 2023.3.0)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/vgubuntu-root
Caches: local
Remotes: https
Workspace directory: ext4 on /dev/mapper/vgubuntu-root
Repo: dvc, git

Solution idea To solve this problem, DVC could detect whether a stage dependency is written in Python, and if so, analyze the import tree using pydeps, based on which a better cache key could be computed that captures changes also in (transitively) imported files.

WDYT?

dberenbaum commented 1 year ago

@sisp Thanks for the input! Doing it by default may be too aggressive as I'm not sure everyone wants to cache the entire dependency tree. There's some discussion and related issues in https://github.com/iterative/dvc/discussions/7378 about this if you want to dig deeper.

sisp commented 1 year ago

@dberenbaum It's not necessarily the entire dependency tree in a stage, it would be the import tree, so a subset of the dependency tree that is used by the script. Every script may use a different subset whose union is the dependency tree. So updating a dependency may affect not all stages.

I think it's important to note that DVC's current behavior may lead to false cache hits. DVC should rather err on the side of too much computation than on the side of false cache hits. It's a tradeoff between efficiency and correctness, but correctness certainly outweighs efficiency.

sisp commented 1 year ago

But I see there might be complications with using pydeps when, e.g., conditional dependencies are used such as different versions of a dependency depending on the used Python version because naively only the actually used version would be used, so the cache key would be sensitive to the used Python version. πŸ€”

sisp commented 1 year ago

But some kind of import tree analysis is also necessary because a locally imported module may not depend on a third-party dependency, so only relying on requirements.txt or similar (as discussed in #7378) would not be sufficient.

sisp commented 1 year ago

@dberenbaum I recognize that it is non-trivial to compute a correct cache key that takes into account imports, complex dependency specifications, multiple supported Python versions etc. But e.g. when running DVC pipelines primarily via CI, the environment is more stable and homogeneous, so computing a sensible cache key would be easier. How about offering an escape hatch for advanced users that allows the user to compute the cache key via a custom command? E.g.:

cache-key: <command> # that prints the cache key

# or

deps:
  - key: <command>

# or

deps:
  key: <command>

# or ...

This way, e.g. a Poetry user can use poetry export -f requirements.txt --only main which prints the content of a requirements.txt only for main dependencies. Another users could use the (postprocessed) pydeps output if that suits their needs.

The cache-key field would be mutually exclusive with deps.

shcheklein commented 1 year ago

Related https://github.com/iterative/dvc/pull/4363 (it was almost done, may be we can reopen and finalize it? it would close a bunch of issues).

As a workaround (in case you haven't considered it yet), I would introduce a stage that does this:

<command> > hash.file

This way you can imitate the custom hash function I think. Would that work for you @sisp ?

sisp commented 1 year ago

Yes, I believe #4363 would be a better solution than introducing a stage that computes the cache key. :+1:

Ultimately, I'd still love to see the cache key computation based on a proper import tree analysis. πŸ˜•

legendof-selda commented 1 year ago

You can allow glob patterns in deps and run something like this - src/**/*.py - to compute the hash.

Substitute this glob pattern in here to generate hash.

git ls-files -sc <glob pattern> | git hash-object --stdin
m-gris commented 1 year ago

Ultimately, I'd still love to see the cache key computation based on a proper import tree analysis. πŸ˜•

+1 for this suggestion.

sisp commented 1 year ago

Just a few more projects I've come across that might be worth looking into for inspiration or to help in implementing a solution:

efiop commented 1 year ago

Related https://github.com/iterative/dvc/issues/2378

m-legrand commented 2 weeks ago

These transitive dependencies can raise from other things than direct imports, such as a change in a pyproject.toml or a setup.cfg. Moreover using something like pydeps would only solve the issue for Python, not other languages such as R. So I think something a bit more bespoke would be necessary.

On the other hand, computing hashes ourselves as stages seems very hacky: it relies on the user's understanding of DVC's internals, and makes the intention behind a pipeline much more opaque. In #10599, I was suggesting a more declarative interface, something like:

dependencies:
    scripts/*.py:
        - pyproject.toml
        - src/**/*.py

How this would then be implemented is still to be determined, but using "virtual stages" with custom hashes as suggested above would definitely be an option. Substituting stage dependencies with themselves and all their explicit dependencies, into dvc.lock, is another.