sisp opened this issue 1 year ago
@sisp Thanks for the input! Doing it by default may be too aggressive as I'm not sure everyone wants to cache the entire dependency tree. There's some discussion and related issues in https://github.com/iterative/dvc/discussions/7378 about this if you want to dig deeper.
@dberenbaum It's not necessarily the entire dependency tree in a stage; it would be the import tree, i.e. the subset of the dependency tree that is used by the script. Every script may use a different subset whose union is the dependency tree, so updating a dependency may not affect all stages.
I think it's important to note that DVC's current behavior may lead to false cache hits. DVC should rather err on the side of too much computation than on the side of false cache hits. It's a tradeoff between efficiency and correctness, but correctness certainly outweighs efficiency.
But I see there might be complications with using `pydeps` when, e.g., conditional dependencies are used, such as different versions of a dependency depending on the Python version in use: a naive analysis would only see the version that is actually imported, so the cache key would be sensitive to the Python version. 🤔 But some kind of import tree analysis is also necessary, because a locally imported module may not depend on a third-party dependency, so relying only on `requirements.txt` or similar (as discussed in #7378) would not be sufficient.
@dberenbaum I recognize that it is non-trivial to compute a correct cache key that takes into account imports, complex dependency specifications, multiple supported Python versions, etc. But when running DVC pipelines primarily via CI, for example, the environment is more stable and homogeneous, so computing a sensible cache key would be easier. How about offering an escape hatch that allows advanced users to compute the cache key via a custom command? E.g.:
```yaml
cache-key: <command> # prints the cache key
# or
deps:
  - key: <command>
# or
deps:
  key: <command>
# or ...
```
This way, e.g., a Poetry user can use `poetry export -f requirements.txt --only main`, which prints the content of a `requirements.txt` file for the main dependencies only. Another user could use the (postprocessed) `pydeps` output if that suits their needs. The `cache-key` field would be mutually exclusive with `deps`.
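For illustration, a stage using the hypothetical `cache-key` field together with the Poetry command might look like this (the stage name and file names are made up, and `cache-key` is not an existing `dvc.yaml` field):

```yaml
stages:
  train:
    cmd: python train.py
    # Proposed, not yet existing: the command's printed output becomes the cache key.
    cache-key: poetry export -f requirements.txt --only main
    outs:
      - model.pkl
```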
Related: https://github.com/iterative/dvc/pull/4363 (it was almost done; maybe we can reopen and finalize it? It would close a bunch of issues).
As a workaround (in case you haven't considered it yet), I would introduce a stage that runs `<command> > hash.file` and declare `hash.file` as a dependency instead for now. This way you can imitate a custom hash function, I think. Would that work for you @sisp?
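A minimal sketch of this workaround, assuming a Poetry project and a made-up `train` stage; `always_changed: true` makes DVC rerun the hash stage on every `dvc repro`:

```yaml
stages:
  compute-hash:
    # Reruns every time, but hash.file only changes when the exported
    # dependency list changes, so downstream stages stay cached otherwise.
    cmd: poetry export -f requirements.txt --only main > hash.file
    always_changed: true
    outs:
      - hash.file:
          cache: false
  train:
    cmd: python train.py
    deps:
      - train.py
      - hash.file  # stands in for the third-party dependency tree
```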
Yes, I believe #4363 would be a better solution than introducing a stage that computes the cache key. :+1:
Ultimately, I'd still love to see the cache key computation based on a proper import tree analysis.
You can allow glob patterns in `deps`, e.g. `- src/**/*.py`, and run something like this to compute the hash. Substitute the glob pattern in here to generate the hash:

```sh
git ls-files -sc <glob pattern> | git hash-object --stdin
```
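For example, substituting the glob from above (only Git-tracked files are listed, and `-s` includes each file's blob hash, so the combined listing changes whenever any matched file's content changes):

```sh
# List mode, blob hash, and path of every tracked Python file under src/,
# then hash that listing into a single key.
git ls-files -sc 'src/**/*.py' | git hash-object --stdin
```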
> Ultimately, I'd still love to see the cache key computation based on a proper import tree analysis.
+1 for this suggestion.
Just a few more projects I've come across that might be worth looking into for inspiration or to help in implementing a solution:
These transitive dependencies can arise from things other than direct imports, such as a change in a `pyproject.toml` or a `setup.cfg`. Moreover, using something like `pydeps` would only solve the issue for Python, not for other languages such as R. So I think something a bit more bespoke would be necessary.
On the other hand, computing hashes ourselves as stages seems very hacky: it relies on the user's understanding of DVC's internals, and makes the intention behind a pipeline much more opaque. In #10599, I was suggesting a more declarative interface, something like:
```yaml
dependencies:
  scripts/*.py:
    - pyproject.toml
    - src/**/*.py
```
How this would then be implemented is still to be determined, but using "virtual stages" with custom hashes as suggested above would definitely be an option.
Substituting stage dependencies with themselves and all their explicit dependencies into `dvc.lock` is another.
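To illustrate the latter with a hypothetical `dvc.lock` excerpt (the file names and hashes are placeholders): only `train.py` would be declared in `dvc.yaml`, but its local import `utils.py` would be expanded into the lock file alongside it:

```yaml
schema: '2.0'
stages:
  train:
    cmd: python train.py
    deps:
      - path: train.py
        md5: 1a2b3c4d...   # placeholder
      - path: utils.py     # expanded implicit dependency
        md5: 5e6f7a8b...   # placeholder
    outs:
      - path: model.pkl
        md5: 9c0d1e2f...   # placeholder
```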
## Bug Report

### Description

When defining a stage in `dvc.yaml`, it is important to declare dependencies via the `deps` field so that DVC can leverage its cache and avoid rerunning the stage unnecessarily. Examples in the docs often show stages whose commands use local Python scripts, with those scripts also declared as dependencies such that changes to them lead to reruns of the stages.

I think DVC's view on pipeline code organization and dependency declaration is too narrow and may lead to incorrect stage caching. Let me elaborate.
A Python script is typically not self-contained but imports symbols from other local or third-party Python modules or packages. Local imports especially are subject to (breaking) changes and must be declared as stage dependencies, too. In fact, the entire tree of local imports must be considered when determining whether a stage needs to be rerun. But third-party dependencies may also change in ways that alter the output of a stage without any change in local code, e.g. when the version of a third-party dependency is bumped to a new major version. So ultimately, the entire import tree of a Python script must be considered. Of course, a change anywhere in the import tree does not always mean the stage cache must be invalidated, because the change may not affect the stage output, so there is a chance of rerunning a stage unnecessarily. But it's better to rerun too often than to miss a relevant change.
### Reproduce

1. Run `dvc init`.
2. Create a `dvc.yaml` file (a plausible reconstruction is sketched after this list).
3. Add `a.py`.
4. Add `b.py`.
5. Run `dvc repro` and observe that `out.txt` contains "hello".
6. Edit `b.py` to print "hello world" instead.
7. Run `dvc repro` again and observe that `out.txt` hasn't changed and DVC reports the stage as unchanged.
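The original file contents are omitted above; a plausible minimal reconstruction (the stage name and exact code are assumptions) is:

```yaml
# dvc.yaml -- note that only a.py is declared as a dependency, not b.py
stages:
  hello:
    cmd: python a.py
    deps:
      - a.py
    outs:
      - out.txt
```

```python
# a.py -- writes the message produced by the locally imported module b
from b import message

with open("out.txt", "w") as f:
    f.write(message())
```

```python
# b.py -- editing this function does not invalidate the stage cache,
# because b.py is not declared in deps
def message():
    return "hello"
```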
### Expected

After editing `b.py`, `dvc repro` should have rerun the stage, and `out.txt` should contain "hello world".

### Environment information
Output of `dvc doctor`:

### Solution idea

To solve this problem, DVC could detect whether a stage dependency is written in Python and, if so, analyze the import tree using `pydeps`, based on which a better cache key could be computed that also captures changes in (transitively) imported files.

WDYT?
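To make the idea concrete, here is a minimal sketch that derives a cache key from a script's import tree. It uses the standard library's `modulefinder` as a stand-in for `pydeps` (an illustrative assumption, not how DVC or `pydeps` actually work); a real implementation would filter standard-library modules and address the conditional-dependency caveats discussed above:

```python
# Sketch: derive a cache key from a script's import tree.
import hashlib
from modulefinder import ModuleFinder

def import_tree_cache_key(script: str) -> str:
    finder = ModuleFinder()
    finder.run_script(script)  # statically walks the import tree
    digest = hashlib.sha256()
    # Sort module names for a deterministic key; skip modules without
    # a source file (built-ins and C extensions).
    for name, module in sorted(finder.modules.items()):
        if module.__file__:
            digest.update(name.encode())
            with open(module.__file__, "rb") as f:
                digest.update(f.read())
    return digest.hexdigest()

if __name__ == "__main__":
    # With the a.py/b.py example above, editing b.py changes the key
    # even though a.py is the only declared dependency.
    print(import_tree_cache_key("a.py"))
```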