Closed PythonFZ closed 3 months ago
Hey, @PythonFZ. Thanks for your question! Dud was intentionally designed to be agnostic of stage and artifact history--instead relying on Git or other SCM tools to manage a history of versions. This dramatically simplifies Dud's design, but as you're pointing out, it complicates historical querying.
Generally, the way I'd approach this in Dud is to use Git to check out previous versions of your stage file(s), and then use Dud to retrieve the stage's artifacts at that point in history. TL;DR: `git checkout <commit> -- stage.yaml`, then `dud checkout/pull stage.yaml`.
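As a rough illustration of that two-step workflow, here is a minimal Python sketch that builds the commands and optionally runs them. The helper names `restore_commands` and `restore` are hypothetical, and running `dud pull` before `dud checkout` when artifacts are not yet in the local cache is an assumption about the intended order, not something Dud itself prescribes here.

```python
import subprocess


def restore_commands(commit, stage_file="stage.yaml", pull=False):
    """Build the commands that restore a stage's artifacts as of `commit`.

    Hypothetical helper: it only assembles argument lists, it runs nothing.
    """
    # Restore the stage file as it existed at the given commit.
    commands = [["git", "checkout", commit, "--", stage_file]]
    if pull:
        # Assumption: artifacts are not in the local cache yet,
        # so fetch them from the remote before checking out.
        commands.append(["dud", "pull", stage_file])
    # Ask Dud to materialize the artifacts the old stage file refers to.
    commands.append(["dud", "checkout", stage_file])
    return commands


def restore(commit, stage_file="stage.yaml", pull=False):
    """Run the commands in order, stopping at the first failure."""
    for argv in restore_commands(commit, stage_file, pull):
        subprocess.run(argv, check=True)
```

Note that `git checkout <commit> -- stage.yaml` overwrites the stage file in both the working tree and the index, so you will want to restore the current version once you are done with the old artifacts.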
IIRC, DVC accomplishes what you're describing by maintaining its own database of stages/artifacts and their checksums over time--essentially duplicating the data stored in Git in a new database for easier querying. Maintaining such a database greatly complicates most core operations for tools like DVC and Dud, and keeping it synchronized with Git is another source of slowdown in DVC.
This is not something I plan to implement in Dud, but if you have a design in mind for how to accomplish what you need, please feel free to propose it.
Is there a way to search the `cache` for previous results from a run and checkout the values?

Assuming deterministic code, each `stage` is fully defined by its `command` and `inputs` (and code). Is it possible to search the local (and remote) cache for the results of a `stage` given a `stage-file.yaml` without the `checksum`?

Within DVC this is bound to the name of the stage, which is based on how the code is written but not necessary at all.
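To make the idea concrete, here is a minimal Python sketch of a lookup keyed only by a stage's command and the checksums of its inputs, independent of the stage's name. Everything in it (`run_key`, `RunCache`, the in-memory index) is hypothetical and for illustration only; it does not describe how Dud or DVC store or query their caches.

```python
import hashlib
import json


def run_key(command, input_checksums):
    """Derive a key from the command and input checksums only.

    `input_checksums` maps input path -> checksum. Items are sorted so
    the key does not depend on dictionary ordering.
    """
    payload = json.dumps(
        {"command": command, "inputs": sorted(input_checksums.items())},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class RunCache:
    """Hypothetical in-memory index from run key to output checksums."""

    def __init__(self):
        self._index = {}

    def record(self, command, input_checksums, output_checksums):
        # Remember which outputs a given (command, inputs) pair produced.
        self._index[run_key(command, input_checksums)] = dict(output_checksums)

    def lookup(self, command, input_checksums):
        # Return the recorded output checksums, or None if never seen.
        return self._index.get(run_key(command, input_checksums))
```

With the output checksums in hand, the artifacts themselves could then be fetched from a content-addressed cache. The sketch assumes the command string and input checksums capture everything that affects the result, which only holds for deterministic code.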