kevin-hanselman / dud

A lightweight CLI tool for versioning data alongside source code and building data pipelines.
https://kevin-hanselman.github.io/dud/
BSD 3-Clause "New" or "Revised" License
183 stars 8 forks

`dud run` look for previous results #213

Closed PythonFZ closed 3 months ago

PythonFZ commented 3 months ago

Is there a way to search the cache for previous results from a run and check out the values?

Assuming deterministic code, each stage is fully defined by its command and inputs (and code). Is it possible to search the local (and remote) cache for the results of a stage given a stage-file.yaml without the checksum?

In DVC, this lookup is bound to the name of the stage, which depends on how the code is written; that binding shouldn't be necessary at all.

kevin-hanselman commented 3 months ago

Hey, @PythonFZ. Thanks for your question! Dud was intentionally designed to be agnostic of stage and artifact history, relying instead on Git or other SCM tools to manage the history of versions. This dramatically simplifies Dud's design, but as you're pointing out, it complicates historical querying.

Generally, I'd approach this problem in Dud by using Git to check out a previous version of your stage file(s), and then using Dud to retrieve the stage's artifacts at that point in history. TL;DR: `git checkout <commit> -- stage.yaml`, then `dud checkout stage.yaml` or `dud pull stage.yaml`.
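To make that workflow concrete, here is a minimal sketch wrapped in a shell function. The names `run`, `restore_stage_at`, and `DRY_RUN` are hypothetical helpers for illustration and are not part of Dud. The fallback from `dud checkout` to `dud pull` assumes that `dud checkout` exits non-zero when the artifacts are missing from the local cache, which I haven't verified here.

```shell
#!/bin/sh
# Hypothetical helper (not part of Dud): restore a stage file from an older
# commit, then ask Dud for the artifacts recorded in that version.
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

restore_stage_at() {
    commit=$1
    stage_file=$2
    # Overwrite the working-tree copy of the stage file with the version
    # recorded in $commit (this also stages the change in the Git index).
    run git checkout "$commit" -- "$stage_file" || return 1
    # Assumption: `dud checkout` fails if the artifacts are not in the local
    # cache, in which case we try the remote cache with `dud pull`.
    run dud checkout "$stage_file" || run dud pull "$stage_file"
}
```

To find a candidate commit, `git log --oneline -- stage.yaml` lists the commits that touched the stage file. Afterwards, `git checkout HEAD -- stage.yaml` followed by `dud checkout stage.yaml` should bring the stage file and its artifacts back to the current version.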

IIRC, DVC accomplishes what you're describing by maintaining its own database of stages/artifacts and their checksums over time, essentially duplicating the data stored in Git in a separate database for easier querying. Maintaining such a database greatly complicates most core operations for tools like DVC and Dud, and keeping it synchronized with Git is another source of slowdown in DVC.

This is not something I plan to implement in Dud, but if you have a design in mind for how to accomplish what you need, please feel free to propose it.