iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0
13.7k stars · 1.18k forks

Preserve timestamps during caching #8602

Closed johnyaku closed 1 year ago

johnyaku commented 1 year ago

Background

DVC pipelines make decisions about whether to execute stages based on the content (checksum) of the dependencies. This is awesome and it is one of the reasons why we are planning to use DVC for top-level pipeline orchestration.

Unfortunately, DVC pipelines lack features found in other workflow managers, such as parallelization and environment switching. This is both a blessing and a curse -- a blessing because it means that DVC pipelines are simple and easy to learn, but a curse because features such as parallelization are central to our existing workflows.

So we are working on using DVC pipelines to coordinate Snakemake workflows. DVC takes care of data integrity, while Snakemake iterates over samples, orchestrates parallel processing, etc.

So far this is going well, at least at the DVC level. But Snakemake makes its decisions about what to execute based on timestamps.

However, when a file is added to a DVC project via dvc add or dvc repro, both the symlink AND the cached data get a new timestamp corresponding to the time of the DVC operation.

As a result, if we tinker with the content of a stage (a Snakemake workflow) we have to re-run the entire stage (workflow) and not just the new bits, unless we fuss around touching timestamps. This is tedious and error-prone, and "rewrites history" by assigning false timestamps.
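The "fussing around" looks something like the following sketch (GNU coreutils; the filenames are hypothetical, and the `cp` merely simulates DVC re-materialising a file with a fresh mtime):

```shell
# Snakemake re-runs rules whose inputs look newer than their outputs.
# After DVC re-creates a dependency, its mtime jumps to "now", so we
# manually copy the old timestamp back from a reference we trust.

echo "reads" > reference.txt
touch -t 202001010000 reference.txt   # the mtime we want to preserve

cp reference.txt sample.fastq         # simulate DVC resetting the mtime
touch -r reference.txt sample.fastq   # restore the original timestamp
```

This works, but it has to be repeated for every dependency after every DVC operation, which is exactly the tedium described above.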

(Of course, if neither the workflow (stage) nor its dependencies have changed, then the entire workflow (stage) is skipped, which is great.)

We prefer the checksum-based execution decisions as in DVC, but we would like to make this compatible with the timestamp-based decisions in Snakemake workflows.

Feature request:

Add an option to dvc add and dvc repro to preserve timestamps.

Specifically, when this option is specified, each file or directory added to a DVC project should keep its original timestamp on both the symlink in the workspace and the actual data in the cache.

If identical data is added later (identical in content, that is), then the timestamps can be updated to match that of the later file.

In addition, add an option to dvc checkout so that the timestamps of the symlinks created in the workspace match those of the target data in the cache.

Together, these two changes should allow DVC and Snakemake to play nicely together :)

Who knows, it might even make sense to make these the default options ...

@dlroden

dberenbaum commented 1 year ago

Hi @johnyaku! Thanks for the request.

We discussed as a team, and it sounds like we could make some improvements on preserving timestamps, but ensuring DVC preserves the same timestamp for any matching content is not a simple fix and not something we are likely to add. I think it could be unexpected to create a new file and see an old timestamp because it matches existing content from the cache. If you want, we can keep this issue open to reduce changing timestamps where possible.

You may already know that we have a longstanding open issue for parallelizing pipelines in https://github.com/iterative/dvc/issues/755, and addressing this is the more likely long-term solution here. The same is true for environment handling, although I hope this one's easier to work around by activating the environments as part of your stage commands.

johnyaku commented 1 year ago

Thanks @dberenbaum for the quick reply. This responsiveness has been a key factor in our decision to put faith in DVC.

On reflection, checking out files with old timestamps might not be appropriate as default behaviour, but it would be very helpful to have this option for use with our Snakemake modules.

FWIW, I think checksum-based execution decisions are far superior to timestamp-based decisions, and it is interesting to note that there is an open issue for Snakemake to implement checksums!

On the other hand, Snakemake's parallelisation, sample iteration and environment-switching features are very mature and we have several legacy workflows that we'd like to leverage. Snakemake workflows port flexibly across K8s and multiple different HPC vendors, as well as stand-alone PCs, and we feel that this platform-independence is a key component of scientific reproducibility. Snakemake also provides helpful reports for optimising resource allocations.

So although we will be watching #755 and the evolution of DVC pipelines with interest, I expect it will be some time before it can replicate the full functionality of more mature workflow managers. And one of the things I like best about DVC pipelines is the simplicity, so I question whether it is really necessary for DVC to duplicate all of these features, especially if it can "hand off" complex tasks like parallelization to Snakemake or Nextflow, etc.

We have given quite a lot of thought to Snakemake-DVC integration, and are happy to share what we have come up with. But at the moment the key obstacle is timestamp management. The behaviour requested in the feature request might not suit all users in all situations, but it would be great to have as an option, perhaps even something that could be specified in .dvc/config so as to apply to all add, repro, pull and checkout operations for a whole project.

johnyaku commented 1 year ago

Outside of Snakemake, we've noticed that some other tools also use timestamp-based sanity checks.

For example, .bam files store compressed genomic data, and are often accompanied by a .bai index file to enable random access. The index is created after the compression is complete, and so downstream tools use timestamps as a sanity check, since an index that is older than its target is likely to be out of date.

Because dvc add etc. don't preserve timestamps, we sometimes have to deal with a LOT of warnings/errors. We can work around this with strategic touching or unprotecting, but it would be so much cleaner if timestamps could simply be preserved.
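As a concrete illustration of the workaround (filenames hypothetical; the first `touch -t` simulates an index whose mtime has fallen behind its target after a checkout):

```shell
# A .bai index older than its .bam triggers "index is older than the
# BAM" warnings in downstream tools. Bumping the index mtime after
# dvc checkout silences the check without changing any content.

echo "bam data" > sample.bam
echo "bai data" > sample.bai
touch -t 202001010000 sample.bai   # simulate a stale index timestamp
touch sample.bai                   # bump to "now": no older than the .bam
```

The content is unchanged, so the "fix" is purely cosmetic -- which is why preserving timestamps in the first place would be cleaner.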

Perhaps timestamps (creation and/or modification times) could be captured in .dvc files as part of dvc add and dvc repro? Then users could have the option to apply this metadata to the symlinks in the workspace and/or the data in the cache.

Or it might be more straightforward to note existing timestamps and then apply them as soon as the links/cached files are created (if requested via an additional option or specified via config).
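The snapshot-then-restore idea can be sketched in a few lines of Python (the function names are mine, not DVC's; this is a minimal sketch of the mechanism, not a proposed implementation):

```python
import os


def snapshot_mtimes(paths):
    """Record (atime, mtime) for each path before it is cached."""
    return {p: (os.stat(p).st_atime, os.stat(p).st_mtime) for p in paths}


def restore_mtimes(stamps):
    """Re-apply the recorded timestamps once the links/cache entries exist."""
    for path, times in stamps.items():
        if os.path.exists(path):
            os.utime(path, times)
```

In the DVC case the restore step would run after caching/linking, so the workspace ends up with the pre-caching timestamps that Snakemake expects.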

efiop commented 1 year ago

@johnyaku Saving timestamps as metadata in dvcfiles is indeed reasonable and would be a generally useful thing to have. Due to some other limitations, right now this can only be implemented for standalone files but not for files inside of dvc-tracked directories (the legacy .dir object format doesn't support that, and we have newer mechanisms that are not yet enabled by default).

Regarding dvc setting the mtime back: this can be done, but it is more involved and conflicts with symlinks and hardlinks, since they share the same inode with the cache, and the same cached file can be used in multiple places with different desired timestamps (though this should be doable with copies and maybe reflinks). There are also limitations like different mtime resolutions on different filesystems (e.g. APFS is notorious for having a 1 sec resolution). Overall, with many caveats, this can be done (somewhat related to how we handle isexec), but it requires working towards a specific scenario (e.g. snakemake, which we are not using). I'm not sure, though, that all the caveats will make it worthwhile to be accepted in the upstream dvc, especially with us having our own pipeline management.

johnyaku commented 1 year ago

Thanks @efiop! I'm very grateful for the consideration you're giving this.

Re filesystem timestamp resolution: I think that if the timestamp gap is small (on the order of a second or two) then the computational cost of re-running that processing will also be small, so I think we can live with it. Also, to clarify our intended use case: in situations where we use a Snakemake workflow to implement a DVC stage, I expect that the entire workflow will run to completion before the outs specified in the DVC stage are cached. That is, DVC caching (and any associated timestamp hacking) will happen after Snakemake has finished running its workflow, so Snakemake will no longer be relying on these timestamps. As a result, timestamp manipulations during caching are unlikely to affect Snakemake during that particular execution. Hopefully it will be possible to snapshot the timestamp of each file and directory prior (*) to caching, and then apply this timestamp to both the link and the cache after (*) caching.

If this were possible, the advantage for us would only be apparent on subsequent reproductions (**). Specifically, if we modify one of the rules in the Snakemake workflow then the workflow will need to run again, since it is itself one of the deps of the parent DVC stage. However, most of the rules within the workflow will probably still be the same, and their intermediate files may not need to be regenerated. If timestamps can be preserved, Snakemake will be able to decide intelligently what needs to be re-run, but currently timestamp re-writing forces the entire workflow to be re-executed, which can sometimes take a couple of days even on high performance hardware with extensive parallelisation.

You make a good point about potential conflicts arising from inconsistent timestamps amongst multiple links to the same cached file. This makes me think that perhaps timestamp preservation should be an "all or nothing" option, specified in the config rather than via options to add, repro, etc. At the risk of introducing further complications, timestamp preservation may need to be extended to remotes as well, in order to ensure consistency between instances on different platforms.

(*) Timing may be critical here in at least two respects: 1) handover between DVC stages, and 2) initiation of subsequent DVC stages that may themselves also be Snakemake workflows, and which include deps generated by an earlier stage. I think everything should be OK so long as the initiation of both 1) and 2) takes place after timestamp restoration, rather than immediately after the earlier stage finishes executing its cmd.

(**) "Subsequent reproductions" includes reproductions in other instances (clones) of the DVC project. A colleague may wish to checkout a project with the express purpose of tweaking one of the workflow-stages, perhaps something as simple as tweaking the formatting of the summary report for that workflow-stage. Ideally they should be able to reproduce the pipeline -- including re-running the tweaked stage but in such a way that only the bare minimum is actually re-executed (regenerating the report in this example).

There is vigorous debate within our group as to whether we should use Snakemake to coordinate multiple workflow modules (while asking Snakemake to dvc add the results as we go) or whether we should use DVC to coordinate multiple workflows (including, occasionally, Nextflow etc). I am strongly advocating for the latter, because I believe that checksum decisions are superior to timestamp decisions, and because dvc.lock ties everything together so beautifully, but timestamp rewriting is proving challenging. I appreciate that DVC has its own ambitions to become a fully mature pipeline manager, but I would like to draw your attention to the fact that most mature workflow managers include "handover" features for integration with other workflow managers. In order to fit into this ecosystem DVC may need to preserve timestamps, or at least offer an option to do so.

efiop commented 1 year ago

@johnyaku Btw, have you tried --no-commit? Maybe it could be a local workaround for you. It will not touch files until you tell it to with dvc commit.
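For reference, the suggested workflow looks roughly like this (the path is hypothetical; dvc commit caches the tracked data later, at a time of your choosing):

```
dvc add --no-commit data/sample.bam   # start tracking, but leave the
                                      # workspace file (and its mtime) alone
# ... run the Snakemake workflow; timestamps are still intact ...
dvc commit                            # cache the data once the run is done
```

Because the file is not moved into the cache until dvc commit, its timestamp in the workspace is untouched during the run.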

johnyaku commented 1 year ago

Thanks @efiop. I hadn't appreciated --no-commit. It looks like it will do what we need within a particular instance (clone) of a project, which is the most common scenario.

There might also be a way to hack --meta (for add) and meta: (in dvc.yaml for repro) to capture timestamp information as we go.