Open nrminor opened 9 months ago
Hi Nick,
I'm so glad to hear you're liking SciDataFlow, and that you're using it in a bunch of projects!
I think I see what you're getting at here, but let me check. SDF currently has `sdf status`, which will indicate changes, but you want a "lock" feature that will raise an error when something changes. Is that right? Should it error whenever any SDF command is run, or just `sdf status`?
Yep, you got it. My original idea for how to implement it was indeed a command-line arg on `sdf status`, e.g.:

```shell
sdf status --locked # exit code 1 if a file has changed
```
That said, though I hadn't considered it before, having it as a global option is appealing too, both from an under-the-hood implementation perspective as well as from a usage perspective.
From the implementation perspective, you could add a boolean `locked` field to your `DataFile` struct and then add/modify some `impl` blocks to handle it:
```rust
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
pub struct DataFile {
    pub path: String,
    pub tracked: bool,
    pub md5: String,
    pub size: u64,
    pub url: Option<String>,
    pub locked: bool, // new locked field
}
```
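To make the idea concrete, here is a minimal sketch of what one of those `impl` blocks could look like. The `verify_locked` method name and the plain-`String` error type are assumptions for illustration, not SciDataFlow's actual API, and the serde derives are omitted to keep the sketch self-contained:

```rust
// Hypothetical sketch: enforcing a `locked` field during a status check.
// `verify_locked` and the String error are invented for illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct DataFile {
    pub path: String,
    pub tracked: bool,
    pub md5: String,
    pub size: u64,
    pub url: Option<String>,
    pub locked: bool,
}

impl DataFile {
    /// Returns Err if the file is locked and its current MD5 no longer
    /// matches the MD5 recorded in the manifest; Ok(()) otherwise.
    pub fn verify_locked(&self, current_md5: &str) -> Result<(), String> {
        if self.locked && current_md5 != self.md5 {
            return Err(format!(
                "locked file '{}' changed (expected md5 {}, found {})",
                self.path, self.md5, current_md5
            ));
        }
        Ok(())
    }
}
```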
From a usage perspective, I could see it being useful like this:
```shell
# initialize the project
sdf init

# add files you'd like to track, with it being important that file3 never changes
sdf add file1
sdf add file2
sdf add --locked file3

# ... run a bunch of other project code to generate reproducible results ...

sdf status # gives an error if the locked file3 was changed
```
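The exit-code behavior above could be sketched roughly as follows. This is a hypothetical helper, not SciDataFlow's real status code; the tuple-based manifest entries are purely illustrative:

```rust
// Hypothetical: walk manifest entries as (path, manifest_md5, locked,
// current_md5) tuples and return 1 if any locked file has drifted from
// its recorded hash, 0 otherwise. The caller would pass this value to
// std::process::exit so that `sdf status --locked` fails in scripts.
fn check_locked(entries: &[(&str, &str, bool, &str)]) -> i32 {
    let mut code = 0;
    for (path, manifest_md5, locked, current_md5) in entries {
        if *locked && manifest_md5 != current_md5 {
            eprintln!("error: locked file '{}' was modified", path);
            code = 1;
        }
    }
    code
}
```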
Hopefully that makes sense! Curious which of these implementations is more appealing to you.
This looks good to me! A few comments/ideas: `sdf lock <file1> <file2> ...` and maybe `sdf unlock`, etc.
Hi Vince,
Love the tool so far. I've been integrating it into more of my repos, especially publication code, and I agree that a language-ecosystem-agnostic tool like this was sorely needed. So, thanks for your work!
I have a Nextflow pipeline I'm developing that starts off by pulling some auxiliary data files for use downstream. Previously, I was just using `wget` with a hardcoded URL, whereas now I have the URLs in a data manifest. For the sake of a publication, I'd like SciDataFlow to give an error if the data manifest URL links to a file with a different md5 than was previously recorded in the manifest. Think of it as an immutable or "locked" asset in the data manifest. This would enhance reproducibility in that the workflow would prevent future users from using different auxiliary data than was used in the manuscript.
Curious if you see any value in this enhancement, or if other tools might be better suited here. If you do see some value here, I'd be happy to work on a PR for your review, probably involving an optional "locked" `clap` parameter in `sdf pull`, `sdf status`, or both, where the default behavior would be to leave the asset unlocked and mutable.
Thanks again for your good work,
--Nick