Open nrminor opened 9 months ago
Hi Nick,
I'm so glad to hear you're liking SciDataFlow, and that you're using it in a bunch of projects!
I think I see what you're getting at here, but let me check. SDF currently has `sdf status`, which will indicate changes, but you want a "lock" feature that will raise an error when something changes. Is that right? Should it error whenever any SDF command is run, or just `sdf status`?
Yep, you got it. My original idea for how to implement it was indeed a command-line arg on `sdf status`, e.g.:

```shell
sdf status --locked # exit code 1 if a file has changed
```
That said, though I hadn't considered it before, having it as a global option is appealing too, both from an under-the-hood implementation perspective as well as from a usage perspective.
From the implementation perspective, you could add a boolean `locked` field to your `DataFile` struct and then add/modify some `impl` blocks to handle it:
```rust
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
pub struct DataFile {
    pub path: String,
    pub tracked: bool,
    pub md5: String,
    pub size: u64,
    pub url: Option<String>,
    pub locked: bool, // new locked field
}
```
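To make the idea concrete, here is a minimal sketch of what one of those `impl` blocks could look like. The `verify_locked` method name and the plain-`String` error type are assumptions for illustration, not SciDataFlow's actual API, and the serde derives are omitted to keep the sketch self-contained:

```rust
// Hypothetical sketch: enforcing a `locked` field during a status check.
// `verify_locked` and the String error are invented for illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct DataFile {
    pub path: String,
    pub tracked: bool,
    pub md5: String,
    pub size: u64,
    pub url: Option<String>,
    pub locked: bool,
}

impl DataFile {
    /// Returns Err if the file is locked and its current MD5 no longer
    /// matches the MD5 recorded in the manifest; Ok(()) otherwise.
    pub fn verify_locked(&self, current_md5: &str) -> Result<(), String> {
        if self.locked && current_md5 != self.md5 {
            return Err(format!(
                "locked file '{}' changed (expected md5 {}, found {})",
                self.path, self.md5, current_md5
            ));
        }
        Ok(())
    }
}
```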
From a usage perspective, I could see it being useful like this:
```shell
# initialize the project
sdf init

# add files you'd like to track, with it being important that file3 never changes
sdf add file1
sdf add file2
sdf add --locked file3

# ... run a bunch of other project code to generate reproducible results ...

sdf status # gives an error if the locked file3 was changed
```
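The exit-code behavior above could be sketched roughly as follows. This is a hypothetical helper, not SciDataFlow's real status code; the tuple-based manifest entries are purely illustrative:

```rust
// Hypothetical: walk manifest entries as (path, manifest_md5, locked,
// current_md5) tuples and return 1 if any locked file has drifted from
// its recorded hash, 0 otherwise. The caller would pass this value to
// std::process::exit so that `sdf status --locked` fails in scripts.
fn check_locked(entries: &[(&str, &str, bool, &str)]) -> i32 {
    let mut code = 0;
    for (path, manifest_md5, locked, current_md5) in entries {
        if *locked && manifest_md5 != current_md5 {
            eprintln!("error: locked file '{}' was modified", path);
            code = 1;
        }
    }
    code
}
```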
Hopefully that makes sense! Curious which of these implementations is more appealing to you.
This looks good to me! A few comments/ideas: `sdf lock <file1> <file2> ...` and maybe `sdf unlock`, etc.
Hi Vince,
Love the tool so far. I've been integrating it into more of my repos, especially publication code, and I agree that a language-ecosystem-agnostic tool like this was sorely needed. So, thanks for your work!
I have a Nextflow pipeline I'm developing that starts off by pulling some auxiliary data files for use downstream. Previously, I was just using `wget` with a hardcoded URL, whereas now I have the URLs in a data manifest. For the sake of a publication, I'd like SciDataFlow to give an error if the data manifest URL links to a file with a different md5 than was previously recorded in the manifest. Think of it as an immutable or "locked" asset in the data manifest. This would enhance reproducibility in that the workflow would prevent future users from using different auxiliary data than was used in the manuscript.
Curious if you see any value in this enhancement, or if other tools might be better suited here. If you do see some value here, I'd be happy to work on a PR for your review, probably involving an optional "locked" `clap` parameter in `sdf pull`, `sdf status`, or both, where the default behavior would be to leave the asset unlocked and mutable.
Thanks again for your good work,
--Nick