spine-generic / data-multi-subject

Multi-subject data for the Spine Generic project
Creative Commons Attribution 4.0 International

fscking #2

Open kousu opened 4 years ago

kousu commented 4 years ago

git-annex manages only loosely connected components. We should add a CI script (a GitHub Action?) that runs a daily/weekly/monthly fsck of our dataset to make sure it's in a consistent state.

  1. [ ] Corruption. git-annex names content according to its checksum, and can catch and refuse to download a file that has been corrupted, but this is only caught at download time, so we could have a broken dataset (or worse: a broken past version) and not know it.
  2. [ ] Missing files. Files might get rm'd and go missing on Amazon (OpenNeuro is struggling with this very issue: https://github.com/OpenNeuroOrg/openneuro/issues/1649 / https://github.com/OpenNeuroOrg/openneuro/issues/1361).
  3. [ ] Wasted space. Files uploaded to the bucket but then dropped in the course of code review will stick around, unlike with plain git, where they can be rebased/squashed away. We should add something like git gc that finds unused files on Amazon and erases them.
    • there's git annex unused, but I'm pretty sure that only looks at files on the tips of branches (or tags), whereas we want to keep files mentioned in any commit.
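
A rough sketch of what the CI job could run for items 1 and 2, assuming the special remote is called `amazon` (a placeholder, not our real remote name) and that `git annex fsck --json` emits one JSON object per line with `key` and `success` fields (worth double-checking against the installed git-annex version). With `--fast`, fsck against a remote should only check that content is present; without it, the content is downloaded and checksummed. `--all` asks for every key git-annex knows about, not just the files in the current checkout. It does not address item 3.

```python
import json
import subprocess


def fsck_command(remote, verify_content=False):
    """Build the git-annex fsck invocation for a remote.

    verify_content=False adds --fast, which should skip the expensive
    download-and-checksum step and only check presence (item 2).
    verify_content=True omits it so checksums are verified (item 1).
    """
    cmd = ["git", "annex", "fsck", "--all", "--json", f"--from={remote}"]
    if not verify_content:
        cmd.append("--fast")
    return cmd


def failed_items(json_lines):
    """Return the parsed records whose "success" field is not true.

    Assumes one JSON object per non-empty line; the field names are an
    assumption about git-annex's JSON output, not something guaranteed here.
    """
    failures = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if not record.get("success", False):
            failures.append(record)
    return failures


def run_fsck(repo_dir, remote, verify_content=False):
    """Run fsck in repo_dir and return the failing records.

    fsck exits non-zero when it finds problems, so the exit status is
    not treated as an error here; the JSON output is inspected instead.
    """
    proc = subprocess.run(
        fsck_command(remote, verify_content),
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=False,
    )
    return failed_items(proc.stdout.splitlines())
```

The CI job could call something like `run_fsck(".", "amazon")` weekly for the cheap presence check and `run_fsck(".", "amazon", verify_content=True)` monthly for the full checksum pass, failing the build if the returned list is non-empty.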