git-annex manages only loosely connected components. We should add a CI script (a GitHub Action?) that runs `git annex fsck` over our dataset on a daily/weekly/monthly schedule to make sure it's in a consistent state.
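A minimal sketch of what such a scheduled check could run, assuming the bucket is configured as a special remote called `amazon` (a hypothetical name) and that each record emitted by `git annex fsck --json` carries a boolean `success` field. Both are assumptions to verify against our setup and git-annex version. Note that checking a remote this way makes git-annex download the content in order to checksum it, so a full run may be slow and may cost bandwidth.

```python
import json
import subprocess


def build_fsck_command(remote, all_versions=True):
    """Build a `git annex fsck` invocation that checks content held on a remote."""
    cmd = ["git", "annex", "fsck", f"--from={remote}", "--json"]
    if all_versions:
        # --all operates on every known key, so content that is only
        # referenced by past versions gets checked too.
        cmd.append("--all")
    return cmd


def failed_items(json_lines):
    """Return the key (or file name) of every record not marked successful.

    Assumes one JSON object per line with a boolean "success" field.
    """
    failures = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if not record.get("success", False):
            failures.append(record.get("key") or record.get("file"))
    return failures


def run_fsck(remote):
    """Run the check; an empty list means nothing was reported as broken."""
    proc = subprocess.run(
        build_fsck_command(remote),
        capture_output=True,
        text=True,
        check=False,  # fsck exits non-zero on problems; we inspect the records instead
    )
    return failed_items(proc.stdout.splitlines())
```

A CI job could call `run_fsck("amazon")` and fail the build if the returned list is non-empty.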
[ ] Corruption. git-annex names content by its checksum, and can catch and refuse to download a file that has been corrupted, but the corruption is only caught at download time, so we could have a broken dataset (or worse: a broken past version) and not know it.
[ ] Wasted space. Files uploaded to the bucket but then dropped in the course of code review will stick around, unlike with plain git, where they can be rebased/squashed away. We should add something like `git gc` that finds unused files on Amazon and erases them.
There's `git annex unused`, but I'm pretty sure that only looks at files on the tips of branches (or tags), whereas we want to keep files mentioned in any commit.
[ ] Missing files. Content can get `rm`'d and go missing on Amazon (OpenNeuro is struggling with this very issue: https://github.com/OpenNeuroOrg/openneuro/issues/1649 / https://github.com/OpenNeuroOrg/openneuro/issues/1361).
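
The "keep anything mentioned in any commit" rule boils down to a set difference, which could be sketched as follows. This is only the bookkeeping, not a working gc: it assumes something else has already collected the symlink targets (or pointer-file contents) of annexed files from every commit, and the list of keys stored in the bucket. The helper names `key_from_pointer` and `unreferenced_keys` are hypothetical. It relies on git-annex symlinks ending in `.../annex/objects/.../KEY/KEY` and pointer files containing `/annex/objects/KEY`.

```python
import posixpath

ANNEX_MARKER = "annex/objects/"


def key_from_pointer(text):
    """Return the annex key named by a symlink target or pointer file, else None."""
    stripped = text.strip()
    if not stripped:
        return None
    first_line = stripped.splitlines()[0]
    if ANNEX_MARKER not in first_line:
        return None  # an ordinary file or symlink, not annexed content
    # In both forms the key is the last path component.
    return posixpath.basename(first_line)


def unreferenced_keys(stored_keys, pointers_from_history):
    """Keys present in the bucket that no commit in the history refers to."""
    referenced = {key_from_pointer(p) for p in pointers_from_history}
    referenced.discard(None)
    return sorted(set(stored_keys) - referenced)
```

Anything this returns would still deserve a dry run and a human look before being erased from Amazon.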