spine-generic / data-multi-subject

Multi-subject data for the Spine Generic project
Creative Commons Attribution 4.0 International

Speed up git-annex operations #69

Closed kousu closed 3 years ago

kousu commented 3 years ago

Turns out I was misled by git-annex v8's default configuration of

# .git/info/attributes
*  filter=annex

This is wildly slow: it makes git-annex process every single file on every single commit. It seems to have optimizations, enough to avoid rehashing unchanged files, but even just opening them up to check is slow on a dataset this large.

In our application we don't want to annex every single file. That's painfully wasteful, and it's not how I set it up. It turns out I can do one better, though: by only letting git-annex get its fingers on the files we want to annex, and making sure git processes the rest directly, commit times are hugely improved. It's not actually necessary for git-annex to see all the files; it is happy to accept this .gitattributes. It only writes its overly greedy default to a clone's .git/info/attributes if there is no preexisting .gitattributes, presumably in an attempt to give a consistent user experience (at the hidden cost of performance).
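
For illustration only, a narrowed attributes file in this spirit might look like the sketch below. The `*.nii.gz` and `*.nii` patterns are an assumption about which files should go through git-annex, not necessarily the file used in this repository; substitute whatever patterns match your large files.

```
# .gitattributes (hypothetical sketch; the patterns below are assumptions)
# Route only the large imaging files through git-annex's filter.
*.nii.gz  filter=annex
*.nii     filter=annex
# No other path carries filter=annex, so git processes those files directly.
```

Running `git check-attr filter -- <path>` reports which filter, if any, applies to a given path, which is a quick way to confirm that small files are no longer routed through git-annex.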