It turns out I was misled by git-annex v8's default configuration:
# .git/info/attributes
* filter=annex
This is wildly slow. It routes every single file through git-annex on every single commit. git-annex does seem to have optimizations, enough to avoid rehashing unchanged files, but even just opening each file to check it is slow on a dataset this large.
In our application we don't want to annex every single file. That's painfully wasteful, and it's not how I set things up. It turns out I can do one better, though: by letting git-annex get its fingers on only the files we want to annex, and making sure git handles the rest directly, commit times improve hugely. It's not actually necessary for git-annex to see all the files, and it is happy to accept this .gitattributes. It only writes its overly greedy default to a clone's .git/info/attributes if there is no preexisting .gitattributes, presumably in an attempt to give a consistent user experience (at the hidden cost of performance).
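As a rough illustration of the idea (not the exact file from this project), a .gitattributes that hands only selected patterns to git-annex might look something like the sketch below. The `*.nii.gz` and `*.h5` patterns are made-up placeholders; substitute whatever files you actually want annexed.

```gitattributes
# .gitattributes
# Hypothetical patterns: only these paths go through the git-annex filter.
*.nii.gz filter=annex
*.h5 filter=annex
# Everything else has no filter attribute set, so git handles it directly.
```

To check which rule a given path ends up with, `git check-attr filter -- <path>` reports the filter attribute git resolves for it.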