spine-generic / data-multi-subject

Multi-subject data for the Spine Generic project
Creative Commons Attribution 4.0 International

Use caching in CI #82

Closed: kousu closed this 3 years ago

kousu commented 3 years ago

This combines https://git-annex.branchable.com/tips/local_caching_of_annexed_files/ with https://github.com/actions/cache to speed up dataset validation. New versions are bootstrapped from the most recent copy inside Github's datacenter, instead of having to wait to download and redownload the whole thing from Amazon. It also saves us paying Amazon for the bandwidth.
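In outline, the combination looks something like the sketch below. This is a hedged sketch assembled from the wiki tip, not a copy of the actual workflow: `~/.annex-cache` and the remote name `cache` are illustrative choices, and actions/cache is what saves and restores `~/.annex-cache` between runs.

```
# on a cache miss: create the shared object cache (per the wiki tip)
git init --bare ~/.annex-cache
git -C ~/.annex-cache annex init cache
git -C ~/.annex-cache config annex.hardlink true   # take hardlinks from repos pushing into it

# in the freshly cloned dataset: register the cache as a cheap extra remote
git remote add cache ~/.annex-cache
git config remote.cache.annex-cost 10                  # try it before the S3 remote
git config remote.cache.annex-speculate-present true   # probe it even without location-log entries
git config remote.cache.annex-pull false               # keep `git annex sync` away from it
git config remote.cache.annex-push false

git annex get .            # cache hit: local copy; cache miss: Amazon
git annex copy --to cache  # refresh the cache before actions/cache re-saves it
```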

Caveats:

kousu commented 3 years ago

I put a lot of effort into this and I got it working(!), as arcane as the whole thing turned out to be, but there's a final problem: this dataset is too big to cache. Github's caching limit is 5GB per repo, and they don't support caching partial datasets either; it's all or nothing. :pretzel:

It's too bad. When the total size comes out to under 5GB the cache is really really fast.

kousu commented 3 years ago

Still to do:

kousu commented 3 years ago

I figured out why git annex copy --to cache is slow: it's actually copying. So it's duplicating the entire dataset. Which is...kind of the opposite of what we want a cache for.

It looks like annex.hardlink only works when set on the remotes.

So maybe I can work around this by switching it around? git config annex.hardlink true on the local dataset before copying? Or go up into the cache and add the local dataset?
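For reference, a rough sketch of those two ideas (hedged; the remote name and cache path are illustrative, and whether git-annex honours annex.hardlink in either direction is exactly the open question):

```
# idea 1: set annex.hardlink on the dataset itself before pushing to the cache
git config annex.hardlink true
git annex copy --to cache

# idea 2: flip it around: register the dataset as a remote *of the cache* and
# pull from inside the cache, so the annex.hardlink set on the cache side applies
git -C ~/.annex-cache remote add workdir "$PWD"
git -C ~/.annex-cache annex get --all --from workdir   # assumes get --all is accepted in the bare cache
```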

Or, maybe we could tell github's caching to cache .git, and when it restores it, use cache-hit to detect that, and in that case `mv .git ~/.annex-cache` and then checkout the repo and initialize it.

kousu commented 3 years ago

I've solved the git-annex hardlink issue.

New problem: github is just glitching out. Maybe the cache is too large?

This run generates a 3.9GB cache: https://github.com/spine-generic/data-multi-subject/runs/2334909791?check_suite_focus=true. It first says

Cache not found for input keys: fannex-42a977c4d964e21ce58749ac61d5ce7b5728116662d82c0d1fcbcd4c2c643066, fannex-

then at the end it says

/usr/bin/tar --posix --use-compress-program zstd -T0 -cf cache.tzst -P -C /home/runner/work/data-multi-subject/data-multi-subject --files-from manifest.txt
Cache Size: ~3919 MB (4109396872 B)
Cache saved successfully
Cache saved with key: fannex-42a977c4d964e21ce58749ac61d5ce7b5728116662d82c0d1fcbcd4c2c643066

But in the following run it says

Cache not found for input keys: fannex-42a977c4d964e21ce58749ac61d5ce7b5728116662d82c0d1fcbcd4c2c643066, fannex-

so...something is broken.

I'll try making the cache smaller and see if that works.

kousu commented 3 years ago

https://github.com/spine-generic/data-multi-subject/runs/2335797301?check_suite_focus=true#step:12:11

Cache not found for input keys: gannex-42a977c4d964e21ce58749ac61d5ce7b5728116662d82c0d1fcbcd4c2c643066, gannex-

In https://github.com/spine-generic/data-multi-subject/runs/2335797301?check_suite_focus=true#step:16:1 the dataset took 2m25s to download. Then in https://github.com/spine-generic/data-multi-subject/runs/2335797301?check_suite_focus=true#step:44:3 it took 1m41s to upload to Github's cache:

Cache Size: ~3060 MB (3208872259 B)
Cache saved successfully
Cache saved with key: gannex-42a977c4d964e21ce58749ac61d5ce7b5728116662d82c0d1fcbcd4c2c643066

In the follow-up run, https://github.com/spine-generic/data-multi-subject/runs/2335860314?check_suite_focus=true#step:12:12 shows it took 1m51s to download the cache:

Received 3107979263 of 3208872259 (96.9%), 35.4 MBs/sec
Received 3141533695 of 3208872259 (97.9%), 35.4 MBs/sec
Received 3183476735 of 3208872259 (99.2%), 35.5 MBs/sec
Received 3208872259 of 3208872259 (100.0%), 34.3 MBs/sec
Cache Size: ~3060 MB (3208872259 B)
/usr/bin/tar --use-compress-program zstd -d -xf /home/runner/work/_temp/33f6a11a-dab8-4b4d-badb-ce0b3511c747/cache.tzst -P -C /home/runner/work/data-multi-subject/data-multi-subject
Cache restored successfully
Cache restored from key: gannex-42a977c4d964e21ce58749ac61d5ce7b5728116662d82c0d1fcbcd4c2c643066

and in https://github.com/spine-generic/data-multi-subject/runs/2335860314?check_suite_focus=true#step:16:1 it still took 1m16s to actually make the cache usable -- probably because git-annex is checksumming everything. Which... I guess it should.


[...]
get sub-amu01/dwi/sub-amu01_acq-b0_dwi.nii.gz (from cache...) (checksum...) ok
get sub-amu01/anat/sub-amu01_acq-T1w_MTS.nii.gz (from cache...) (checksum...) ok
get sub-amu02/dwi/sub-amu02_acq-b0_dwi.nii.gz (from cache...) (checksum...) ok
[...]

and at the end, https://github.com/spine-generic/data-multi-subject/runs/2335860314?check_suite_focus=true#step:44:1 says:

Cache hit occurred on the primary key gannex-42a977c4d964e21ce58749ac61d5ce7b5728116662d82c0d1fcbcd4c2c643066, not saving cache.

kousu commented 3 years ago

We can disable checksumming the cache with

git config remote.cache.annex-verify false

I'm testing this in https://github.com/spine-generic/data-multi-subject/actions/runs/745631720.

But do we want to? I'm unsure. It's trading correctness for speed because if Github/Microsoft/Azure corrupts our data by accident, well, then we'll have some confusing PRs.

This didn't save that much time: only about 15s, because using the cache isn't hardlinking to it, probably because I set annex.thin true.
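A quick way to check that (hedged; `git config` prints nothing for unset keys, and the link-count test only shows whether objects are shared with *something*, which with annex.thin on could also be the unlocked working-tree copy):

```
# are the two conflicting settings both in play?
git config annex.thin
git config annex.hardlink
# count annexed objects whose storage is shared (hardlink count > 1);
# if nothing is hardlinked to the cache, this stays low
find .git/annex/objects -type f -links +1 | wc -l
```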

.....

kousu commented 3 years ago

So:

kousu commented 3 years ago

Note to self: try a tweaked strategy of caching .git directly, but if the cache was restored, doing mv .git ../.annex-cache. Maybe that'll be simpler? There will be no need for the git annex copy --to cache step then, which is currently giving me trouble because of the conflicting features of annex.hardlink and annex.thin.
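For the record, a rough sketch of that tweaked strategy as a shell step (hedged: `CACHE_HIT` stands in for the actions/cache step's `cache-hit` output, and the paths and the `core.bare` tweak are illustrative guesses, not a tested recipe):

```
# if actions/cache restored last run's checkout, recycle its .git as the object cache
if [ "$CACHE_HIT" = "true" ]; then
    mv data-multi-subject/.git ~/.annex-cache
    rm -rf data-multi-subject
    # a detached .git directory may need this to act as a standalone repository
    git -C ~/.annex-cache config core.bare true
fi

# always test a fresh clone of the branch under review
git clone https://github.com/spine-generic/data-multi-subject
cd data-multi-subject
git annex init

# point git-annex at the recycled objects instead of re-downloading them
if [ -d ~/.annex-cache ]; then
    git remote add cache ~/.annex-cache
    git config remote.cache.annex-cost 10
    git config remote.cache.annex-speculate-present true
fi

git annex get .   # cache first, Amazon for anything new
# actions/cache then re-saves this clone's .git at the end of the job,
# so no explicit `git annex copy --to cache` step is needed
```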

kousu commented 3 years ago

It took me a long time but I've figured out how to avoid redundant copying of our datasets. Sort of. See https://github.com/kousu/test-git-annex-hardlinks/tree/bare/annex-hardlinks.sh. Details at https://github.com/kousu/test-git-annex-hardlinks/tree/trunk/README.md.

The wiki page does not explain how to actually read the cache back efficiently. It sets up the cache to take hardlinks from your repos, but apparently you can't then reuse it by taking hardlinks to it. Maybe this used to work. Maybe it wasn't a problem in v7, when everything was symlinks, instead of v8's 'unlocked' (full-copy) files by default.

I ran into several bugs in git-annex figuring this out and eventually decided to side-step it for the final part. That seems to work and I can integrate it into this PR once I wake up again.

I've left a bug report of sorts. We'll see if we catch any fish. I'm not holding my breath though, since the last comment on that thread was by @yarikoptic over a year ago with no answer. :shrug:

yarikoptic commented 3 years ago
kousu commented 3 years ago

My reproducer is https://github.com/kousu/test-git-annex-hardlinks :)

It has full version info embedded in https://github.com/kousu/test-git-annex-hardlinks/blob/trunk/log.txt

kousu commented 3 years ago

Thanks for dropping in @yarikoptic, I appreciate that you could grab some time to help us out in the neuroinfo community.

* re cache: I wondered -- why not cache the entire dataset, and then just `datalad update --merge && datalad get .` it to get up to date (unless you expect non-fastforwards, in which case `datalad update && git reset --hard origin/master && datalad get` or alike; I just merged [datalad/datalad#5534](https://github.com/datalad/datalad/pull/5534) so the next "major" release should have `update --how=reset` to accomplish that), and then re-upload the updated one to the cache?

My plan is indeed to cache the entire dataset.

This initializes the cache:

https://github.com/spine-generic/data-multi-subject/blob/c8e828fc47c459e72812f1f5f145219dfe0f86a7/.github/workflows/validator.yml#L121-L129

this restores the cache, if it exists (skipping the previous step):

https://github.com/spine-generic/data-multi-subject/blob/c8e828fc47c459e72812f1f5f145219dfe0f86a7/.github/workflows/validator.yml#L111-L119

this downloads from the local cache or the upstream (amazon) remote:

https://github.com/spine-generic/data-multi-subject/blob/c8e828fc47c459e72812f1f5f145219dfe0f86a7/.github/workflows/validator.yml#L131-L153

finally this uploads to the local cache, so it should integrate any changes from amazon into the cache:

https://github.com/spine-generic/data-multi-subject/blob/c8e828fc47c459e72812f1f5f145219dfe0f86a7/.github/workflows/validator.yml#L175

and finally there's an implicit "upload to Github's cache server" step implied by actions/cache@v2.
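Condensed into shell, the intent of those download and upload steps is roughly the following (a hedged paraphrase; the linked YAML is the authoritative version, and the `--not --in cache` filter is just one way the copy-back might be expressed):

```
# download: the cache remote (set up as a cheap remote) is tried before Amazon,
# and git-annex falls back to the S3 remote for anything not in the cache
git annex get .

# upload: push anything that had to come from Amazon back into the local cache
git annex copy --to cache --not --in cache

# actions/cache@v2 then re-uploads the cache directory to Github's cache server
# automatically in its post-job step
```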

The problem I posted about on the git-annex wiki is that the download step wastes an entire extra copy (in my sample, ~3 gigabytes; in reality, ~10):

https://github.com/spine-generic/data-multi-subject/pull/82/checks?check_run_id=2345427361#step:15:8 -> shows 80% disk usage

https://github.com/spine-generic/data-multi-subject/pull/82/checks?check_run_id=2345427361#step:17:8 -> shows 84%

the intervening download step only downloads from the cache: https://github.com/spine-generic/data-multi-subject/pull/82/checks?check_run_id=2345427361#step:16:18

I am looking for a way to avoid this. Given how fast Github's network is, making these copies is almost as slow as just downloading everything from Amazon every time.
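For what it's worth, a quick way to quantify the duplication on the runner (hedged; assumes the cache lives at `~/.annex-cache`):

```
# free space in the runner's work area (the 80% -> 84% jump above)
df -h /home/runner/work
# size of the cache's object store vs. the fresh clone's
du -sh ~/.annex-cache .git/annex/objects
```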

Does datalad get know how to make hardlinks when git-annex won't?

yarikoptic commented 3 years ago

re reproducer - more specifically it is https://github.com/kousu/test-git-annex-hardlinks/blob/trunk/annex-hardlinks.sh

it is very nice and detailed, but may be a bit too much/too detailed, too much to grasp, and unlikely anyone would dare to run it on their system, e.g. due to

git config --global advice.detachedHead false
git config --global annex.alwayscommit false

etc. I was talking about something as small as possible, to the point. Something like http://www.onerussian.com/tmp/check-annex-thin-hardlink.sh which I had for some other issue. You could always export HOME=$PWD in it after jumping to a temp dir if needed to modify global config etc.

yarikoptic commented 3 years ago

My plan is indeed to cache the entire dataset.

but then why do you need cache at all?

kousu commented 3 years ago

My plan is indeed to cache the entire dataset.

but then why do you need cache at all?

I want to cache the dataset to speed up CI and to avoid paying amazon for redundant bandwidth. Our status quo is 10 minutes and 10GB (which is about ~20 cents paid to AWS) per run of bids-validator. And this isn't even a very large dataset. Any improvement on that front would help us out in the long term a lot.

re reproducer - more specifically it is https://github.com/kousu/test-git-annex-hardlinks/blob/trunk/annex-hardlinks.sh

it is very nice and detailed, but may be a bit too much/too detailed, too much to grasp

I don't know how to shorten it. I wanted to copy the wiki page closely to ensure I would be able to get useful feedback from the forum.

Skipping the utilities, the main script is basically just a copy of what's in that wiki page, and it comes out to 70 lines, just double your reproducer's 34 lines, and most of that is because I'm making two dataset instances, one to serve as the cache-miss copy and one as the cache-hit copy.

and unlikely anyone would dare to run it on their system, e.g. due to

git config --global advice.detachedHead false
git config --global annex.alwayscommit false

etc. I was talking about something as small as possible, to the point. Something like http://www.onerussian.com/tmp/check-annex-thin-hardlink.sh which I had for some other issue. You could always export HOME=$PWD in it after jumping to a temp dir if needed to modify global config etc.

That's good feedback, and also a good tip! I'll add it. That won't break anything else?

yarikoptic commented 3 years ago

My plan is indeed to cache the entire dataset.

but then why do you need cache at all?

I want to cache the dataset to speed up CI and to avoid paying amazon for redundant bandwidth. Our status quo is 10 minutes and 10GB (which is about ~20 cents paid to AWS) per run of bids-validator. And this isn't even a very large dataset. Any improvement on that front would help us out in the long term a lot.

I understand the rationale. What I meant is: why bother with establishing a git-annex cache repo, and not just push the original entire dataset to the github cache?

If you need to make some modifications to the dataset which you do not want to push to the cache (edit: the github cache, and thus the original cached dataset), you could create a quick throw-away copy via `datalad install --reckless shared-all`; or, if those files would be reproducible, `--reckless ephemeral` (which would symlink .git/annex to the original dataset -- but that would mean that newly added keys get added to the original annex/objects as well, so you do not want to grow it with keys which are not reproducible across multiple CI runs).
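For concreteness, a rough sketch of that suggestion (hedged; the source path and clone name are made up, and the exact `--reckless` spelling depends on the installed datalad version):

```
# quick throw-away working copy, as suggested above
datalad install --reckless shared-all -s /path/to/cached-dataset work-copy

# or, if any newly added content would be reproducible across CI runs,
# share .git/annex with the original via a symlink instead
datalad install --reckless ephemeral -s /path/to/cached-dataset work-copy

cd work-copy
datalad get .   # fetches content without duplicating the cached objects
```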

kousu commented 3 years ago

I understand the rationale. What I meant is: why bother with establishing a git-annex cache repo, and not just push the original entire dataset to the github cache?

I misunderstood. This is a very very good question. And I would love to do that, but I don't know how. As far as I can tell, it's not possible.

https://github.com/spine-generic/data-multi-subject/blob/4134fe808c686f90ed7f7ca44afaa05add6ed5d3/.github/workflows/validator.yml#L50-L51

creates a fresh clone of the dataset, which is what I want: I want to be testing the fresh version. But I can't just merge two .git/ folders, so I can't push the entire dataset to github's cache and then reuse it directly, because then how could I test updates?

I can't cache just .git/annex/ because that confuses git-annex:

demo of caching `.git/annex/`:

```
$ cd $(mktemp -d)
```

```
$ git clone -b r20201130 https://github.com/spine-generic/data-multi-subject data-multi-subject1
Cloning into 'data-multi-subject1'...
remote: Enumerating objects: 47703, done.
remote: Counting objects: 100% (14497/14497), done.
remote: Compressing objects: 100% (4077/4077), done.
remote: Total 47703 (delta 7659), reused 14434 (delta 7651), pack-reused 33206
Receiving objects: 100% (47703/47703), 4.71 MiB | 3.86 MiB/s, done.
Resolving deltas: 100% (17078/17078), done.
$ cd data-multi-subject1/
$ git annex get $FILE
(scanning for unlocked files...)
get sub-amu01/dwi/sub-amu01_dwi.nii.gz (from amazon...) (checksum...) ok
(recording state in git...)
$ git annex whereis $FILE
whereis sub-amu01/dwi/sub-amu01_dwi.nii.gz (4 copies)
  	5a5447a8-a9b8-49bc-8276-01a62632b502 -- [amazon]
  	5cdba4fc-8d50-4e89-bb0c-a3a4f9449666 -- julien@julien-macbook.local:~/code/spine-generic/data-multi-subject
  	9e4d13f3-30e1-4a29-8b86-670879928606 -- alex@NeuroPoly-MacBook-Pro.local:~/data/data-multi-subject
  	bd8bd066-246b-43e4-9631-bf42351b214a -- kousu@requiem:/tmp/tmp.A2vJ7kKzOw/data-multi-subject1 [here]

  amazon: https://data-multi-subject---spine-generic---neuropoly.s3.ca-central-1.amazonaws.com/SHA256E-s8901683--cc39b2f0bea6673904d3926dc899f339ac59f5f9a114563fe01b20278b99435d.nii.gz
ok
$
$ # "upload" .git/annex to the "cache server"
$ mv data-multi-subject1/.git/annex/ cache
```

The follow up run:

```
$ git clone -b r20201130 https://github.com/spine-generic/data-multi-subject data-multi-subject2
Cloning into 'data-multi-subject2'...
remote: Enumerating objects: 47703, done.
remote: Counting objects: 100% (14497/14497), done.
remote: Compressing objects: 100% (4078/4078), done.
remote: Total 47703 (delta 7659), reused 14433 (delta 7650), pack-reused 33206
Receiving objects: 100% (47703/47703), 4.71 MiB | 3.75 MiB/s, done.
Resolving deltas: 100% (17078/17078), done.
$ cd data-multi-subject2/
$ git annex get $FILE
git-annex: This repository is not initialized for use by git-annex, but .git/annex/objects/ exists, which indicates this repository was used by git-annex before, and may have lost its annex.uuid and annex.version configs. Either set back missing configs, or run git-annex init to initialize with a new uuid.
```

Okay so I go try to please it:

```
$ git config annex.uuid bd8bd066-246b-43e4-9631-bf42351b214a
$ git config annex.version 8
$ git annex get $FILE   # no reaction?
$ ls -lh $FILE          # and no effect: still 105 bytes
-rw-r--r-- 1 kousu 105 Apr 19 00:44 sub-amu01/dwi/sub-amu01_dwi.nii.gz
```

Caching the entire dataset but then setting it up as an annex.hardlink remote, as the wiki suggests, seems to almost work.

(sorry it took me a few days to respond; I was fixing up some other things. I do appreciate your time and I know there's a conference to go to tomorrow)

I've also cleaned up https://github.com/kousu/test-git-annex-hardlinks/blob/annex-cache/annex-hardlinks.sh according to your suggestions, so, thanks for pressing me to do better there.

yarikoptic commented 3 years ago

But I can't just merge two .git/ folders, so I can't push the entire dataset to github's cache and then reuse it directly because then how could I test updates?

see above on datalad update?

kousu commented 3 years ago

I'm giving up on this for now. I can't get it to work.

yarikoptic commented 3 years ago

:-( I might try some time later myself as well, since it would be a nice feature to have

yarikoptic commented 3 years ago

BTW, for your cases -- do you care only about Linux and/or OSX or also about Windows? (situation on windows could be "different")

kousu commented 3 years ago

I just wanted to get testing working in CI, so that means ubuntu 20.04 right now. It's a pretty niche use case.

I can imagine Windows might be tricky, though in principle it's supported:

Hard links are fully supported on NTFS and NFS file systems

https://docs.microsoft.com/en-us/windows/win32/fileio/hard-links-and-junctions

To create a hard link, use the CreateHardLink function.

https://superuser.com/a/850774

The utility on Windows is called mklink

IMO this should be addressed in git-annex: annex.thin and annex.hardlink need to be merged into one feature or something. Maybe annex.hardlink needs to go away in favour of paying attention directly to .git/objects/info/alternates (git clone --shared; ...; git annex init already causes git config annex.hardlink true), so that annex.thin can take precedence.
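To make that last parenthetical concrete (a hedged illustration of the existing behaviour being referred to; the paths are made up):

```
# git clone --shared records the source in .git/objects/info/alternates
git clone --shared /path/to/original shared-clone
cd shared-clone
cat .git/objects/info/alternates
# per the comment above, git-annex notices the alternates and enables hardlinking
git annex init
git config annex.hardlink   # reportedly comes back "true" in this situation
```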

I'll give datalad update a shot later this week.