kousu opened this issue 3 years ago
Thank you for opening this thread, @kousu. It clarifies very well where we are at in terms of git-annex usage. I would add to the counterargument list the many issues we've encountered with git-annex's versions (v8 vs. v7 vs. v6, etc.): people not realizing there is a version requirement (this will happen, again and again), or not being able to download the proper version from their distros; the conda workaround is good for people who are familiar with the terminal, but not for my grandmother.
git, on the other hand, is more reliable in that respect.
@jcohenadad pointed out a way to mitigate point 4, about `annex.thin`. Last summer, when we were trying to keep all of our data in plain git, we were telling people to just download release tarballs: https://github.com/spine-generic/data-single-subject_DO-NOT-USE#download-zip-package-recommended
It's not at all unreasonable to ask users of the datasets to download releases as tarballs. Maybe even version-pinned in their software, with pooch?
It would only be contributors to datasets who would need to get the git copy and suffer through having duplicated files. Which sucks. But we can provide those people with server space they can work on if the data is too much for their personal computers.
A really handy thing about GitHub is that it generates release tarballs dynamically, straight out of .git, so it doesn't need to keep duplicate files around. If we were to self-host (#70), well, I assume Gogs can do the same thing? Gitolite (https://github.com/neuropoly/datalad/blob/ng/gitolite/internal-server.md) can't, but maybe we can hack it in using `git fast-export`?
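For the self-hosting case, the tarball generation itself is just `git archive`, which streams an archive straight out of `.git` without keeping duplicate files on disk; a throwaway sketch (the repo name and tag below are made up for the demo):

```shell
# build a tiny repo, tag it, and generate a release tarball out of .git
rm -rf tarball-demo demo-r1.tar.gz
git init -q tarball-demo
echo "some data" > tarball-demo/README.md
git -C tarball-demo add README.md
git -C tarball-demo -c user.email=demo@example.com -c user.name=demo commit -qm "init"
git -C tarball-demo tag r1
# no checkout, no duplicate copies: the archive is streamed from the object store
git -C tarball-demo archive --format=tar.gz --prefix=demo-r1/ r1 > demo-r1.tar.gz
tar -tzf demo-r1.tar.gz
```

Any self-hosted setup that can run one command per download request could expose the same thing.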
note to self: unlike git-annex, git internally compresses everything in `.git/objects/` with zlib (and, on top of that, it has "packfiles", which store objects delta-compressed and zlib-deflated). That means it is impossible to hardlink directly into it: the objects first need to be decompressed, which is what git does when you do `git checkout ...`; also, for files like `.nii.gz` or `.mp3`, which are already compressed, it's a huge waste of time.
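The loose-object format is easy to verify by hand: the object id is just a SHA-1 over a `blob <size>` header (NUL-terminated) plus the payload, and what lands on disk is that byte string zlib-deflated. A quick check, assuming `sha1sum` and git are on the PATH:

```shell
# the object id of the 6-byte payload "hello\n" is SHA-1("blob 6\0hello\n")
printf 'blob 6\0hello\n' | sha1sum
# git computes the same thing:
printf 'hello\n' | git hash-object --stdin
# both print ce013625030ba8dba906f756967f9e9ca394464a
```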
If we were to try to recover feature #4, we would have to figure out how to disable compression for our chosen files, the way we currently choose to annex our chosen files with
# .gitattributes
*.nii filter=annex annex.largefiles=anything
*.nii.gz filter=annex annex.largefiles=anything
In fact we might even be able to do this with `.gitattributes`: https://stackoverflow.com/questions/7102053/git-pull-without-remotely-compressing-objects
echo '*.nii.gz -delta' >> .gitattributes
(but `-delta` might only be for fetching, not for the on-disk format)
Here's an example of using git to interrogate its raw contents:
[kousu@requiem annex-hardlinks]$ git branch
annex-cache
annex-fix
* no-cache-hardlinks
trunk
[kousu@requiem annex-hardlinks]$ git ls-tree annex-cache
100755 blob 49a7e193e0473348059bc603791303fb372d6864 annex-hardlinks.sh
100644 blob 92d5ba00b4e5f923cee4e39603248073a76143cd mklogs.sh
[kousu@requiem annex-hardlinks]$ git cat-file -t 49a7e193e0473348059bc603791303fb372d6864
blob
[kousu@requiem annex-hardlinks]$ git cat-file -p 49a7e193e0473348059bc603791303fb372d6864 | head
#!/bin/sh
## inputs
FILE="sub-amu01/dwi/sub-amu01_dwi.nii.gz" # the target files to work with
## utils
canonicalize_ls() {
# this is a bit hacky
Here's interrogating the same contents without git; the weird printf string is a necessary hack: it prepends a fake gzip header so that `gzip` will accept git's bare zlib stream.
[kousu@requiem annex-hardlinks]$ cat <(printf "\x1f\x8b\x08\x00\x00\x00\x00\x00") .git/objects/49/a7e193e0473348059bc603791303fb372d6864 | gzip -dc
gzip: blob 5464#!/bin/sh
[...]
pwd # DEBUG
gzip: stdin: unexpected end of file
This also works, and without the 'unexpected end of file':
[kousu@requiem annex-hardlinks]$ cat .git/objects/49/a7e193e0473348059bc603791303fb372d6864 | pigz -dz
blob 5464#!/bin/sh
[...]
pwd # DEBUG
This didn't:
[kousu@requiem annex-hardlinks]$ pigz -dz .git/objects/49/a7e193e0473348059bc603791303fb372d6864
pigz: skipping: .git/objects/49/a7e193e0473348059bc603791303fb372d6864 does not have compressed suffix
`zlib` != `gzip`: both wrap the same DEFLATE stream, but with different headers and checksums (adler32 vs. crc32), which is why prepending a fake gzip header mostly works, and why `gzip` then complains at the end, where the expected trailer is missing. zlib's framing is lighter, but oddly there aren't as many CLI tools that can handle it.
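The difference is visible from the magic bytes alone: gzip frames start `1f 8b`, zlib frames start `78` (the `x` at the front of every loose object). A quick check, nothing git-specific:

```shell
# gzip (RFC 1952) and zlib (RFC 1950) are both wrappers around DEFLATE (RFC 1951);
# the gzip magic "1f 8b" is exactly what the printf hack above fakes
echo hello | gzip -c | od -An -tx1 | head -n1
# first bytes: 1f 8b 08 ... (10-byte gzip header, then the DEFLATE stream)
```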
Anyway, so we'd need to disable compression/packfiles for the .nii.gz files for this to work. But I bet that's possible, .gitattributes is pretty flexible.
ohhh hey here's a thread about exactly this: http://git.661346.n2.nabble.com/How-to-prevent-Git-from-compressing-certain-files-td3305492.html
I'm (ab)using Git to store my media files, i.e. digicam pictures (*.jpg) and the like. This way I can e.g. comment a series of pictures without installing and learning a special purpose "Photo Archiving" tool. Gitk shows the roadmap!
but no good answer in there. Hm.
Here's a test run:
kousu@ail:~/src/neuropoly$ mkdir t
kousu@ail:~/src/neuropoly$ cd t
kousu@ail:~/src/neuropoly/t$ git init
Initialized empty Git repository in /home/kousu/src/neuropoly/t/.git/
kousu@ail:~/src/neuropoly/t$ git config core.compression 0
kousu@ail:~/src/neuropoly/t$ git config core.looseCompression 0
kousu@ail:~/src/neuropoly/t$
kousu@ail:~/src/neuropoly/t$ git config packed.compression 0
kousu@ail:~/src/neuropoly/t$ git config pack.compression 0
kousu@ail:~/src/neuropoly/t$ git help config
kousu@ail:~/src/neuropoly/t$ git config pack.window 0
kousu@ail:~/src/neuropoly/t$ touch ^C
kousu@ail:~/src/neuropoly/t$ vi README.md
kousu@ail:~/src/neuropoly/t$ git add README.md
kousu@ail:~/src/neuropoly/t$ git ls-tree HEAD
fatal: Not a valid object name HEAD
kousu@ail:~/src/neuropoly/t$ git ls-tree --staged
error: unknown option `staged'
usage: git ls-tree [<options>] <tree-ish> [<path>...]
-d only show trees
-r recurse into subtrees
-t show trees when recursing
-z terminate entries with NUL byte
-l, --long include object size
--name-only list only filenames
--name-status list only filenames
--full-name use full path names
--full-tree list entire tree; not just current directory (implies --full-name)
--abbrev[=<n>] use <n> digits to display SHA-1s
kousu@ail:~/src/neuropoly/t$ git commit -m "ff"
[master (root-commit) 0877299] ff
1 file changed, 1 insertion(+)
create mode 100644 README.md
kousu@ail:~/src/neuropoly/t$ git ls-tree HEAD
100644 blob 4c9521dffe17f7d571a2cc683fb33440d8738072 README.md
kousu@ail:~/src/neuropoly/t$ ls .git/objects/
08/ 4c/ 5d/ info/ pack/
kousu@ail:~/src/neuropoly/t$ file .git/objects/4c/9521dffe17f7d571a2cc683fb33440d8738072
.git/objects/4c/9521dffe17f7d571a2cc683fb33440d8738072: zlib compressed data
kousu@ail:~/src/neuropoly/t$ ls -lh .git/objects/4c/9521dffe17f7d571a2cc683fb33440d8738072
-r--r--r-- 1 kousu kousu 39 Apr 15 14:06 .git/objects/4c/9521dffe17f7d571a2cc683fb33440d8738072
kousu@ail:~/src/neuropoly/t$ stat .git/objects/4c/9521dffe17f7d571a2cc683fb33440d8738072
File: .git/objects/4c/9521dffe17f7d571a2cc683fb33440d8738072
Size: 39 Blocks: 8 IO Block: 4096 regular file
Device: fd01h/64769d Inode: 874548 Links: 1
Access: (0444/-r--r--r--) Uid: ( 1000/ kousu) Gid: ( 1000/ kousu)
Access: 2021-04-15 14:06:50.273530212 -0400
Modify: 2021-04-15 14:06:23.452367482 -0400
Change: 2021-04-15 14:06:23.453367078 -0400
Birth: -
kousu@ail:~/src/neuropoly/t$ stat README.md
File: README.md
Size: 20 Blocks: 8 IO Block: 4096 regular file
Device: fd01h/64769d Inode: 874547 Links: 1
Access: (0664/-rw-rw-r--) Uid: ( 1000/ kousu) Gid: ( 1000/ kousu)
Access: 2021-04-15 14:06:23.450368290 -0400
Modify: 2021-04-15 14:06:22.083920591 -0400
Change: 2021-04-15 14:06:22.102912914 -0400
Birth: -
kousu@ail:~/src/neuropoly/t$ git config
usage: git config [<options>]
Config file location
--global use global config file
--system use system config file
--local use repository config file
-f, --file <file> use given config file
--blob <blob-id> read config from given blob object
Action
--get get value: name [value-regex]
--get-all get all values: key [value-regex]
--get-regexp get values for regexp: name-regex [value-regex]
--get-urlmatch get value specific for the URL: section[.var] URL
--replace-all replace all matching variables: name value [value_regex]
--add add a new variable: name value
--unset remove a variable: name [value-regex]
--unset-all remove all matches: name [value-regex]
--rename-section rename section: old-name new-name
--remove-section remove a section: name
-l, --list list all
-e, --edit open an editor
--get-color find the color configured: slot [default]
--get-colorbool find the color setting: slot [stdout-is-tty]
Type
--bool value is "true" or "false"
--int value is decimal number
--bool-or-int value is --bool or --int
--path value is a path (file or directory name)
--expiry-date value is an expiry date
Other
-z, --null terminate values with NUL byte
--name-only show variable names only
--includes respect include directives on lookup
--show-origin show origin of config (file, standard input, blob, command line)
kousu@ail:~/src/neuropoly/t$ git config -l
user.email=nick@kousu.ca
user.name=Nick
push.default=simple
merge.ff=only
diff.gpg.textconv=gpg -d --no-tty
filter.lfs.smudge=git-lfs smudge -- %f
filter.lfs.process=git-lfs filter-process
filter.lfs.required=true
filter.lfs.clean=git-lfs clean -- %f
fetch.prune=true
fetch.prunetags=true
core.repositoryformatversion=0
core.filemode=true
core.bare=false
core.logallrefupdates=true
core.compression=0
core.loosecompression=0
packed.compression=0
pack.compression=0
pack.window=0
`core.compression 0` didn't work?? `file` still reports zlib data. Though at level 0, zlib only adds framing (a 2-byte header plus an adler32 checksum) around the stored bytes, so the payload itself isn't being compressed; the object is still wrapped, and still prefixed with `blob <size>`, so it still can't be hardlinked.
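A quick way to see what level 0 actually produces (a sketch; the file content is made up): with `core.compression 0` the payload survives verbatim inside the zlib framing, which `strings` can pull back out:

```shell
# store one object with compression disabled and inspect the raw bytes
rm -rf c0-demo && git init -q c0-demo
git -C c0-demo config core.compression 0
git -C c0-demo config core.loosecompression 0
echo "la la la lal" > c0-demo/lol.txt
git -C c0-demo hash-object -w lol.txt
# level-0 zlib = "x\x01" framing around stored (uncompressed) bytes,
# so the payload is readable straight out of .git/objects/
strings "$(find c0-demo/.git/objects -type f)"
```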
Maybe more tips in this thread https://public-inbox.org/git/20100514051049.GF6075@coredump.intra.peff.net/
Is there a trick to getting git to simply "copy files as is"? In other words, don't attempt to compress them, don't attempt to "diff" them, just store/copy/transfer the files as-is?
Hopefully you can pick out the answer to that question from the above statements. :)
So I tried `-delta`, and it got closer!
kousu@ail:~/src/neuropoly/t$ vi .gitattributes
kousu@ail:~/src/neuropoly/t$ git add .gitattributes
kousu@ail:~/src/neuropoly/t$ git commit -m "attrs"
[master a83db84] attrs
1 file changed, 1 insertion(+)
create mode 100644 .gitattributes
kousu@ail:~/src/neuropoly/t$ cat .gitattributes
*.txt -delta
kousu@ail:~/src/neuropoly/t$ vi lol.txt
kousu@ail:~/src/neuropoly/t$ git add lol.txt
kousu@ail:~/src/neuropoly/t$ git commit -m "lol"
[master 7f4909e] lol
1 file changed, 2 insertions(+)
create mode 100644 lol.txt
kousu@ail:~/src/neuropoly/t$ git ^C
kousu@ail:~/src/neuropoly/t$ ls
lol.txt README.md
kousu@ail:~/src/neuropoly/t$ git ls-tree HEAD
100644 blob 85e6910cb39f0a51e2fc52517d6b902e142a442e .gitattributes
100644 blob 4c9521dffe17f7d571a2cc683fb33440d8738072 README.md
100644 blob e33ed36f613eba1484cff5c2f78b34c1ab88baaf lol.txt
kousu@ail:~/src/neuropoly/t$ git cat-file -t e33ed36f613eba1484cff5c2f78b34c1ab88baaf
blob
kousu@ail:~/src/neuropoly/t$ git cat-file -p e33ed36f613eba1484cff5c2f78b34c1ab88baaf
la la la lal
stuff things pieces
kousu@ail:~/src/neuropoly/t$ git cat-file -p e33ed36f613eba1484cff5c2f78b34c1ab88baaf^C
kousu@ail:~/src/neuropoly/t$ ls .git/objects/e3/3ed36f613eba1484cff5c2f78b34c1ab88baaf
.git/objects/e3/3ed36f613eba1484cff5c2f78b34c1ab88baaf
kousu@ail:~/src/neuropoly/t$ file .git/objects/e3/3ed36f613eba1484cff5c2f78b34c1ab88baaf
.git/objects/e3/3ed36f613eba1484cff5c2f78b34c1ab88baaf: zlib compressed data
kousu@ail:~/src/neuropoly/t$ cat .git/objects/e3/3ed36f613eba1484cff5c2f78b34c1ab88baaf
x+��blob 35 la la la lal
stuff things pieces
It still shoved a little header on top though. Rude.
This was linked down in the thread; dead link but bless the wayback machine: http://web.archive.org/web/20110109112717/http://www.mentby.com/Group/git/how-to-prevent-git-from-compressing-certain-files.html
ah but it's just the other thread again. Drat.
Obligatory xkcd: https://xkcd.com/979/
Maybe `git config core.bigFileThreshold 1`?
No, didn't seem to work:
kousu@ail:~/src/neuropoly/t3$ git init
Initialized empty Git repository in /home/kousu/src/neuropoly/t3/.git/
kousu@ail:~/src/neuropoly/t3$ git config core.bigFileThreshold 1
kousu@ail:~/src/neuropoly/t3$ git ^C
kousu@ail:~/src/neuropoly/t3$ echo lololol > README.md
kousu@ail:~/src/neuropoly/t3$ git add README.md
kousu@ail:~/src/neuropoly/t3$ git commit -m "lol"
[master (root-commit) c40fbf3] lol
1 file changed, 0 insertions(+), 0 deletions(-)
create mode 100644 README.md
kousu@ail:~/src/neuropoly/t3$ ls .git/objects/
13/ c4/ info/ pack/
kousu@ail:~/src/neuropoly/t3$ ls .git/objects/13/f3364cb4a34a8d6ab4681eb2288b1d01859811
.git/objects/13/f3364cb4a34a8d6ab4681eb2288b1d01859811
kousu@ail:~/src/neuropoly/t3$ file .git/objects/13/f3364cb4a34a8d6ab4681eb2288b1d01859811
.git/objects/13/f3364cb4a34a8d6ab4681eb2288b1d01859811: zlib compressed data
kousu@ail:~/src/neuropoly/t3$ cat .git/objects/13/f3364cb4a34a8d6ab4681eb2288b1d01859811
x+)JMU06g040031rut�u��Ma�>��I��Խ��#K���{`�!E�kousu@ail:~/src/neuropoly/t3$ cat .git/objects/13/f3364cb4a34a8d6ab4681eb2288b1d01859811 ^C
kousu@ail:~/src/neuropoly/t3$ ^C
kousu@ail:~/src/neuropoly/t3$ ^C
kousu@ail:~/src/neuropoly/t3$ cat .git/objects/
13/ c4/ info/ pack/
kousu@ail:~/src/neuropoly/t3$ cat .git/objects/c4/0fbf3e0ede1edeb8cfb1b0619287b211d03e89
x��Q
�0
�a�{�\`#i�A��i온+h{��N������ʾ/���G�@�BdK��U樉�PNދ$����
��V����b+
���Z����#P���nȈ�~����n+��ʬ,Xkousu@ail:~/src/neuropoly/t3$
More clues in https://git-scm.com/book/en/v2/Git-Internals-Packfiles
Oh by the way, there's `core.sparseCheckout`, which looks like it might replace feature 2.ii., even for end-users. It's harder to use than `git annex get <files that I want>`, but I think we can assume it won't be done that often.
I think the way to think about feature 4 is that it is a tradeoff between different kinds of compression.
There are four kinds of compression involved: zlib (git's per-object compression), delta compression (packfiles), the files' own .gz compression, and hardlinking (deduplication between the object store and the checkout).
On our images, pre-compressed with .gz, zlib compression is useless, even counter-productive, but using hardlinks saves about half the space. And they are incompatible: you can't link directly to a compressed version, because it's a different byte stream. And git uses zlib always: even if you set `core.compression 0`, it still wraps content in a zlib header.
So we would first need to patch git to add a `+loose` or `+direct` flag for specific files, which would get them stored uncompressed, without a header.
Similarly, delta compression is incompatible with hardlink deduplication, because there's generally only one full copy of a delta-compressed file, and the other versions are stored as diffs against it.
So we would need to disable delta compression, which should be doable with the `-delta` gitattribute. But I'm not clear how/when that applies.
Maybe you could keep delta compression on and just hardlink to the base version, if it happened be the version you were checking out. In our use case, our files don't change very often, and we generally only want to look at the most recent one anyway. Perhaps we could either
a. always repack so that the base version is the most recent and the past ones are deltas off that; if you check out an older one you don't get the benefit of a hardlink but in the common case you do
b. or, at `git checkout $ref`, do an implicit `git repack --rebase-to $ref` (a hypothetical option) which combs through all the files and repacks them so `$ref` is the base version; this will be slow, but probably not much slower than the status quo of unpacking every file every time, and again, on our workload, manageable because our files don't change that much
Supporting hardlinks can cause corruption, though, and `annex.thin` warns about this too: if you make a series of commits to a file without pushing, the intermediates will be lost forever. Which is probably fine for us; we probably don't want to be publishing intermediate versions anyway, since the files are so large.
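The corruption mode is easy to reproduce with nothing but `ln` (the paths are stand-ins, not the real annex layout): the checked-out file and the stored object share one inode, so an in-place edit hits both.

```shell
# "stored-object" stands in for the object store, "checkout" for the working tree
rm -rf hl-demo && mkdir hl-demo
echo "original data" > hl-demo/stored-object
ln hl-demo/stored-object hl-demo/checkout   # hardlink instead of copying: 0 extra bytes
echo "edited data" > hl-demo/checkout       # user edits the working copy in place...
cat hl-demo/stored-object                   # ...and the stored copy has changed too
```

(Editors that replace-by-rename instead of truncating in place would break the link rather than corrupt the store, which is part of why this is so hard to reason about.)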
Here's a completely different approach to accomplishing goal 4 (emulating `annex.thin`): instead of checking out the files, mount them.
Any of these should avoid having to physically copy anything. They probably decompress the zlib content on the fly. Maybe not so good for contributions, but for read-only use, like testing in CI or just doing a processing run? This could save a lot of time and space.
Moreover, I wonder if we could fork one of these so that they mount appearing to be a regular git folder, but featuring:
`git cat-file -p` (or equivalent) to read contents
Pinning this for later: it's possible, and not even that hard, to write alternate git backends, e.g. https://github.com/anishathalye/git-remote-dropbox/blob/01b630ab697d9b9423915e88e43dd24072e0d591/git_remote_dropbox/helper.py#L92 adds `dropbox://` to `git`, so you can `git clone dropbox://my-repos/project1`.
We could probably exploit this if we wanted to, say, retrofit annex's S3 remote into plain git. We could still have the benefit of a major CDN backing the bulk of our data without all the complication of tracking it all with git-annex.
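For a sense of what "write a backend" involves: a remote helper is just a program named `git-remote-<scheme>` that git runs and feeds commands on stdin. A purely illustrative stub of that protocol (the hypothetical s3 helper would answer these by actually talking to the bucket):

```shell
# sketch of the remote-helper protocol that git-remote-dropbox implements;
# installed on $PATH as e.g. `git-remote-s3`, git would drive it for s3:// URLs
helper() {
    while read -r cmd; do
        case "$cmd" in
            capabilities)
                # advertise which protocol commands we support;
                # a blank line terminates each response
                printf 'fetch\npush\nlist\n\n' ;;
            list*)
                printf '\n' ;;      # stub: no refs stored on the remote yet
            *)
                break ;;
        esac
    done
}
# git's first question to any helper is always "capabilities":
printf 'capabilities\n\n' | helper
```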
Lessons 'bout partial clones:
- you need `git config --global uploadpack.allowFilter true` on the server, under the git@ account that manages the repos (not obviously documented; I found it via this SO answer: https://stackoverflow.com/a/52916879)
- there's a `--sparse` option: https://docs.gitlab.com/ee/topics/git/partial_clone.html
- it's not clear to me how `--sparse` and promisor remotes are supposed to interact. They seem to solve the same problem, don't they?
The trick is:
git clone --filter=blob:none --no-checkout ... && cd repo && git checkout master -- paths/to check/ out/
`--filter=blob:none` means nothing is downloaded until specifically requested by a checkout, and `--no-checkout` ensures that the first checkout doesn't happen.
To see what folders you could download (analogous to looking for dangling symlinks in git-annex <=v7, or files with SHA256 hashes in them in git-annex v8), use `ls-tree`:
git ls-tree master -- ./
The trick above has a performance problem, explored in https://stackoverflow.com/questions/600079/how-do-i-clone-a-subdirectory-only-of-a-git-repository/52269934#52269934, https://github.com/isaacs/github/issues/1888: it successfully does a partial download, but it doesn't batch the objects it does download, instead making a new connection for each one:
git@data:~$ time ( git clone --filter=blob:none --no-checkout https://github.com/cirosantilli/test-git-partial-clone-big-small && cd test-git-partial-clone-big-small && git checkout master -- small )
Cloning into 'test-git-partial-clone-big-small'...
remote: Enumerating objects: 4, done.
remote: Total 4 (delta 0), reused 0 (delta 0), pack-reused 4
Receiving objects: 100% (4/4), 10.02 KiB | 10.02 MiB/s, done.
remote: Enumerating objects: 1, done.
remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 1
Receiving objects: 100% (1/1), 42 bytes | 42.00 KiB/s, done.
remote: Enumerating objects: 1, done.
remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 1
[...]
remote: Enumerating objects: 1, done.
remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 1
Receiving objects: 100% (1/1), 42 bytes | 6.00 KiB/s, done.
real 1m54.662s
user 0m0.180s
sys 0m0.358s
(full log.txt)
The git developers want to push people towards using `--sparse`: https://github.com/isaacs/github/issues/1888#issuecomment-760484623
That "extra state" allows Git to do things like batch object requests in a partial clone in a sane way.
Sparse-checkout and partial clone are actively being developed to work more closely together, so you're more likely to have success in that direction.
And indeed if we try that the download is much faster:
git@data:~$ time ( git clone --filter=blob:none --sparse https://github.com/cirosantilli/test-git-partial-clone-big-small && cd test-git-partial-clone-big-small && git sparse-checkout add small )
Cloning into 'test-git-partial-clone-big-small'...
remote: Enumerating objects: 4, done.
remote: Total 4 (delta 0), reused 0 (delta 0), pack-reused 4
Receiving objects: 100% (4/4), 10.02 KiB | 10.02 MiB/s, done.
remote: Enumerating objects: 1, done.
remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 1
Receiving objects: 100% (1/1), 582 bytes | 582.00 KiB/s, done.
remote: Enumerating objects: 253, done.
remote: Total 253 (delta 0), reused 0 (delta 0), pack-reused 253
Receiving objects: 100% (253/253), 2.50 KiB | 2.50 MiB/s, done.
real 0m1.523s
user 0m0.065s
sys 0m0.045s
git@data:~$ ls test-git-partial-clone-big-small/
generate.sh small
git@data:~$ du -hs test-git-partial-clone-big-small/
4.3M test-git-partial-clone-big-small/
There's a catch: `clone --sparse` and `sparse-checkout add` weren't added until at least git 2.27. Ubuntu 20.04 LTS, which is on most of our internal machines, is only at 2.25.
I have a workaround: you can write git 2.27's
git clone --filter=blob:none --sparse https://github.com/cirosantilli/test-git-partial-clone-big-small && \
cd test-git-partial-clone-big-small && \
git sparse-checkout add small
as git 2.25's
git clone --filter=blob:none --no-checkout https://github.com/cirosantilli/test-git-partial-clone-big-small && \
cd test-git-partial-clone-big-small && \
git sparse-checkout init && \
( git sparse-checkout list; echo small) | git sparse-checkout set --stdin
Unfortunately the 2.25 version does not work on 2.27: it doesn't fill in any files. I'm trying to figure out if there's an extra `git` line or two we could add to make an incantation that works universally.
This version works on both, but it's much more verbose:
mkdir test-git-partial-clone-big-small && \
cd test-git-partial-clone-big-small && \
git init && \
git remote add origin https://github.com/cirosantilli/test-git-partial-clone-big-small && \
git config remote.origin.promisor true && \
git config remote.origin.partialclonefilter blob:none && \
git sparse-checkout init && \
echo This next step is 'git sparse-checkout add small' implemented on an older git && \
( echo small >> .git/info/sparse-checkout ) && \
git pull origin master
and this version suffers from lacking `sparse-checkout add`; it relies on `git pull` to trigger actually downloading the files, and I can't figure out how to retrigger that once the `pull` has happened once.
Another problem with 2.25 is that a simple `git status` will trigger a download of missing `.git/objects/` files (in both versions, `git log --stat` or `git log -p` will also trigger such a download, but that's more understandable: they need to observe the contents of the old versions to produce diffs/diffstats). A third problem is that 2.25 is missing `sparse-checkout --cone`, which makes the sparseness selected by folders instead of by individual regexes on files; we probably want to be using it, because it's a big performance drag to run regexes over every folder. So I'm thinking git 2.25 is a dead end.
So we have two options: require git >= 2.27 and use `clone --sparse` + `sparse-checkout add`, or find an incantation that also works on older gits -- but...maybe impossible?
Before I forget: the difference between `sparse-checkout` and `--filter` is:
- `--filter` refers to `.git/objects/`; it controls what is downloaded by default
- but `--filter` is overridden by objects needed for the current checkout, i.e. the current contents of the working directory.
Thus:
- By default, git downloads the entire git log and every version of every file to `.git/objects`, and copies the ones needed for `master` to the working directory.
- With `--filter`, it only downloads the `.git/objects` needed for `master`.
- With `--sparse --filter`, it only downloads the `.git/objects` in the root directory of `master`, and then it will lazily add anything requested with `git sparse-checkout add`.
So a partial clone equivalent to `git annex get paths/ to/ download/` must use `--sparse --filter` in tandem. But that's fine; that's still less complicated than installing an entire extra app.
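Putting the pieces together, here is the `git annex get` analogue end-to-end, using a local repo as a stand-in server (setting `uploadpack.allowFilter` on it plays the server-side role noted earlier; needs a reasonably recent git, and the dataset layout is made up):

```shell
# build a toy "server" repo with a big folder and a small folder
rm -rf pc-src pc-clone
git init -q pc-src
git -C pc-src config uploadpack.allowFilter true   # let clients use --filter
mkdir -p pc-src/big pc-src/small
echo "pretend this is a huge scan" > pc-src/big/scan.nii.gz
echo "notes"                       > pc-src/small/README.md
git -C pc-src add .
git -C pc-src -c user.email=demo@example.com -c user.name=demo commit -qm "init"

# blob-less sparse clone, then lazily materialize just small/
git clone -q --filter=blob:none --sparse "file://$PWD/pc-src" pc-clone
git -C pc-clone sparse-checkout set small   # fetches and checks out only small/
ls pc-clone                                 # big/ was never downloaded
```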
Ah what's this, on 2.25:
u108545@joplin:~/plain-git$ git clone --filter=blob:none --sparse https://github.com/cirosantilli/test-git-partial-clone-big-small
Cloning into 'test-git-partial-clone-big-small'...
fatal: cannot change to 'https://github.com/cirosantilli/test-git-partial-clone-big-small': No such file or directory
error: failed to initialize sparse-checkout
2.25's manpage doesn't document `clone --sparse`, but it looks like it's trying to do it, and just... failing? Maybe there's a way to get it to work after all.
But I dug into the code between 2.25 and the version that works and found
- if (option_sparse_checkout && git_sparse_checkout_init(repo))
+ if (option_sparse_checkout && git_sparse_checkout_init(dir))
return 1;
so it was just an obvious bug, an oversight. And I don't think any amount of command-line trickery is going to make `repo = dir`, not without breaking the rest of the clone anyway.
I should also throw in that `clone --depth $n` (+ `fetch --deepen`) is another feature that can be used in tandem. It's, unfortunately, not orthogonal to `--sparse` nor `--filter`. It is a different kind of partial clone: it clones only the `.git/objects/` needed to get `$n` commits deep into history. My instinct is that `--filter` is the more general solution, but `--depth` will prevent accidental unintended downloads (e.g. caused by running `git log --stat` on the complete history), and it's older and maybe better supported; and anyway, we can use all three flags at once to absolutely minimize the necessary bandwidth and storage.
> Pinning this for later: it's possible, and not even that hard, to write alternate git backends, e.g. https://github.com/anishathalye/git-remote-dropbox/blob/01b630ab697d9b9423915e88e43dd24072e0d591/git_remote_dropbox/helper.py#L92 adds `dropbox://` to `git`, so you can `git clone dropbox://my-repos/project1`.
> We could probably exploit this if we wanted to, say, retrofit annex's S3 remote into plain git. We could still have the benefit of a major CDN backing the bulk of our data without all the complication of tracking it all with git-annex.
To emphasize: this is a way to achieve goal 3, distribution through a CDN. I wrote it originally as "mix and match servers", but that's not the real goal; the real goal is to cut distribution costs, and after a year of dealing with it I think that mixing and matching servers the way `datalad`/`git-annex` encourage is just a recipe for making everything extremely confusing and broken in the long run.
I think what we should do is find/write `git-remote-s3`, so you can do `git remote add s3://my-bucket-name`, upload the entire repo there, trees and commits and all, and then that can be the primary mirror people get our dataset from. Installing and configuring `git-remote-s3` won't be any harder than installing and configuring git-annex, datalad, and rclone, or git-annex, datalad, and awscli.
TODO: @jcohenadad wants me to actually do a practical test on `git@data.neuro.polymtl.ca:datasets/sci-zurich`. I think I will try to make a copy (`sci-zurich-git`?) and rebase it to have the same history, just without the annex.
Some more motivation for doing this: a report from a collaborator today that
For 3 weeks, I have been urging the server admins to install git-annex but without any response.
git is more commonly known, so, there you go.
I found apps today that implement git-as-virtual-filesystem (writeup). These could be used to implement reproducible pipelines by, as a first step, mounting them and then embedding the path (i.e. including the commit ID) into the pipeline; this has the big advantage for reproducibility that every path pins an exact version, and the other big advantage that, if combined with `sshfs` or maybe webdav, you should be able to lazily download data, which avoids the entire headache of using git-annex or git-lfs.
Like #31, this is a proposal to drop `git-annex`.
What abilities does `git-annex` give us?
- downloading only part of a dataset (`git annex get sub-amu* sub-beijing*/`)
- `git config annex.thin`, which means a checked-out dataset only uses the space it's using, instead of the default git behaviour of doubling the space used; `git-lfs` does not even have this feature
Counterarguments:
There are two use cases for partial datasets:
`git clone --depth 1` and `git fetch --depth 1 <branch>` allow you to download only the files needed for the latest version. So that takes care of use case #1. And then there's `--deepen` if you do need to go back in time after the fact. partial-clone shares a lot of the fundamentals with `git-annex` but, being integrated directly into git and designed by the git team, will be much less glitch-prone.
Two responses:
How often do you actually want to download a subdataset? Our instructions to users so far have always been
When you really do need it, you can set it up with plain git, it just has to be done on a branch ahead of time.
Then to use it, on a different server/machine:
This puts the onus for setting up subdatasets onto the admins, the people who are keeping full copies of the dataset handy to operate on. But I think that's manageable because, again, how often do we really set up sub-datasets? And plus, this way, work is reproducible because the branch is saved and shared! datalad's recommendation that each user should be responsible for picking out the parts of a dataset they are interested in is fragile.
`annex.thin`. That's a tricky one. There's `git relink`, but it seems abandoned, and it operated cross-repo, deduplicating files between multiple `.git/objects/` folders rather than `.git/objects/`-to-checkout, so it is more the analog of `annex.hardlink`.