neuropoly / data-management

Repo that deals with datalad aspects for internal use

Considering plain `git` #68

Open · kousu opened this issue 3 years ago

kousu commented 3 years ago

Like #31, this is a proposal to drop git-annex.

What abilities does git-annex give us?

  1. It lets us integrate with the neuroinformatics community that is trying to standardize on https://datalad.org (e.g. https://openneuro.org, https://gin.g-node.org/)
  2. It allows partial dataset downloads. I think this falls into two rather distinct use-cases:
    1. So that users don't need to download deleted versions of images just to use the current dataset
    2. So that users can pick out subsets of the current dataset (git annex get sub-amu* sub-beijing*/).
  3. It allows us to mix and match servers
    • so, we can keep the metadata on Github but the bulk of our data on Amazon, which is cheaper and more expandable
  4. It supports git config annex.thin, which means a checked-out dataset only uses the space it's using instead of the default git behaviour of doubling the space used (see the sketch just after this list)
    • notably, git-lfs does not even have this feature
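
For concreteness, here is roughly what features 2.ii and 4 look like with git-annex today (the globs are just examples):

# feature 2.ii: fetch only selected subjects
git annex get sub-amu* sub-beijing*/

# feature 4: let checkouts hardlink into the annex instead of copying (this is what we'd lose)
git config annex.thin true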

Counterarguments:

  1. Most labs aren't even using git at all yet. Any version tracking is considered a huge boon, so if we chart a different course from datalad we can teach them our way. Also, GIN, which seems to be by far the most reliable of the git-based neuroinformatics servers, is just Gogs, so it should be compatible with plain git.
  2. There are two use cases for partial datasets:

    1. git clone --depth 1 and git fetch --depth 1 <branch> allow you to download only the files needed for the latest version. So that takes care of use case 2.i. And then there's --deepen if you do need to go back in time after the fact.
      • furthermore, the git project is working on integrating this directly into core git: https://git-scm.com/docs/partial-clone. It looks like partial clone shares a lot of fundamentals with git-annex but, being integrated directly into git and designed by the git team, should be much less glitch-prone
    2. Two responses:

      1. How often do you actually want to download a subdataset? Our instructions to users so far have always been

        git annex init; git annex get .
      2. When you really do need it, you can set it up with plain git; it just has to be done on a branch ahead of time.

        test ! -d my-dataset && git clone git@whatever:my-dataset ## *slow*, one-time step
        cd my-dataset
        git checkout -b my-sub-dataset
        shopt -s extglob
        git rm -r !(sub-amu*|sub-beijing*) # extglob negation: keep just sub-amu* and sub-beijing*
        git push -u origin my-sub-dataset

        Then to use it, on a different server/machine:

        git clone -b my-sub-dataset --depth 1 git@whatever:my-dataset

        This puts the onus for setting up subdatasets onto the admins, the people who keep full copies of the dataset handy to operate on. But I think that's manageable because, again, how often do we really set up sub-datasets? Plus, this way the work is reproducible, because the branch is saved and shared! datalad's recommendation that each user should be responsible for picking out the parts of a dataset they are interested in is fragile.

  3. If we're paying for hosting anyway (which we have to; our data is too large for anything else) we might as well pay for full git hosting (i.e. pay GIN, or maybe MIC-UNF, or pay for our own server to put GIN on), and keep the metadata with the bulk of the data. Keeping everything together makes it much easier to archive safely (even git-lfs recognizes this: a git-lfs URL is implied from the git remote URL, and GitHub's, GitLab's, and I think Gogs' git-lfs implementations store the LFS files next to the repo).
    • If this is just because we want to have a presence on Github, then we can put an empty repo here that links to GIN or our own Gogs server
    • Counter-counterargument: AWS S3 has cheaper bandwidth than VPS plans, and can be fronted by a CDN (though we don't currently do that) to further bring the cost down
      • Counter-counter-counterargument: git has fairly flexible 'fetch' URLs, and they can be https://; so, with a bit of research, we could probably front our repos with a standard CDN and bring our costs down.
  4. I don't have a solution for porting annex.thin. That's a tricky one.
    • I found git relink but it seems abandoned, and it operated cross-repo, deduplicating files between multiple .git/objects/ folders (not between .git/objects/ and the checkout), so it is more the analog of annex.hardlink.
    • But maybe we can dig it up and salvage it?
    • Or maybe we can write a smudge/clean filter that replaces files with their hardlinks (sketched just below)? It would probably be tricky but I think it's possible and worth investing in and sharing with the world.
    • This might be impossible because git objects don't necessarily correspond directly to files. They can be compressed, maybe merged with other objects? I'm not really too sure. But I would learn the innards of git to solve this if it meant we could stop using git-annex.
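
To make the filter idea concrete: a minimal sketch of where such a filter would plug in, assuming hypothetical hardlink-clean/hardlink-smudge helpers that don't exist yet (and see the caveat about zlib below):

# hypothetical helpers: this shows only the plumbing, not a working implementation
git config filter.hardlink.clean 'hardlink-clean %f'
git config filter.hardlink.smudge 'hardlink-smudge %f'
echo '*.nii.gz filter=hardlink' >> .gitattributes
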
jcohenadad commented 3 years ago

Thank you for opening this thread, @kousu. It clarifies very well where we are at in terms of git-annex usage. I would add to the counterargument list the many issues we've encountered with git-annex versions (v8 vs. v7 vs. v6, etc.): people not realizing there is a version requirement (this will happen, again and again), or not being able to install the proper version because of their distros; and the conda workaround is good for people who are familiar with the terminal, but not for my grandmother.

git, on the other hand, is more reliable in that respect.

kousu commented 3 years ago

@jcohenadad pointed out a way to mitigate point 4, about annex.thin. Last summer, when we were trying to keep all of our data in plain git, we were telling people to just download release tarballs: https://github.com/spine-generic/data-single-subject_DO-NOT-USE#download-zip-package-recommended

It's not at all unreasonable to ask users of the datasets to download releases as tarballs. Maybe even version-pinned in their software, with pooch?

It would only be contributors to datasets who would need to get the git copy and suffer through having duplicated files. Which sucks. But we can provide those people with server space they can work on if the data is too much for their personal computers.

A really handy thing about Github is that it generates release tarballs dynamically, straight out of .git, so it doesn't need to keep duplicate files around. If we were to self-host (#70), well, I assume Gogs can do the same thing? Gitolite (https://github.com/neuropoly/datalad/blob/ng/gitolite/internal-server.md) can't, but maybe we can hack it in using git fast-export?
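
If we did have to hack it in ourselves, my guess (an assumption, untested against Gitolite) is that git archive is actually the natural building block; the repo path and tag name here are hypothetical:

# generate a release tarball straight out of a bare repo, no checkout needed
git -C /srv/git/my-dataset.git archive --format=tar.gz -o /srv/releases/my-dataset-r20210415.tar.gz r20210415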

kousu commented 3 years ago

note to self: unlike git-annex, git internally compresses everything in .git/objects/ with zlib, and possibly more (it has something called "packfiles", which sound a lot like they are probably a .tar.gz or something). That means it is impossible to hardlink directly into it: the objects first need to be decompressed, which is what git does when you do git checkout .... Also, for files like .nii.gz or .mp3 which are already compressed, it's a huge waste of time.

If we were to try to recover feature #4 we would have to figure out how to disable compression for our chosen files, the way we currently choose to annex our chosen files with

# .gitattributes
*.nii     filter=annex annex.largefiles=anything
*.nii.gz  filter=annex annex.largefiles=anything

In fact we might even be able to do this with .gitattributes: https://stackoverflow.com/questions/7102053/git-pull-without-remotely-compressing-objects

echo '*.nii.gz -delta' >> .gitattributes

(but -delta might only be for fetching, not for the on-disk format)

Here's an example of using git to interrogate its raw contents:

[kousu@requiem annex-hardlinks]$ git branch
  annex-cache
  annex-fix
* no-cache-hardlinks
  trunk
[kousu@requiem annex-hardlinks]$ git ls-tree annex-cache
100755 blob 49a7e193e0473348059bc603791303fb372d6864    annex-hardlinks.sh
100644 blob 92d5ba00b4e5f923cee4e39603248073a76143cd    mklogs.sh
[kousu@requiem annex-hardlinks]$ git cat-file -t 49a7e193e0473348059bc603791303fb372d6864 
blob
[kousu@requiem annex-hardlinks]$ git cat-file -p 49a7e193e0473348059bc603791303fb372d6864 | head
#!/bin/sh

## inputs

FILE="sub-amu01/dwi/sub-amu01_dwi.nii.gz" # the target files to work with

## utils

canonicalize_ls() {
  # this is a bit hacky

Here's then interrogating the same contents without git; the weird string is a necessary hack: a fake gzip header, prepended because gzip won't otherwise read a raw zlib stream

[kousu@requiem annex-hardlinks]$ cat <(printf "\x1f\x8b\x08\x00\x00\x00\x00\x00") .git/objects/49/a7e193e0473348059bc603791303fb372d6864 | gzip -dc
gzip: blob 5464#!/bin/sh

[...]

pwd # DEBUG
gzip: stdin: unexpected end of file

This also works, and without the 'unexpected end of file':

[kousu@requiem annex-hardlinks]$ cat  .git/objects/49/a7e193e0473348059bc603791303fb372d6864  | pigz -dz
blob 5464#!/bin/sh

[...]

pwd # DEBUG

This didn't:

[kousu@requiem annex-hardlinks]$ pigz -dz .git/objects/49/a7e193e0473348059bc603791303fb372d6864 
pigz: skipping: .git/objects/49/a7e193e0473348059bc603791303fb372d6864 does not have compressed suffix

zlib != gzip (despite... using gzip above? I'm confused; as far as I can tell both wrap the same DEFLATE data, just with different headers and checksums); I guess zlib must be better, but there aren't as many CLI tools that can handle it, oddly.

Anyway, so we'd need to disable compression/packfiles for the .nii.gz files for this to work. But I bet that's possible, .gitattributes is pretty flexible.

ohhh hey here's a thread about exactly this: http://git.661346.n2.nabble.com/How-to-prevent-Git-from-compressing-certain-files-td3305492.html

I'm (ab)using Git to store my media files, i.e. digicam pictures (*.jpg) and the like. This way I can e.g. comment a series of pictures without installing and learning a special purpose "Photo Archiving" tool. Gitk shows the roadmap!

but no good answer in there. Hm.

Here's a test run:

kousu@ail:~/src/neuropoly$ mkdir t
kousu@ail:~/src/neuropoly$ cd t
kousu@ail:~/src/neuropoly/t$ git init
Initialized empty Git repository in /home/kousu/src/neuropoly/t/.git/
kousu@ail:~/src/neuropoly/t$ git config core.compression 0
kousu@ail:~/src/neuropoly/t$ git config core.looseCompression 0
kousu@ail:~/src/neuropoly/t$ 
kousu@ail:~/src/neuropoly/t$ git config packed.compression 0
kousu@ail:~/src/neuropoly/t$ git config pack.compression 0
kousu@ail:~/src/neuropoly/t$ git help config
kousu@ail:~/src/neuropoly/t$ git config pack.window 0
kousu@ail:~/src/neuropoly/t$ touch ^C
kousu@ail:~/src/neuropoly/t$ vi README.md
kousu@ail:~/src/neuropoly/t$ git add README.md 
kousu@ail:~/src/neuropoly/t$ git ls-tree HEAD
fatal: Not a valid object name HEAD
kousu@ail:~/src/neuropoly/t$ git ls-tree --staged
error: unknown option `staged'
usage: git ls-tree [<options>] <tree-ish> [<path>...]

    -d                    only show trees
    -r                    recurse into subtrees
    -t                    show trees when recursing
    -z                    terminate entries with NUL byte
    -l, --long            include object size
    --name-only           list only filenames
    --name-status         list only filenames
    --full-name           use full path names
    --full-tree           list entire tree; not just current directory (implies --full-name)
    --abbrev[=<n>]        use <n> digits to display SHA-1s

kousu@ail:~/src/neuropoly/t$ git commit -m "ff"
[master (root-commit) 0877299] ff
 1 file changed, 1 insertion(+)
 create mode 100644 README.md
kousu@ail:~/src/neuropoly/t$ git ls-tree HEAD
100644 blob 4c9521dffe17f7d571a2cc683fb33440d8738072    README.md
kousu@ail:~/src/neuropoly/t$ ls .git/objects/
08/   4c/   5d/   info/ pack/ 
kousu@ail:~/src/neuropoly/t$ file .git/objects/4c/9521dffe17f7d571a2cc683fb33440d8738072 
.git/objects/4c/9521dffe17f7d571a2cc683fb33440d8738072: zlib compressed data
kousu@ail:~/src/neuropoly/t$ ls -lh .git/objects/4c/9521dffe17f7d571a2cc683fb33440d8738072 
-r--r--r-- 1 kousu kousu 39 Apr 15 14:06 .git/objects/4c/9521dffe17f7d571a2cc683fb33440d8738072
kousu@ail:~/src/neuropoly/t$ stat .git/objects/4c/9521dffe17f7d571a2cc683fb33440d8738072 
  File: .git/objects/4c/9521dffe17f7d571a2cc683fb33440d8738072
  Size: 39          Blocks: 8          IO Block: 4096   regular file
Device: fd01h/64769d    Inode: 874548      Links: 1
Access: (0444/-r--r--r--)  Uid: ( 1000/   kousu)   Gid: ( 1000/   kousu)
Access: 2021-04-15 14:06:50.273530212 -0400
Modify: 2021-04-15 14:06:23.452367482 -0400
Change: 2021-04-15 14:06:23.453367078 -0400
 Birth: -
kousu@ail:~/src/neuropoly/t$ stat README.md 
  File: README.md
  Size: 20          Blocks: 8          IO Block: 4096   regular file
Device: fd01h/64769d    Inode: 874547      Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/   kousu)   Gid: ( 1000/   kousu)
Access: 2021-04-15 14:06:23.450368290 -0400
Modify: 2021-04-15 14:06:22.083920591 -0400
Change: 2021-04-15 14:06:22.102912914 -0400
 Birth: -
kousu@ail:~/src/neuropoly/t$ git config
usage: git config [<options>]

Config file location
    --global              use global config file
    --system              use system config file
    --local               use repository config file
    -f, --file <file>     use given config file
    --blob <blob-id>      read config from given blob object

Action
    --get                 get value: name [value-regex]
    --get-all             get all values: key [value-regex]
    --get-regexp          get values for regexp: name-regex [value-regex]
    --get-urlmatch        get value specific for the URL: section[.var] URL
    --replace-all         replace all matching variables: name value [value_regex]
    --add                 add a new variable: name value
    --unset               remove a variable: name [value-regex]
    --unset-all           remove all matches: name [value-regex]
    --rename-section      rename section: old-name new-name
    --remove-section      remove a section: name
    -l, --list            list all
    -e, --edit            open an editor
    --get-color           find the color configured: slot [default]
    --get-colorbool       find the color setting: slot [stdout-is-tty]

Type
    --bool                value is "true" or "false"
    --int                 value is decimal number
    --bool-or-int         value is --bool or --int
    --path                value is a path (file or directory name)
    --expiry-date         value is an expiry date

Other
    -z, --null            terminate values with NUL byte
    --name-only           show variable names only
    --includes            respect include directives on lookup
    --show-origin         show origin of config (file, standard input, blob, command line)

kousu@ail:~/src/neuropoly/t$ git config -l
user.email=nick@kousu.ca
user.name=Nick
push.default=simple
merge.ff=only
diff.gpg.textconv=gpg -d --no-tty
filter.lfs.smudge=git-lfs smudge -- %f
filter.lfs.process=git-lfs filter-process
filter.lfs.required=true
filter.lfs.clean=git-lfs clean -- %f
fetch.prune=true
fetch.prunetags=true
core.repositoryformatversion=0
core.filemode=true
core.bare=false
core.logallrefupdates=true
core.compression=0
core.loosecompression=0
packed.compression=0
pack.compression=0
pack.window=0

core.compression 0 didn't work??

Maybe more tips in this thread https://public-inbox.org/git/20100514051049.GF6075@coredump.intra.peff.net/

Is there a trick to getting git to simply "copy files as is"? In other words, don't attempt to compress them, don't attempt to "diff" them, just store/copy/transfer the files as-is?

Hopefully you can pick out the answer to that question from the above statements. :)

So I tried -delta and it got closer!

kousu@ail:~/src/neuropoly/t$ vi .gitattributes
kousu@ail:~/src/neuropoly/t$ git add .gitattributes 
kousu@ail:~/src/neuropoly/t$ git commit -m "attrs"
[master a83db84] attrs
 1 file changed, 1 insertion(+)
 create mode 100644 .gitattributes
kousu@ail:~/src/neuropoly/t$ cat .gitattributes 
*.txt -delta
kousu@ail:~/src/neuropoly/t$ vi lol.txt
kousu@ail:~/src/neuropoly/t$ git add lol.txt 
kousu@ail:~/src/neuropoly/t$ git commit -m "lol"
[master 7f4909e] lol
 1 file changed, 2 insertions(+)
 create mode 100644 lol.txt
kousu@ail:~/src/neuropoly/t$ git ^C
kousu@ail:~/src/neuropoly/t$ ls
lol.txt  README.md
kousu@ail:~/src/neuropoly/t$ git ls-tree HEAD
100644 blob 85e6910cb39f0a51e2fc52517d6b902e142a442e    .gitattributes
100644 blob 4c9521dffe17f7d571a2cc683fb33440d8738072    README.md
100644 blob e33ed36f613eba1484cff5c2f78b34c1ab88baaf    lol.txt
kousu@ail:~/src/neuropoly/t$ git cat-file -t e33ed36f613eba1484cff5c2f78b34c1ab88baaf
blob
kousu@ail:~/src/neuropoly/t$ git cat-file -p e33ed36f613eba1484cff5c2f78b34c1ab88baaf
 la la la lal
 stuff things pieces
kousu@ail:~/src/neuropoly/t$ git cat-file -p e33ed36f613eba1484cff5c2f78b34c1ab88baaf^C
kousu@ail:~/src/neuropoly/t$ ls .git/objects/e3/3ed36f613eba1484cff5c2f78b34c1ab88baaf 
.git/objects/e3/3ed36f613eba1484cff5c2f78b34c1ab88baaf
kousu@ail:~/src/neuropoly/t$ file .git/objects/e3/3ed36f613eba1484cff5c2f78b34c1ab88baaf 
.git/objects/e3/3ed36f613eba1484cff5c2f78b34c1ab88baaf: zlib compressed data
kousu@ail:~/src/neuropoly/t$ cat .git/objects/e3/3ed36f613eba1484cff5c2f78b34c1ab88baaf 
x+��blob 35 la la la lal
 stuff things pieces

It still shoved a little header on top though. Rude.

This was linked down in the thread; dead link but bless the wayback machine: http://web.archive.org/web/20110109112717/http://www.mentby.com/Group/git/how-to-prevent-git-from-compressing-certain-files.html

ah but it's just the other thread again. Drat.

Obligatory xkcd: https://xkcd.com/979/

kousu commented 3 years ago

Maybe git config core.bigFileThreshold 1?

No, didn't seem to work:

kousu@ail:~/src/neuropoly/t3$ git init
Initialized empty Git repository in /home/kousu/src/neuropoly/t3/.git/
kousu@ail:~/src/neuropoly/t3$ git config core.bigFileThreshold 1
kousu@ail:~/src/neuropoly/t3$ git ^C
kousu@ail:~/src/neuropoly/t3$ echo lololol > README.md
kousu@ail:~/src/neuropoly/t3$ git add README.md 
kousu@ail:~/src/neuropoly/t3$ git commit -m "lol"
[master (root-commit) c40fbf3] lol
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 README.md
kousu@ail:~/src/neuropoly/t3$ ls .git/objects/
13/   c4/   info/ pack/ 
kousu@ail:~/src/neuropoly/t3$ ls .git/objects/13/f3364cb4a34a8d6ab4681eb2288b1d01859811 
.git/objects/13/f3364cb4a34a8d6ab4681eb2288b1d01859811
kousu@ail:~/src/neuropoly/t3$ file .git/objects/13/f3364cb4a34a8d6ab4681eb2288b1d01859811 
.git/objects/13/f3364cb4a34a8d6ab4681eb2288b1d01859811: zlib compressed data
kousu@ail:~/src/neuropoly/t3$ cat .git/objects/13/f3364cb4a34a8d6ab4681eb2288b1d01859811 
x+)JMU06g040031rut�u��Ma�>��I��Խ��#K���{`�!E�kousu@ail:~/src/neuropoly/t3$ cat .git/objects/13/f3364cb4a34a8d6ab4681eb2288b1d01859811 ^C
kousu@ail:~/src/neuropoly/t3$ ^C
kousu@ail:~/src/neuropoly/t3$ ^C
kousu@ail:~/src/neuropoly/t3$ cat .git/objects/
13/   c4/   info/ pack/ 
kousu@ail:~/src/neuropoly/t3$ cat .git/objects/c4/0fbf3e0ede1edeb8cfb1b0619287b211d03e89 
x��Q
�0
  �a�{�\`#i�A��i온+h{��N������ʾ/���G�@�BdK��U樉�PNދ$����
                                                            ��V����b+
                                                                     ���Z����#P���nȈ�~����n+��ʬ,Xkousu@ail:~/src/neuropoly/t3$ 
kousu commented 3 years ago

More clues in https://git-scm.com/book/en/v2/Git-Internals-Packfiles

kousu commented 3 years ago

Oh, by the way: there's core.sparseCheckout, which looks like it might replace feature 2.ii, even for end-users. It's harder to use than 'git annex get files that i want', but I think we can assume it won't be done that often.
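
Roughly (untested, and the pattern syntax here is my assumption), the sparse-checkout analogue of git annex get sub-amu* would be:

git sparse-checkout init
git sparse-checkout set 'sub-amu*'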

kousu commented 3 years ago

I think the way to think about feature 4 is as a tradeoff between different kinds of compression.

There are four kinds of compression involved:

  1. the compression already baked into our files (.nii.gz, .mp3, ...)
  2. zlib compression, which git applies to every object it stores
  3. delta compression, which git applies when packing objects into packfiles
  4. "hardlink compression": sharing one copy of the bytes between the checkout and .git, the way annex.thin does

On our images, which are pre-compressed with .gz, zlib compression is useless, even counter-productive, but hardlinks save about half the space. The two are incompatible: the checkout can't just link directly to the stored object, because the stored bytes are a compressed, different version. And git uses zlib always: even if you set core.compression 0 it still wraps content in a zlib header.

So we would first need to patch git to add a +loose or +direct flag for specific files, which would get them stored uncompressed without a header.

Similarly, delta compression is incompatible with hardlink compression, because there's generally only one full copy of a delta-compressed file; the other versions are stored as diffs against it.

So we would need to disable delta compression, which should be doable with the -delta gitattribute. But I'm not clear how/when that attribute actually takes effect.

Maybe you could keep delta compression on and just hardlink to the base version, if it happened to be the version you were checking out. In our use case, our files don't change very often, and we generally only want to look at the most recent one anyway. Perhaps we could either:

a. always repack so that the base version is the most recent and the past ones are deltas off that; if you check out an older version you don't get the benefit of a hardlink, but in the common case you do
b. or, at git checkout $ref, do an implicit git repack --rebase-to $ref (a flag that would have to be invented) which combs through all the files and repacks them so that $ref is the base version; this will be slow, but probably not much slower than the status quo of unpacking every file every time, and again manageable on our workload because our files don't change that much

Supporting hardlinks can cause corruption though, and annex.thin warns about this too: if you make a series of commits to a file without pushing, the intermediates will be lost forever. Which is probably fine for us; we probably don't want to be publishing intermediate versions, since the files are so large.

kousu commented 3 years ago

Here's a completely different approach to accomplishing goal 4 (emulating annex.thin):

instead of checking out the files, mount them:

Any of these should avoid having to physically copy anything. They probably decompress the zlib content on the fly. Maybe not so good for contributions, but for read-only uses, like testing in CI or just doing a processing run? This could save a lot of time and space.

Moreover, I wonder if we could fork one of these so that they mount appearing to be a regular git folder, but featuring:

kousu commented 2 years ago

Pinning this for later: it's possible, and not even that hard, to write alternate git backends, e.g. https://github.com/anishathalye/git-remote-dropbox/blob/01b630ab697d9b9423915e88e43dd24072e0d591/git_remote_dropbox/helper.py#L92 adds dropbox:// to git, so you can git clone dropbox://my-repos/project1.

We could probably exploit this if we wanted to, say, retrofit annex's S3 remote into plain git. We could still have the benefit of a major CDN backing the bulk of our data without all the complication of tracking it all with git-annex.

kousu commented 2 years ago

Lessons 'bout partial clones:

The trick is:

git clone --filter=blob:none --no-checkout ... && cd repo && git checkout master -- paths/to check/ out/

--filter=blob:none means nothing is downloaded until specifically requested by a checkout, and --no-checkout ensures that the first checkout doesn't happen

to see what folders you could download (analogous to looking for dangling symlinks in git-annex <=v7, or for files with SHA256 hashes in them in git-annex v8), use ls-tree:

git ls-tree master -- ./
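
e.g. after eyeballing the ls-tree output, pulling down one subject (hypothetical path) is just:

git checkout master -- sub-amu01/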

kousu commented 2 years ago

The trick above has a performance problem, explored in https://stackoverflow.com/questions/600079/how-do-i-clone-a-subdirectory-only-of-a-git-repository/52269934#52269934, https://github.com/isaacs/github/issues/1888: it successfully does a partial download, but it doesn't batch the objects it does download, instead making a new connection for each one:

git@data:~$ time ( git clone --filter=blob:none --no-checkout https://github.com/cirosantilli/test-git-partial-clone-big-small &&   cd test-git-partial-clone-big-small &&   git checkout master -- small )
Cloning into 'test-git-partial-clone-big-small'...
remote: Enumerating objects: 4, done.
remote: Total 4 (delta 0), reused 0 (delta 0), pack-reused 4
Receiving objects: 100% (4/4), 10.02 KiB | 10.02 MiB/s, done.
remote: Enumerating objects: 1, done.
remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 1
Receiving objects: 100% (1/1), 42 bytes | 42.00 KiB/s, done.
remote: Enumerating objects: 1, done.
remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 1
[...]
remote: Enumerating objects: 1, done.
remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 1
Receiving objects: 100% (1/1), 42 bytes | 6.00 KiB/s, done.

real    1m54.662s
user    0m0.180s
sys 0m0.358s

(full log.txt)

The git developers want to push people towards using --sparse: https://github.com/isaacs/github/issues/1888#issuecomment-760484623

That "extra state" allows Git to do things like batch object requests in a partial clone in a sane way.

Sparse-checkout and partial clone are actively being developed to work more closely together, so you're more likely to have success in that direction.

And indeed if we try that the download is much faster:

git@data:~$ time ( git clone --filter=blob:none --sparse https://github.com/cirosantilli/test-git-partial-clone-big-small &&   cd test-git-partial-clone-big-small &&   git sparse-checkout add small )
Cloning into 'test-git-partial-clone-big-small'...
remote: Enumerating objects: 4, done.
remote: Total 4 (delta 0), reused 0 (delta 0), pack-reused 4
Receiving objects: 100% (4/4), 10.02 KiB | 10.02 MiB/s, done.
remote: Enumerating objects: 1, done.
remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 1
Receiving objects: 100% (1/1), 582 bytes | 582.00 KiB/s, done.
remote: Enumerating objects: 253, done.
remote: Total 253 (delta 0), reused 0 (delta 0), pack-reused 253
Receiving objects: 100% (253/253), 2.50 KiB | 2.50 MiB/s, done.

real    0m1.523s
user    0m0.065s
sys 0m0.045s
git@data:~$ ls test-git-partial-clone-big-small/
generate.sh  small
git@data:~$ du -hs test-git-partial-clone-big-small/
4.3M    test-git-partial-clone-big-small/
kousu commented 2 years ago

There's a catch: clone --sparse and sparse-checkout add weren't added until at least git 2.27. Ubuntu 20.04 LTS, which is on most of our internal machines, is only at 2.25.

I have a workaround: you can write git 2.27's

git clone --filter=blob:none --sparse https://github.com/cirosantilli/test-git-partial-clone-big-small && \
  cd test-git-partial-clone-big-small && \
  git sparse-checkout add small

as git 2.25's

git clone --filter=blob:none --no-checkout https://github.com/cirosantilli/test-git-partial-clone-big-small && \
  cd test-git-partial-clone-big-small && \
  git sparse-checkout init && \
  ( git sparse-checkout list; echo small) | git sparse-checkout set --stdin

Unfortunately the 2.25 version does not work on 2.27: it doesn't fill in any files. I'm trying to figure out if there's an extra git command or two we could add to make a version that works on both.

This version works on both, but it's much more verbose:

mkdir test-git-partial-clone-big-small && \
 cd test-git-partial-clone-big-small && \
 git init && \
 git remote add origin https://github.com/cirosantilli/test-git-partial-clone-big-small && \
 git config remote.origin.promisor true && \
 git config remote.origin.partialclonefilter blob:none && \
  git sparse-checkout init && \
  echo This next step is 'git sparse-checkout add small' implemented on an older git && \
  ( echo small >> .git/info/sparse-checkout ) && \
  git pull origin master

and this version suffers from lacking sparse-checkout add; it relies on git pull to trigger actually downloading the files, and I can't figure out how to retrigger it once the pull has happened once:

u108545@joplin:~/plain-git/test-git-partial-clone-big-small/test-git-partial-clone-big-small$ echo large >> .git/info/sparse-checkout
u108545@joplin:~/plain-git/test-git-partial-clone-big-small/test-git-partial-clone-big-small$ git sparse-checkout list
/*
!/*/
small
large
u108545@joplin:~/plain-git/test-git-partial-clone-big-small/test-git-partial-clone-big-small$ git reset --hard master
HEAD is now at b817293 0
u108545@joplin:~/plain-git/test-git-partial-clone-big-small/test-git-partial-clone-big-small$ ls
generate.sh  small
u108545@joplin:~/plain-git/test-git-partial-clone-big-small/test-git-partial-clone-big-small$ git sparse-checkout list | git sparse-checkout set --stdin
u108545@joplin:~/plain-git/test-git-partial-clone-big-small/test-git-partial-clone-big-small$ ls
generate.sh  small
u108545@joplin:~/plain-git/test-git-partial-clone-big-small/test-git-partial-clone-big-small$ git pull origin master
From https://github.com/cirosantilli/test-git-partial-clone-big-small
 * branch            master     -> FETCH_HEAD
Already up to date.
u108545@joplin:~/plain-git/test-git-partial-clone-big-small/test-git-partial-clone-big-small$ ls
generate.sh  small
u108545@joplin:~/plain-git/test-git-partial-clone-big-small/test-git-partial-clone-big-small$ git checkout -b other
Switched to a new branch 'other'
u108545@joplin:~/plain-git/test-git-partial-clone-big-small/test-git-partial-clone-big-small$ git checkout master
Switched to branch 'master'
u108545@joplin:~/plain-git/test-git-partial-clone-big-small/test-git-partial-clone-big-small$ ls
generate.sh  small
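
Possibly the missing retrigger is git sparse-checkout reapply (added around 2.27, I believe, so no help to 2.25; untested here):

# re-apply the sparseness patterns after editing .git/info/sparse-checkout by hand
git sparse-checkout reapply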

Another problem with 2.25 is that a simple 'git status' will trigger a download of missing .git/objects/ files (in both versions git log --stat or git log -p will also trigger such a download, but that's more understandable: they need to observe the contents of the old versions to produce diffs/diffstats). A third problem is that 2.25 is missing sparse-checkout --cone, which makes the sparseness select by folders instead of by individual regexes on files; we probably want that, because it's a big performance drag to run regexes over every folder. So I'm thinking git 2.25 is a dead end.

So we have two options:

kousu commented 2 years ago

Before I forget: the difference between sparse-checkout and --filter is:

but the relationship is that --filter is overridden by objects needed for the current checkout, i.e. by the current contents of the working directory.

Thus:

By default git downloads the entire git log and every version of every file to .git/objects, and copies the ones needed for master to the working directory.

With --filter it only downloads the .git/objects needed for master.

With --sparse --filter it only downloads the .git/objects in the root directory of master, then lazily adds anything requested with git sparse-checkout add.

So a partial clone equivalent to git annex get paths/ to/ download/ must use --sparse --filter in tandem. But that's fine, that's still less complicated than installing an entire extra app.
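
Putting the pieces together, the whole recipe (assuming git >= 2.27 or so, and the same placeholder remote as earlier):

# lazy clone, then materialize just the paths we care about
git clone --filter=blob:none --sparse git@whatever:my-dataset
cd my-dataset
git sparse-checkout add paths/ to/ download/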

kousu commented 2 years ago

Ah what's this, on 2.25:

u108545@joplin:~/plain-git$ git clone --filter=blob:none --sparse https://github.com/cirosantilli/test-git-partial-clone-big-small
Cloning into 'test-git-partial-clone-big-small'...
fatal: cannot change to 'https://github.com/cirosantilli/test-git-partial-clone-big-small': No such file or directory
error: failed to initialize sparse-checkout

2.25's manpage doesn't document clone --sparse, but it looks like it's trying to do it, and just... failing? Maybe there's a way to get it to work after all.

But I dug into the code between 2.25 and the version that works and found

-   if (option_sparse_checkout && git_sparse_checkout_init(repo))
+   if (option_sparse_checkout && git_sparse_checkout_init(dir))
        return 1;

so it was just an obvious bug, an oversight. And I don't think any amount of command line trickery is going to make repo = dir, not without breaking the rest of the clone anyway.

kousu commented 2 years ago

I should also throw in that:

clone --depth $n (+ fetch --deepen) is another feature that can be used in tandem. It's, unfortunately, not orthogonal to --sparse nor --filter: it is a different kind of partial clone, one that clones only the .git/objects/ needed to get $n commits deep into history. My instinct is that --filter is the more general solution, but --depth will prevent accidental unintended downloads (e.g. caused by running git log --stat on the complete history), and it's older and maybe better supported; anyway, we can use all three flags at once to absolutely minimize the necessary bandwidth and storage.
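
i.e. (with the same placeholder remote as before):

# shallow + blob-less + sparse, all at once
git clone --depth 1 --filter=blob:none --sparse git@whatever:my-dataset
cd my-dataset
git sparse-checkout add sub-amu01/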

kousu commented 2 years ago

To emphasize the point from my earlier comment about alternate git backends (the dropbox:// remote helper): that is a way to achieve goal 3, distribution through a CDN. I wrote goal 3 originally as "mix and match servers", but that's not the real goal; the real goal is to cut distribution costs, and after a year of dealing with it I think that mixing and matching servers the way datalad/git-annex encourage is just a recipe for making everything extremely confusing and broken in the long run.

I think what we should do is find or write git-remote-s3, so you can do git remote add s3 s3://my-bucket-name, upload the entire repo there, trees and commits and all, and then that can be the primary mirror people get our dataset from. Installing and configuring git-remote-s3 won't be any harder than installing and configuring git-annex + datalad + rclone, or git-annex + datalad + awscli.
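
For reference, a remote helper is just an executable named git-remote-<scheme> on $PATH that speaks a simple line protocol on stdin/stdout (see gitremote-helpers(7)). A do-nothing skeleton, to show the shape rather than a working S3 helper:

#!/bin/sh
# skeleton for a hypothetical git-remote-s3; git runs this for s3:// URLs,
# passing the remote name and URL as arguments, then issues commands on stdin
while read -r cmd rest; do
  case "$cmd" in
    capabilities)
      printf 'fetch\npush\n\n'   # advertise support; a real helper must implement these
      ;;
    list)
      printf '\n'                # a real helper would print "<sha> <refname>" lines from the bucket
      ;;
    '')
      exit 0                     # blank line: git is done talking to us
      ;;
  esac
done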

kousu commented 2 years ago

TODO: @jcohenadad wants me to actually do a practical test on git@data.neuro.polymtl.ca:datasets/sci-zurich.

I think I will try to make a copy (sci-zurich-git?) and rebase it to have the same history just without the annex.
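
A sketch of how that conversion might go (untested; git annex uninit is the documented way to de-annex a working tree, but it only fixes the tip, so "same history" would additionally need something like git filter-repo with a callback that resolves each annex pointer):

# copy the dataset and pull all annexed content into it
git clone git@data.neuro.polymtl.ca:datasets/sci-zurich sci-zurich-git
cd sci-zurich-git
git annex get .
# replace annex symlinks with real content and drop the annex machinery (tip only)
git annex uninit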

kousu commented 2 years ago

Some more motivation for doing this: a report from a collaborator today that

For 3 weeks, I have been urging the server admins to install git-annex but without any response.

git is more commonly known, so, there you go.

kousu commented 8 months ago

I found some apps today that implement git-as-virtual-filesystem (writeup).

These could be used to implement reproducible pipelines by, as a first step, mounting them and then embedding the mount path (i.e. including the commit ID) into the pipeline; that has the big advantage for reproducibility that every run pins the exact version of the data it used, and the other big advantage that, if combined with sshfs or maybe webdav, you should be able to lazily download data, which avoids the entire headache of using git-annex or git-lfs.