milahu / git-bug-git-mv-wasteful-transfer

MIT License

git mv fails to deduplicate blob objects on transfer #2

Open milahu opened 6 months ago

milahu commented 6 months ago

"git mv" followed by "git push" or "git pull" can produce wasteful transfers of blob objects

these transfers are wasteful because the blob object already exists in the destination repo, but "git push" or "git pull" fails to see that

this affects only some cases of "git mv": in some cases the deduplication works as expected, in other cases dedup fails

this is negligible for small files, but noticeable with large files

in my case, i tried to move 5GB of files (250 x 20MB), and i was surprised that "git push" wanted to transfer 5GB instead of a few bytes for the tree and commit objects

to reproduce: see repro.sh in https://github.com/milahu/git-bug-git-mv-wasteful-transfer

output of repro.sh

the first size is the transfer size before "git mv"; the second size is the transfer size after "git mv"

pass: 1.00 MiB != 288 bytes # path_a=file_a; path_b=file_b
pass: 1.00 MiB != 331 bytes # path_a=dir/file_a; path_b=dir/file_b
pass: 1.00 MiB != 286 bytes # path_a=dir_a/file; path_b=dir_b/file
pass: 1.00 MiB != 284 bytes # path_a=file; path_b=dir/file
pass: 1.00 MiB != 329 bytes # path_a=file; path_b=dir1/dir2/file
pass: 1.00 MiB != 373 bytes # path_a=file; path_b=dir1/dir2/dir3/file
pass: 1.00 MiB != 331 bytes # path_a=file_a; path_b=dir/file_b
pass: 1.00 MiB != 376 bytes # path_a=file_a; path_b=dir1/dir2/file_b
pass: 1.00 MiB != 420 bytes # path_a=file_a; path_b=dir1/dir2/dir3/file_b
pass: 1.00 MiB != 241 bytes # path_a=dir/file; path_b=file
FAIL: 1.00 MiB == 1.00 MiB # path_a=dir1/dir2/file; path_b=file
FAIL: 1.00 MiB == 1.00 MiB # path_a=dir1/dir2/dir3/file; path_b=file
FAIL: 1.00 MiB == 1.00 MiB # path_a=dir/file_a; path_b=file_b
FAIL: 1.00 MiB == 1.00 MiB # path_a=dir1/dir2/file_a; path_b=file_b
FAIL: 1.00 MiB == 1.00 MiB # path_a=dir1/dir2/dir3/file_a; path_b=file_b
FAIL: 1.00 MiB == 1.00 MiB # path_a=dir1a/dir2a/file; path_b=dir1b/dir2b/file
FAIL: 1.00 MiB == 1.00 MiB # path_a=dir1a/file_a; path_b=dir1b/file_b
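the key point behind these results can be checked directly (this is a minimal sketch, not the repro.sh linked above; repo and path names are illustrative): "git mv" rewrites tree objects but leaves the blob object untouched, so after the move the remote already has the 1 MiB blob, and only a few new tree and commit objects would actually be missing

```shell
#!/bin/sh
# sketch: show that "git mv" does not change the blob's object id,
# using one of the FAIL cases above (dir1/dir2/file -> file)
set -e
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
cd "$tmp"
git init -q repo && cd repo
git config user.email you@example.com
git config user.name you
mkdir -p dir1/dir2
head -c 1048576 /dev/zero > dir1/dir2/file          # 1 MiB blob
git add -A && git commit -qm 'add dir1/dir2/file'
before=$(git rev-parse HEAD:dir1/dir2/file)         # blob id at old path
git mv dir1/dir2/file file
git commit -qm 'mv dir1/dir2/file file'
after=$(git rev-parse HEAD:file)                    # blob id at new path
[ "$before" = "$after" ] && echo "same blob: $before"
```

since the blob id is identical, any remote that has the pre-move commit already stores the blob's data, which is why re-sending 1 MiB is pure waste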

see also

https://colabti.org/ircloggy/git/2024-02-24#l704

https://colabti.org/ircloggy/git/2024-02-25#l218

https://colabti.org/ircloggy/git/2024-02-25#l269

I have a strong déjà vu about this also; I think we talked about this exact thing a while ago

this, same: https://colabti.org/ircloggy/git/2023-09-13#l912

https://colabti.org/ircloggy/git/2024-02-25#l404

https://colabti.org/ircloggy/git/2024-02-25#l433

reading that SO answer by jthill but can't quite get the whole picture from it -- is it saying that it's a trade-off in sending all the objects vs. spending resources trying to figure out what to send?

I bet there's an opportunity for optimization here; Git could probably figure out a good balance based on how much data it is about to send

https://stackoverflow.com/questions/48228425/git-push-new-branch-with-same-files-uploads-all-files-again
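related to that trade-off question: the size of the pack that "git push" would build can be measured locally with the same plumbing push uses, "git pack-objects --revs --thin". the sketch below (repo setup and the "base" branch are illustrative; "base" stands in for the ref the remote is assumed to have) prints the pack size without actually pushing

```shell
#!/bin/sh
# sketch: measure how many bytes a push would send, assuming the
# remote is at branch "base"
set -e
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
cd "$tmp"
git init -q repo && cd repo
git config user.email you@example.com
git config user.name you
mkdir -p dir1/dir2
head -c 1048576 /dev/urandom > dir1/dir2/file       # 1 MiB blob
git add -A && git commit -qm 'add file'
git branch base                                     # pretend the remote is here
git mv dir1/dir2/file file                          # a FAIL case from the repro
git commit -qm 'mv file'
# pack everything reachable from HEAD but not from base, as a thin pack
bytes=$(printf '%s\n' '^base' 'HEAD' | git pack-objects --revs --thin --stdout -q | wc -c)
echo "pack size: $bytes bytes"
```

a pack size near 1 MiB here would confirm that the blob gets re-sent even though "base" already reaches it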


From: Milan Hauth milahu@gmail.com
To: git@vger.kernel.org
Subject: git mv fails to deduplicate blob objects on transfer
Date: Sun, 25 Feb 2024 19:34:36 +0100
Message-ID: <CAGiEHCub4H7ZCV3CqfFaCRTOhN5A=qy7G_p1pVQw_puyAgjM8w@mail.gmail.com>

milahu commented 6 months ago

same problem when deleting files with git-filter-repo

when i delete files with git-filter-repo, then on the next "git push --force", git fails to deduplicate large blob objects

example

$ du -sh shards/
6.5G    shards/

$ git filter-repo --force --refs main --invert-paths --path shards/95xxxxx/ --path shards/96xxxxx/ --path shards/97xxxxx/

$ du -sh shards/
744M    shards/

$ git push --force
Enumerating objects: 174, done.
Counting objects: 100% (174/174), done.
Delta compression using up to 4 threads
Compressing objects: 100% (42/42), done.
Writing objects: 100% (172/172), 743.50 MiB | 1.86 MiB/s, done.
Total 172 (delta 90), reused 126 (delta 88), pack-reused 0
remote: Resolving deltas: 100% (90/90), completed with 1 local object.
To https://github.com/milahu/opensubtitles-scraper-new-subs
 + cf1a894...082206e main -> main (forced update)

this is wrong, because the remote repo already has all the blob objects; git should write only a few bytes for the new tree and commit objects
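that claim can be verified by comparing the blob sets of the old and new tips (a sketch; "old" here stands in for the old remote tip like cf1a894 above, and the repo contents are illustrative). if a history rewrite only deletes paths, the new tip introduces no blobs the remote lacks

```shell
#!/bin/sh
# sketch: list blobs reachable from the new tip but not from the old tip;
# empty output means the remote needs no blob data at all
set -e
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
cd "$tmp"
git init -q repo && cd repo
git config user.email you@example.com
git config user.name you
mkdir -p shards/95 shards/96
echo data95 > shards/95/f
echo data96 > shards/96/f
git add -A && git commit -qm 'old tip'
git branch old
git rm -r -q shards/95                              # like git-filter-repo dropping a path
git commit -qm 'new tip'
git ls-tree -r old  | awk '{print $3}' | sort -u > "$tmp/old.txt"
git ls-tree -r HEAD | awk '{print $3}' | sort -u > "$tmp/new.txt"
missing=$(comm -13 "$tmp/old.txt" "$tmp/new.txt")   # blobs only in the new tip
echo "blobs the old tip lacks: '$missing'"
```

an empty result means the push above should only have needed the rewritten tree and commit objects, not 743.50 MiB of blob data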

Writing objects: 100% (172/172), 743.50 MiB | 1.86 MiB/s, done.