Open milahu opened 6 months ago
same problem when deleting files with git-filter-repo
when i delete files with git-filter-repo,
the next git push --force
fails to deduplicate large blob objects
example
$ du -sh shards/
6.5G shards/
$ git filter-repo --force --refs main --invert-paths --path shards/95xxxxx/ --path shards/96xxxxx/ --path shards/97xxxxx/
$ du -sh shards/
744M shards/
$ git push --force
Enumerating objects: 174, done.
Counting objects: 100% (174/174), done.
Delta compression using up to 4 threads
Compressing objects: 100% (42/42), done.
Writing objects: 100% (172/172), 743.50 MiB | 1.86 MiB/s, done.
Total 172 (delta 90), reused 126 (delta 88), pack-reused 0
remote: Resolving deltas: 100% (90/90), completed with 1 local object.
To https://github.com/milahu/opensubtitles-scraper-new-subs
+ cf1a894...082206e main -> main (forced update)
this is wrong, because the remote repo already has all the blob objects.
git should write only a few bytes for the new tree and commit objects, but it writes 743.50 MiB:
Writing objects: 100% (172/172), 743.50 MiB | 1.86 MiB/s, done.
"git mv" followed by "git push" or "git pull" can produce wasteful transfers of blob objects
these transfers are wasteful because the blob object already exists in the destination repo, but "git push" or "git pull" fail to see that
this affects only some cases of "git mv": in some cases, the deduplication works as expected, in other cases, dedup fails
this is negligible for small files, but noticeable with large files
in my case, i tried to move 5GB of files (250 x 20MB), and i was surprised when "git push" wanted to transfer 5GB instead of a few bytes for the tree and commit objects
to reproduce: see repro.sh in https://github.com/milahu/git-bug-git-mv-wasteful-transfer
output of repro.sh
the first size is the transfer size before "git mv"
the second size is the transfer size after "git mv"
pass: 1.00 MiB != 288 bytes # path_a=file_a; path_b=file_b
pass: 1.00 MiB != 331 bytes # path_a=dir/file_a; path_b=dir/file_b
pass: 1.00 MiB != 286 bytes # path_a=dir_a/file; path_b=dir_b/file
pass: 1.00 MiB != 284 bytes # path_a=file; path_b=dir/file
pass: 1.00 MiB != 329 bytes # path_a=file; path_b=dir1/dir2/file
pass: 1.00 MiB != 373 bytes # path_a=file; path_b=dir1/dir2/dir3/file
pass: 1.00 MiB != 331 bytes # path_a=file_a; path_b=dir/file_b
pass: 1.00 MiB != 376 bytes # path_a=file_a; path_b=dir1/dir2/file_b
pass: 1.00 MiB != 420 bytes # path_a=file_a; path_b=dir1/dir2/dir3/file_b
pass: 1.00 MiB != 241 bytes # path_a=dir/file; path_b=file
FAIL: 1.00 MiB == 1.00 MiB # path_a=dir1/dir2/file; path_b=file
FAIL: 1.00 MiB == 1.00 MiB # path_a=dir1/dir2/dir3/file; path_b=file
FAIL: 1.00 MiB == 1.00 MiB # path_a=dir/file_a; path_b=file_b
FAIL: 1.00 MiB == 1.00 MiB # path_a=dir1/dir2/file_a; path_b=file_b
FAIL: 1.00 MiB == 1.00 MiB # path_a=dir1/dir2/dir3/file_a; path_b=file_b
FAIL: 1.00 MiB == 1.00 MiB # path_a=dir1a/dir2a/file; path_b=dir1b/dir2b/file
FAIL: 1.00 MiB == 1.00 MiB # path_a=dir1a/file_a; path_b=dir1b/file_b
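a rough sketch of one way to measure such a before/after size without a remote. this is not the repro.sh from the linked repo, only an approximation. it assumes that calling "git pack-objects --revs --thin --stdout" by hand is close enough to the pack that "git push" builds. a real push takes the excluded commits from the refs the remote advertises, while here the parent commit is excluded by hand. the path pair is one of the FAIL cases above, and size_before and size_after are illustrative names.

```shell
#!/bin/sh
# sketch: approximate the pack size of a push before and after "git mv"
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q .
# throwaway identity, so the commits work without any global git config
git config user.name demo
git config user.email demo@example.invalid

mkdir -p dir1/dir2
# 1 MiB of random data, so compression cannot hide the blob
head -c 1048576 /dev/urandom > dir1/dir2/file
git add dir1/dir2/file
git commit -q -m "add file"

git mv dir1/dir2/file file
git commit -q -m "move file"

# pack for the first commit alone: the receiver has nothing yet
size_before=$(echo "HEAD~1" |
  git pack-objects --revs --thin --stdout -q | wc -c | tr -d ' ')

# pack for the second commit, excluding everything reachable from the first
# assumption: this mirrors what push would send to a remote that has HEAD~1
size_after=$(printf 'HEAD\n^HEAD~1\n' |
  git pack-objects --revs --thin --stdout -q | wc -c | tr -d ' ')

echo "before git mv: $size_before bytes"
echo "after git mv:  $size_after bytes"
```

if the second number is close to the first, the blob was packed again even though the excluded commit already contains it. the result may differ between git versions and configurations, so no expected output is given here.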
see also
https://colabti.org/ircloggy/git/2024-02-24#l704
https://colabti.org/ircloggy/git/2024-02-25#l218
https://colabti.org/ircloggy/git/2024-02-25#l269
https://colabti.org/ircloggy/git/2024-02-25#l404
https://colabti.org/ircloggy/git/2024-02-25#l433
https://stackoverflow.com/questions/48228425/git-push-new-branch-with-same-files-uploads-all-files-again
From: Milan Hauth milahu@gmail.com
To: git@vger.kernel.org
Subject: git mv fails to deduplicate blob objects on transfer
Date: Sun, 25 Feb 2024 19:34:36 +0100
Message-ID: <CAGiEHCub4H7ZCV3CqfFaCRTOhN5A=qy7G_p1pVQw_puyAgjM8w@mail.gmail.com>