gitgitgadget / git

GitGitGadget's Git fork. Open Pull Requests here to submit them to the Git mailing list
https://gitgitgadget.github.io/

PATH WALK II: Add --path-walk option to 'git pack-objects' #1819

Open derrickstolee opened 1 month ago

derrickstolee commented 1 month ago

WIP

Thanks, -Stolee

Here is the range-diff from the equivalent patches in the previous version:

```
 1:  3fdb57edbc5 !  1:  6365d8148c4 pack-objects: extract should_attempt_deltas()
    @@ Metadata
      ## Commit message ##
         pack-objects: extract should_attempt_deltas()
    -    This will be helpful in a future change that introduces a new way to
    -    compute deltas.
    -
    -    Be careful to preserve the nr_deltas counting logic in the existing
    -    method, but take the rest of the logic wholesale.
    +    This will be helpful in a future change.
         Signed-off-by: Derrick Stolee
 2:  a0475c7cba8 =  2:  0a4c2c2da0e pack-objects: add --path-walk option
 3:  73c8b61e87b =  3:  df74fe9a35b pack-objects: update usage to match docs
 4:  21dc3723c36 !  4:  4e08947ea0a p5313: add performance tests for --path-walk
    @@ Commit message
         Running on my copy of the Git repository results in this data:
    -    Test                                     this tree
    -    ---------------------------------------------------------
    -    5313.2: thin pack                        0.01(0.00+0.00)
    -    5313.3: thin pack size                   1.1K
    -    5313.4: thin pack with --path-walk       0.01(0.01+0.00)
    -    5313.5: thin pack size with --path-walk  1.1K
    -    5313.6: big pack                         2.52(6.59+0.38)
    -    5313.7: big pack size                    14.1M
    -    5313.8: big pack with --path-walk        4.90(5.76+0.26)
    +    Test                                     HEAD
    +    --------------------------------------------------------------
    +    5313.2: thin pack                        0.00(0.00+0.00)
    +    5313.3: thin pack size                   589
    +    5313.4: thin pack with --path-walk       0.00(0.00+0.00)
    +    5313.5: thin pack size with --path-walk  589
    +    5313.6: big pack                         2.76(7.19+0.27)
    +    5313.7: big pack size                    14.0M
    +    5313.8: big pack with --path-walk        5.76(6.72+0.16)
         5313.9: big pack size with --path-walk   13.2M
         Note that the timing is slower because there is no threading in the
    @@ Commit message
         Running the tests on this repo results in the following output:
    -    Test                                     this tree
    -    ----------------------------------------------------------
    -    5313.2: thin pack                        0.28(0.38+0.02)
    +    Test                                     HEAD
    +    --------------------------------------------------------------
    +    5313.2: thin pack                        0.28(0.40+0.03)
         5313.3: thin pack size                   1.2M
    -    5313.4: thin pack with --path-walk       0.08(0.06+0.01)
    +    5313.4: thin pack with --path-walk       0.07(0.06+0.00)
         5313.5: thin pack size with --path-walk  18.4K
    -    5313.6: big pack                         4.05(29.62+0.43)
    -    5313.7: big pack size                    20.0M
    -    5313.8: big pack with --path-walk        5.99(9.06+0.24)
    -    5313.9: big pack size with --path-walk   16.4M
    +    5313.6: big pack                         4.05(29.48+0.41)
    +    5313.7: big pack size                    19.7M
    +    5313.8: big pack with --path-walk        6.01(9.17+0.20)
    +    5313.9: big pack size with --path-walk   16.5M
         Notice in particular that in the small thin pack, the time performance
         has improved from 0.28s to 0.08s and this is likely due to the improved
    @@ Commit message
         Finally, running this on a copy of the Linux kernel repository results
         in these data points:
    -    Test                                     this tree
    -    -----------------------------------------------------------
    -    5313.2: thin pack                        0.00(0.00+0.00)
    -    5313.3: thin pack size                   5.8K
    -    5313.4: thin pack with --path-walk       0.00(0.01+0.00)
    -    5313.5: thin pack size with --path-walk  5.8K
    -    5313.6: big pack                         24.39(65.81+1.31)
    -    5313.7: big pack size                    155.7M
    -    5313.8: big pack with --path-walk        41.07(60.69+0.68)
    -    5313.9: big pack size with --path-walk   150.8M
    +    Test                                     HEAD
    +    --------------------------------------------------------------
    +    5313.2: thin pack                        0.03(0.02+0.00)
    +    5313.3: thin pack size                   4.6K
    +    5313.4: thin pack with --path-walk       0.03(0.01+0.01)
    +    5313.5: thin pack size with --path-walk  4.6K
    +    5313.6: big pack                         21.06(60.57+1.45)
    +    5313.7: big pack size                    158.3M
    +    5313.8: big pack with --path-walk        37.65(57.83+0.67)
    +    5313.9: big pack size with --path-walk   152.3M
         Signed-off-by: Derrick Stolee
 5:  6f96b1c227a =  5:  9235f9bb9c8 pack-objects: introduce GIT_TEST_PACK_PATH_WALK
 -:  ----------- >  6:  201b2210712 t5538: add tests to confirm deltas in shallow pushes
 6:  834c9ea2709 !  7:  5caf28ec6c7 repack: add --path-walk option
    @@ Commit message
         through in 'git repack'. This presents interesting testing opportunities for
         comparing the different repacking strategies against each other.
    -    In my copy of the Git repository, the new tests in p5313 show these
    -    results:
    +    Add the --path-walk option to the performance tests in p5313.
    +
    +    For the microsoft/fluentui repo [1] checked out at a specific commit [2],
    +    the results are very interesting:
    +
    +    Test                                     HEAD
    +    ----------------------------------------------------------
    +    5313.2: thin pack                        0.41(0.48+0.03)
    +    5313.3: thin pack size                   1.2M
    +    5313.4: thin pack with --path-walk       0.08(0.05+0.02)
    +    5313.5: thin pack size with --path-walk  18.4K
    +    5313.6: big pack                         4.47(30.62+0.40)
    +    5313.7: big pack size                    19.6M
    +    5313.8: big pack with --path-walk        6.76(9.87+0.23)
    +    5313.9: big pack size with --path-walk   16.5M
    +    5313.10: repack                          96.87(664.29+2.75)
    +    5313.11: repack size                     439.5M
    +    5313.12: repack with --path-walk         95.68(109.90+0.92)
    +    5313.13: repack size with --path-walk    122.6M
    -    Test                                     this tree
    -    -------------------------------------------------------------
    -    5313.10: repack                          27.88(150.23+2.70)
    -    5313.11: repack size                     228.2M
    -    5313.12: repack with --path-walk         134.59(148.77+0.81)
    -    5313.13: repack size with --path-walk    209.7M
    +    [1] https://github.com/microsoft/fluentui
    +    [2] e70848ebac1cd720875bccaa3026f4a9ed700e08
    -    Note that the 'git pack-objects --path-walk' feature is not integrated
    -    with threads. Look forward to a future change that will introduce
    -    threading to improve the time performance of this feature with
    -    equivalent space performance.
    +    This repo suffers from having a lot of paths that collide in the name
    +    hash, so examining them in groups by path leads to better deltas. Also,
    +    in this case, the single-threaded implementation is competitive with the
    +    full repack. This is saving time diffing files that have significant
    +    differences from each other.
    -    For the microsoft/fluentui repo [1] had some interesting aspects for the
    -    previous tests in p5313, so here are the repack results:
    +    A similar, but private, repo has even more extremes during repacking:
    -    Test                                     this tree
    -    -------------------------------------------------------------
    -    5313.10: repack                          91.76(680.94+2.48)
    -    5313.11: repack size                     439.1M
    -    5313.12: repack with --path-walk         110.35(130.46+0.74)
    -    5313.13: repack size with --path-walk    155.3M
    +    Test                                     HEAD
    +    -----------------------------------------------------------------
    +    5313.10: repack                          2138.22(11961.00+17.67)
    +    5313.11: repack size                     6.4G
    +    5313.12: repack with --path-walk         1351.46(1418.28+3.96)
    +    5313.13: repack size with --path-walk    804.1M
    -    [1] https://github.com/microsoft/fluentui
    +    There are small benefits in size for my copy of the Git repository:
    -    Here, we see the significant improvement of a full repack using this
    -    strategy. The name-hash collisions in this repo cause the space
    -    problems. Those collisions also cause the repack command to spend a lot
    -    of cycles trying to find delta bases among files that are not actually
    -    very similar, so the lack of threading with the --path-walk feature is
    -    less pronounced in the process time.
    +    Test                                     HEAD
    +    -----------------------------------------------------------
    +    5313.10: repack                          22.11(98.37+1.64)
    +    5313.11: repack size                     126.4M
    +    5313.12: repack with --path-walk         66.89(75.61+0.58)
    +    5313.13: repack size with --path-walk    109.6M
    -    For the Linux kernel repository, we have these stats:
    +    As well as in the nodejs/node repository [3]:
    -    Test                                     this tree
    +    Test                                     HEAD
    +    -------------------------------------------------------------
    +    5313.2: thin pack                        0.01(0.01+0.00)
    +    5313.3: thin pack size                   1.6K
    +    5313.4: thin pack with --path-walk       0.02(0.01+0.00)
    +    5313.5: thin pack size with --path-walk  1.6K
    +    5313.6: big pack                         5.35(12.43+0.32)
    +    5313.7: big pack size                    52.2M
    +    5313.8: big pack with --path-walk        7.12(11.97+0.27)
    +    5313.9: big pack size with --path-walk   52.1M
    +    5313.10: repack                          87.74(342.90+4.24)
    +    5313.11: repack size                     739.7M
    +    5313.12: repack with --path-walk         212.79(245.05+1.78)
    +    5313.13: repack size with --path-walk    697.6M
    +
    +    [3] https://github.com/nodejs/node
    +
    +    This benefit also repeats in this instance in the Linux kernel repository:
    +
    +    Test                                     HEAD
         ---------------------------------------------------------------
    -    5313.10: repack                          553.61(1929.41+30.31)
    +    5313.2: thin pack                        0.04(0.00+0.03)
    +    5313.3: thin pack size                   4.6K
    +    5313.4: thin pack with --path-walk       0.03(0.01+0.01)
    +    5313.5: thin pack size with --path-walk  4.6K
    +    5313.6: big pack                         21.16(62.81+1.35)
    +    5313.7: big pack size                    158.3M
    +    5313.8: big pack with --path-walk        36.09(55.25+0.67)
    +    5313.9: big pack size with --path-walk   152.2M
    +    5313.10: repack                          734.26(2149.62+31.24)
         5313.11: repack size                     2.5G
    -    5313.12: repack with --path-walk         1777.63(2044.16+7.47)
    -    5313.13: repack size with --path-walk    2.5G
    -
    -    This demonstrates that the --path-walk feature does not always present
    -    measurable improvements, especially in cases where the name-hash has
    -    very few collisions.
    +    5313.12: repack with --path-walk         1457.23(1618.15+7.00)
    +    5313.13: repack size with --path-walk    2.2G
    +
    +    It is important to see that even when the repository shape does not have
    +    many name-hash collisions, there is a slight space boost to be found
    +    using this method. Also, there is no known case where the space is
    +    worse with --path-walk. This is of course due to the second pass where
    +    all objects to be packed are sorted in the usual way and checked for
    +    deltas. This second pass is usually very fast as the path-walk has
    +    primed many objects with quality deltas that short-circuit other delta
    +    computation attempts.
         Signed-off-by: Derrick Stolee
    @@ builtin/repack.c: static int run_update_server_info = 1;
     -	N_("git repack []"),
     +	N_("git repack [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]\n"
     +	   "[--window=] [--depth=] [--threads=] [--keep-pack=]\n"
    -+	   "[--write-midx] [--full-path-walk]"),
    ++	   "[--write-midx] [--path-walk]"),
      	NULL
      };
 7:  6ef8d67af4b !  8:  8bfe5116178 repack: update usage to match docs
    @@ Commit message
         Signed-off-by: Derrick Stolee
    - ## builtin/repack.c ##
    -@@ builtin/repack.c: static char *packdir, *packtmp_name, *packtmp;
    - static const char *const git_repack_usage[] = {
    - 	N_("git repack [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]\n"
    - 	   "[--window=] [--depth=] [--threads=] [--keep-pack=]\n"
    --	   "[--write-midx] [--full-path-walk]"),
    -+	   "[--write-midx] [--path-walk]"),
    - 	NULL
    - };
    -
      ## t/t0450/txt-help-mismatches ##
     @@ t/t0450/txt-help-mismatches: rebase
      remote
 8:  1db90e361ba =  9:  6b9ca51bc7c pack-objects: enable --path-walk via config
 9:  0f3040b4b90 = 10:  3f05c89b9cc scalar: enable path-walk during push via config
10:  030d8ec238e = 11:  37572d70deb pack-objects: refactor path-walk delta phase
11:  fddc320eb0b ! 12:  7ae9a40f346 pack-objects: thread the path-based compression
    @@ Commit message
         objects being packed. (This was tested on a 16-core machine.)
         Test                                     HEAD~1    HEAD
    -    ---------------------------------------------------------------
    -    5313.2: thin pack                        0.01      0.01 +0.0%
    -    5313.4: thin pack with --path-walk       0.01      0.01 +0.0%
    -    5313.6: big pack                         2.54      2.60 +2.4%
    -    5313.8: big pack with --path-walk        4.70      3.09 -34.3%
    -    5313.10: repack                          28.75     28.55 -0.7%
    -    5313.12: repack with --path-walk         108.55    46.14 -57.5%
    +    -----------------------------------------------------------------
    +    5313.2: thin pack                        0.00      0.00 =
    +    5313.3: thin pack size                   589       589 +0.0%
    +    5313.4: thin pack with --path-walk       0.00      0.00 =
    +    5313.5: thin pack size with --path-walk  589       589 +0.0%
    +    5313.6: big pack                         2.84      2.80 -1.4%
    +    5313.7: big pack size                    14.0M     14.1M +0.3%
    +    5313.8: big pack with --path-walk        5.46      3.77 -31.0%
    +    5313.9: big pack size with --path-walk   13.2M     13.2M -0.0%
    +    5313.10: repack                          22.11     21.50 -2.8%
    +    5313.11: repack size                     126.4M    126.2M -0.2%
    +    5313.12: repack with --path-walk         66.89     26.41 -60.5%
    +    5313.13: repack size with --path-walk    109.6M    109.6M +0.0%
    -    On the microsoft/fluentui repo, where the --path-walk feature has been
    -    shown to be more effective in space savings, we get these results:
    +    This 60% reduction in 'git repack --path-walk' time is typical across
    +    all repos I used for testing. What is interesting is to compare when the
    +    overall time improves enough to outperform the standard case. These time
    +    improvements correlate with repositories with data shapes that
    +    significantly improve their data size as well.
    -    Test                                     HEAD~1    HEAD
    +    For example, the microsoft/fluentui repo has a 439M to 122M size
    +    reduction, and the repack time is now 36.6 seconds with --path-walk
    +    compared to 95+ seconds without it:
    +
    +    Test                                     HEAD~!    HEAD
    +    -----------------------------------------------------------------
    +    5313.2: thin pack                        0.41      0.42 +2.4%
    +    5313.3: thin pack size                   1.2M      1.2M +0.0%
    +    5313.4: thin pack with --path-walk       0.08      0.05 -37.5%
    +    5313.5: thin pack size with --path-walk  18.4K     18.4K +0.0%
    +    5313.6: big pack                         4.47      4.53 +1.3%
    +    5313.7: big pack size                    19.6M     19.7M +0.3%
    +    5313.8: big pack with --path-walk        6.76      3.51 -48.1%
    +    5313.9: big pack size with --path-walk   16.5M     16.4M -0.2%
    +    5313.10: repack                          96.87     99.05 +2.3%
    +    5313.11: repack size                     439.5M    439.0M -0.1%
    +    5313.12: repack with --path-walk         95.68     36.55 -61.8%
    +    5313.13: repack size with --path-walk    122.6M    122.6M +0.0%
    +
    +    In a more extreme example, an internal repository that has a similar
    +    name-hash collision issue to microsoft/fluentui reduces its size from
    +    6.4G to 805M with the --path-walk option. This also reduces the
    +    repacking time from 2,138 seconds to 478 seconds.
    +
    +    Test                                     HEAD~1    HEAD
    +    ------------------------------------------------------------------
    +    5313.10: repack                          2138.22   2138.19 -0.0%
    +    5313.11: repack size                     6.4G      6.4G -0.0%
    +    5313.12: repack with --path-walk         1351.46   477.91 -64.6%
    +    5313.13: repack size with --path-walk    804.1M    804.1M -0.0%
    +
    +    Finally, the Linux kernel repository is a good test for this repacking
    +    time change, even though the space savings is more reasonable:
    +
    +    Test                                     HEAD~1    HEAD
         ----------------------------------------------------------------
    -    5313.2: thin pack                        0.39      0.40 +2.6%
    -    5313.4: thin pack with --path-walk       0.08      0.07 -12.5%
    -    5313.6: big pack                         4.15      4.15 +0.0%
    -    5313.8: big pack with --path-walk        6.41      3.21 -49.9%
    -    5313.10: repack                          90.69     90.83 +0.2%
    -    5313.12: repack with --path-walk         108.23    49.09 -54.6%
    +    5313.10: repack                          734.26    735.11 +0.1%
    +    5313.11: repack size                     2.5G      2.5G -0.0%
    +    5313.12: repack with --path-walk         1457.23   598.17 -59.0%
    +    5313.13: repack size with --path-walk    2.2G      2.2G +0.0%
         Signed-off-by: Derrick Stolee
```
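
For readers unfamiliar with the "name-hash collision" problem the commit messages refer to, here is a minimal Python sketch. It is not code from this series. It assumes the classic name hash behaves as a rolling 32-bit value in which each new character pushes older ones two bits to the right, so that roughly only the last sixteen non-whitespace characters of a path matter. The function below is an illustrative transliteration under that assumption, and the three paths are made up.

```python
def pack_name_hash(path: str) -> int:
    """Illustrative 32-bit name hash (assumed shape, ASCII paths only)."""
    h = 0
    for byte in path.encode("ascii"):
        if byte in b" \t\n\v\f\r":
            continue  # whitespace does not contribute
        # Shift older characters down by two bits and put the new one on top;
        # after about sixteen characters an old character has no bits left.
        h = ((h >> 2) + (byte << 24)) & 0xFFFFFFFF
    return h


# Hypothetical paths: two same-named files in different packages,
# plus a differently named file next to the first one.
paths = [
    "packages/react-button/src/components/index.ts",
    "packages/react-button/src/components/Button.tsx",
    "packages/react-card/src/components/index.ts",
]

# Sorting by name hash places the two unrelated index.ts files side by side.
by_hash = sorted(paths, key=pack_name_hash)

for p in by_hash:
    print(f"{pack_name_hash(p):#010x}  {p}")
```

Under this assumption the two `index.ts` paths get equal or adjacent hash values even though they live in different packages, so a hash-sorted delta search compares them against each other. Grouping objects by their full path, as the commit messages describe for `--path-walk`, would instead keep each file's own history together.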