gitgitgadget / git

GitGitGadget's Git fork. Open Pull Requests here to submit them to the Git mailing list
https://gitgitgadget.github.io/
Other
208 stars 135 forks source link

Fix background maintenance regression in Git 2.45.0 #1764

Closed derrickstolee closed 1 month ago

derrickstolee commented 1 month ago

Here is an issue I noticed while exploring issues with my local copy of a large monorepo. I was intending to show some engineers how nice the objects were maintained by background maintenance, but saw hundreds of small pack-files that were up to two months old. This time matched when I upgraded to the microsoft/git fork that included the 2.45.0 release of Git.

The issue is that 'git multi-pack-index repack' was taught to call 'git pack-objects' with the new '--stdin-packs' option. However, this changes the object selection algorithm. Instead of using the objects referenced by the multi-pack-index, it compares pack-files using a list of "included" and "excluded" pack-files. This loses some granularity of how the multi-pack-index chooses among duplicate objects.

The end result is that some objects that would normally have been included in the new pack-file are no longer included. The copy that the multi-pack-index references is in the pack-file that was intended to be repacked, so that pack-file cannot be expired in the next 'git multi-pack-index expire' step and is included again in the batch of objects to repack.

In the context of the change that is reverted by this series, it seems the motivation of the change was two-fold:

  1. some I/O benefits to using pack names over object names, and
  2. the ability to use an object walk to improve delta compression.

In my local prototyping, I've found that we could improve 'git pack-objects' to use an object walk when given a set of objects over stdin without needing to use pack-file names. I do not believe the '--stdin-packs' option should be used for the 'git multi-pack-index repack' mechanism, or at least should be done with great care and only in specific cases where some assumptions can be made around duplicate objects and closure under reachability.

However, the prototype I've built to get these benefits is non-trivial due to working to guarantee that partial clones do not accidentally download missing blobs. That will follow in a separate series that can be reviewed at a slower pace.

Thanks, -Stolee

cc: gitster@pobox.com cc: Taylor Blau me@ttaylorr.com

derrickstolee commented 1 month ago

/submit

gitgitgadget[bot] commented 1 month ago

Submitted as pull.1764.git.1721332546.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1764/derrickstolee/incremental-repack-fix-v1

To fetch this version to local tag pr-1764/derrickstolee/incremental-repack-fix-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1764/derrickstolee/incremental-repack-fix-v1
gitgitgadget[bot] commented 1 month ago

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Here is an issue I noticed while exploring issues with my local copy of a
> large monorepo. I was intending to show some engineers how nice the objects
> were maintained by background maintenance, but saw hundreds of small
> pack-files that were up to two months old. This time matched when I upgraded
> to the microsoft/git fork that included the 2.45.0 release of Git.

I almost said "wow, perfect timing on the -rc1 day", but then
realized that this is not a regression during _this_ cycle, but a
cycle ago.

You already Cc'ed Taylor, whom I would have asked for help if I were
the one who found this issue, which is good.

Will queue.

Thanks.
gitgitgadget[bot] commented 1 month ago

On the Git mailing list, Taylor Blau wrote (reply to this):

On Thu, Jul 18, 2024 at 02:57:55PM -0700, Junio C Hamano wrote:
> You already Cc'ed Taylor, whom I would have asked for help if I were
> the one who found this issue, which is good.

It looks like there is a small typo in my email address as
"me@ttayllorr.com" instead of "me@ttaylorr.com", but I saw the message
anyway ;-).

Thanks,
Taylor
gitgitgadget[bot] commented 1 month ago

User Taylor Blau <me@ttaylorr.com> has been added to the cc: list.

gitgitgadget[bot] commented 1 month ago

On the Git mailing list, Taylor Blau wrote (reply to this):

On Thu, Jul 18, 2024 at 07:55:44PM +0000, Derrick Stolee via GitGitGadget wrote:
> The issue is that 'git multi-pack-index repack' was taught to call 'git
> pack-objects' with the new '--stdin-packs' option. However, this changes the
> object selection algorithm. Instead of using the objects referenced by the
> multi-pack-index, it compares pack-files using a list of "included" and
> "excluded" pack-files. This loses some granularity of how the
> multi-pack-index chooses among duplicate objects.

Thank you for looking at this so carefully! Let me double check my own
understanding.

Suppose we have a MIDX with some pack 'P' and say, some commit object
'C' which appears in that pack. Let's also suppose we have another pack
'Q' in the same MIDX which also contains 'C', but the MIDX selected its
copy from pack 'P'.

If we want to combine 'P' with some other packs (excluding 'Q'), then
the input to --stdin-packs will look something like:

    P.pack
    ^Q.pack
    ...

And the resulting pack would not contain 'C', since we would reject it
via: add_object_entry_from_pack() -> want_object_in_pack() ->
want_found_object() -> has_object_kept_pack(). The final function there
would find a copy of 'C' in 'Q', and since 'Q' is excluded, we would
reject 'C' as unwanted.

So the resulting pack would not contain 'C', and the MIDX would hold
onto its copy from pack 'P', resulting in 'P' being both (a) in the set
of packs to repack together, but also (b) non-expireable, since it has
at least one object selected from it in the MIDX.

> The end result is that some objects that would normally have been included
> in the new pack-file are no longer included. The copy that the
> multi-pack-index references is in the pack-file that was intended to be
> repacked, so that pack-file cannot be expired in the next 'git
> multi-pack-index expire' step and is included again in the batch of objects
> to repack.

I think this matches my own understanding, but let me know if I'm
missing something. Assuming I'm thinking about this the same way you
are, the fix (stop using --stdin-packss) makes sense to me.

Thanks,
Taylor
gitgitgadget[bot] commented 1 month ago

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 7/18/24 6:50 PM, Taylor Blau wrote:
> On Thu, Jul 18, 2024 at 07:55:44PM +0000, Derrick Stolee via GitGitGadget wrote:
>> The issue is that 'git multi-pack-index repack' was taught to call 'git
>> pack-objects' with the new '--stdin-packs' option. However, this changes the
>> object selection algorithm. Instead of using the objects referenced by the
>> multi-pack-index, it compares pack-files using a list of "included" and
>> "excluded" pack-files. This loses some granularity of how the
>> multi-pack-index chooses among duplicate objects.
> > Thank you for looking at this so carefully! Let me double check my own
> understanding.
> > Suppose we have a MIDX with some pack 'P' and say, some commit object
> 'C' which appears in that pack. Let's also suppose we have another pack
> 'Q' in the same MIDX which also contains 'C', but the MIDX selected its
> copy from pack 'P'.
> > If we want to combine 'P' with some other packs (excluding 'Q'), then
> the input to --stdin-packs will look something like:
> >      P.pack
>      ^Q.pack
>      ...
> > And the resulting pack would not contain 'C', since we would reject it
> via: add_object_entry_from_pack() -> want_object_in_pack() ->
> want_found_object() -> has_object_kept_pack(). The final function there
> would find a copy of 'C' in 'Q', and since 'Q' is excluded, we would
> reject 'C' as unwanted.
> > So the resulting pack would not contain 'C', and the MIDX would hold
> onto its copy from pack 'P', resulting in 'P' being both (a) in the set
> of packs to repack together, but also (b) non-expireable, since it has
> at least one object selected from it in the MIDX.
> >> The end result is that some objects that would normally have been included
>> in the new pack-file are no longer included. The copy that the
>> multi-pack-index references is in the pack-file that was intended to be
>> repacked, so that pack-file cannot be expired in the next 'git
>> multi-pack-index expire' step and is included again in the batch of objects
>> to repack.
> > I think this matches my own understanding, but let me know if I'm
> missing something. Assuming I'm thinking about this the same way you
> are, the fix (stop using --stdin-packss) makes sense to me.

Your interpretation matches mine. Thanks for the careful read.

I think we can accomplish similar goals that match the reasoning for
--stdin-packs (better deltas while also limiting the object walk to the repacked objects) with some changes to read_object_list_from_stdin(),
but that's a more subtle kind of change.

Taking this change as-is will cause a regression in the quality of
delta choices, but recovers our ability to expire "repacked" pack-files
in all cases.

Thanks,
-Stolee
gitgitgadget[bot] commented 1 month ago

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 7/18/24 6:38 PM, Taylor Blau wrote:
> On Thu, Jul 18, 2024 at 02:57:55PM -0700, Junio C Hamano wrote:
>> You already Cc'ed Taylor, whom I would have asked for help if I were
>> the one who found this issue, which is good.
> > It looks like there is a small typo in my email address as
> "me@ttayllorr.com" instead of "me@ttaylorr.com", but I saw the message
> anyway ;-).

Whoops! Sorry about that. Repeated too many letters. I've fixed the
GitGitGadget cover letter in case a v2 is required.

Thanks,
-Stolee
gitgitgadget[bot] commented 1 month ago

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 7/18/24 5:57 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> >> Here is an issue I noticed while exploring issues with my local copy of a
>> large monorepo. I was intending to show some engineers how nice the objects
>> were maintained by background maintenance, but saw hundreds of small
>> pack-files that were up to two months old. This time matched when I upgraded
>> to the microsoft/git fork that included the 2.45.0 release of Git.
> > I almost said "wow, perfect timing on the -rc1 day", but then
> realized that this is not a regression during _this_ cycle, but a
> cycle ago.

I almost waited until after the release, but I wanted to put the
information out there just in case you were interested in taking it
into 2.46.0 or were planning on a 2.45.3.

Thanks,
-Stolee
gitgitgadget[bot] commented 1 month ago

On the Git mailing list, Taylor Blau wrote (reply to this):

On Fri, Jul 19, 2024 at 09:21:51AM -0400, Derrick Stolee wrote:
> On 7/18/24 6:50 PM, Taylor Blau wrote:
>
> > I think this matches my own understanding, but let me know if I'm
> > missing something. Assuming I'm thinking about this the same way you
> > are, the fix (stop using --stdin-packss) makes sense to me.
>
> Your interpretation matches mine. Thanks for the careful read.
>
> I think we can accomplish similar goals that match the reasoning for
> --stdin-packs (better deltas while also limiting the object walk to the
> repacked objects) with some changes to read_object_list_from_stdin(),
> but that's a more subtle kind of change.

FWIW, the main motivation for that change was to limit the amount of
cross-process I/O that was necessary to generate the new pack. I figured
that for relatively small amounts of packs which contain relatively
large amounts of objects that it would be more efficient to write out
the pack names than the object names.

I was thinking a little bit about how we would alter the behavior of
'--stdin-packs' to match what the 'multi-pack-index repack' caller
needs. I agree that it is possible, and I doubly agree that it is subtle
;-).

TBH, I think that the amount of I/O we're potentially saving is dwarfed
by the amount of I/O and CPU time it takes to actually generate the new
pack, so I doubt the effort to make such a subtle change would be all
that worthwhile, though certainly an interesting exercise ;-).

> Taking this change as-is will cause a regression in the quality of
> delta choices, but recovers our ability to expire "repacked" pack-files
> in all cases.

Yeah, definitely.

Thanks,
Taylor
gitgitgadget[bot] commented 1 month ago

On the Git mailing list, Junio C Hamano wrote (reply to this):

Derrick Stolee <stolee@gmail.com> writes:

> On 7/18/24 5:57 PM, Junio C Hamano wrote:
>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> 
>>> Here is an issue I noticed while exploring issues with my local copy of a
>>> large monorepo. I was intending to show some engineers how nice the objects
>>> were maintained by background maintenance, but saw hundreds of small
>>> pack-files that were up to two months old. This time matched when I upgraded
>>> to the microsoft/git fork that included the 2.45.0 release of Git.
>> I almost said "wow, perfect timing on the -rc1 day", but then
>> realized that this is not a regression during _this_ cycle, but a
>> cycle ago.
>
> I almost waited until after the release, but I wanted to put the
> information out there just in case you were interested in taking it
> into 2.46.0 or were planning on a 2.45.3.

Yup, thanks but this is not exactly a repository breaking data
corruption bug, and did not look ultra urgent.  Especially if we
want to pursue a solution that helps both expiring stale packs
better (which is what you are restoring) and making better delta
chain selection (which may be what you are losing) at the same time,
such a change could become a source of data corruption bug, so I'd
prefer to see it started early in a cycle, rather as a last-minute
"let's fix this too".

Thanks.
gitgitgadget[bot] commented 1 month ago

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 7/19/24 11:13 AM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
> >> On 7/18/24 5:57 PM, Junio C Hamano wrote:
>>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>>
>>>> Here is an issue I noticed while exploring issues with my local copy of a
>>>> large monorepo. I was intending to show some engineers how nice the objects
>>>> were maintained by background maintenance, but saw hundreds of small
>>>> pack-files that were up to two months old. This time matched when I upgraded
>>>> to the microsoft/git fork that included the 2.45.0 release of Git.
>>> I almost said "wow, perfect timing on the -rc1 day", but then
>>> realized that this is not a regression during _this_ cycle, but a
>>> cycle ago.
>>
>> I almost waited until after the release, but I wanted to put the
>> information out there just in case you were interested in taking it
>> into 2.46.0 or were planning on a 2.45.3.
> > Yup, thanks but this is not exactly a repository breaking data
> corruption bug, and did not look ultra urgent.  Especially if we
> want to pursue a solution that helps both expiring stale packs
> better (which is what you are restoring) and making better delta
> chain selection (which may be what you are losing) at the same time,
> such a change could become a source of data corruption bug, so I'd
> prefer to see it started early in a cycle, rather as a last-minute
> "let's fix this too".

I agree with this assessment. The microsoft/git fork has taken the
revert as users of that fork have a higher chance of being affected.

Thanks,
-Stolee
gitgitgadget[bot] commented 1 month ago

This patch series was integrated into seen via https://github.com/git/git/commit/f9f80cf06a0a8ed8dde38f3c3c705b7a8b93ee0e.

gitgitgadget[bot] commented 1 month ago

This branch is now known as ds/midx-write-repack-fix.

gitgitgadget[bot] commented 1 month ago

This patch series was integrated into seen via https://github.com/git/git/commit/b86f6af6d05cd225013ba4b4f3b5c52b721cdc6e.

gitgitgadget[bot] commented 1 month ago

This patch series was integrated into seen via https://github.com/git/git/commit/7add94b3bb4ad57da562558bc1ce7f35f2e93cab.

gitgitgadget[bot] commented 1 month ago

This patch series was integrated into next via https://github.com/git/git/commit/0a64b3ed4263708d0d5382762def2b01798d0dee.

gitgitgadget[bot] commented 1 month ago

There was a status update in the "New Topics" section about the branch ds/midx-write-repack-fix on the Git mailing list:

Repacking a repository with multi-pack index started making stupid
pack selections in Git 2.45, which has been corrected.

Will cook in 'next'.
source: <pull.1764.git.1721332546.gitgitgadget@gmail.com>
gitgitgadget[bot] commented 1 month ago

This patch series was integrated into seen via https://github.com/git/git/commit/8f980981c0c5c48bfa5d5c28f014222202ef4366.

gitgitgadget[bot] commented 1 month ago

This patch series was integrated into seen via https://github.com/git/git/commit/d144949f5ea1cb57686e7961c816f56dcf408310.

gitgitgadget[bot] commented 1 month ago

This patch series was integrated into seen via https://github.com/git/git/commit/ec9d46588ec063685b4abfbe13dfd9da39c77d7c.

gitgitgadget[bot] commented 1 month ago

This patch series was integrated into master via https://github.com/git/git/commit/ec9d46588ec063685b4abfbe13dfd9da39c77d7c.

gitgitgadget[bot] commented 1 month ago

This patch series was integrated into next via https://github.com/git/git/commit/ec9d46588ec063685b4abfbe13dfd9da39c77d7c.

gitgitgadget[bot] commented 1 month ago

Closed via ec9d46588ec063685b4abfbe13dfd9da39c77d7c.