gitgitgadget / git

GitGitGadget's Git fork. Open Pull Requests here to submit them to the Git mailing list
https://gitgitgadget.github.io/
Other
208 stars 135 forks source link

object checking related additions and fixes for bundles in fetches #1730

Closed blanet closed 2 months ago

blanet commented 3 months ago

While attempting to fix a reference negotiation bug in bundle-uri, we identified that the fetch process lacks some crucial object validation checks when processing bundles. The primary issues are:

  1. In the bundle-uri scenario, object IDs were not validated before writing bundle references. This was the root cause of the original negotiation bug in bundle-uri and could lead to potential repository corruption.
  2. The existing "fetch.fsckObjects" and "transfer.fsckObjects" configurations were not applied when directly fetching bundles or fetching with bundle-uri enabled. In fact, there were no object validation supports for unbundle.

The first patch addresses the bundle-uri negotiation issue by removing the REF_SKIP_OID_VERIFICATION flag when writing bundle references.

Patches 2 through 3 extend verify_bundle_flags for bundle.c:unbundle to add support for object validation (fsck) in fetch scenarios, mainly following the suggestions from Junio and Patrick on the mailing list.

cc: Patrick Steinhardt ps@pks.im cc: Karthik Nayak karthik.188@gmail.com

blanet commented 3 months ago

/preview

gitgitgadget[bot] commented 3 months ago

Preview email sent as pull.1730.git.1715741156686.gitgitgadget@gmail.com

blanet commented 3 months ago

/submit

gitgitgadget[bot] commented 3 months ago

Submitted as pull.1730.git.1715742069966.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1730/blanet/xx/bundle-uri-bug-using-bundle-list-v1

To fetch this version to local tag pr-1730/blanet/xx/bundle-uri-bug-using-bundle-list-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1730/blanet/xx/bundle-uri-bug-using-bundle-list-v1
gitgitgadget[bot] commented 3 months ago

On the Git mailing list, Patrick Steinhardt wrote (reply to this):

On Wed, May 15, 2024 at 03:01:09AM +0000, blanet via GitGitGadget wrote:
> From: Xing Xin <xingxin.xx@bytedance.com>

Long time no see :)

> When using the bundle-uri mechanism with a bundle list containing
> multiple interrelated bundles, we encountered a bug where tips from
> downloaded bundles were not being discovered, resulting in rather slow
> clones. This was particularly problematic when employing the heuristic
> `creationTokens`.
> 
> And this is easy to reproduce. Suppose we have a repository with a
> single branch `main` pointing to commit `A`, firstly we create a base
> bundle with
> 
>   git bundle create base.bundle main
> 
> Then let's add a new commit `B` on top of `A`, so that an incremental
> bundle for `main` can be created with
> 
>   git bundle create incr.bundle A..main
> 
> Now we can generate a bundle list with the following content:
> 
>   [bundle]
>       version = 1
>       mode = all
>       heuristic = creationToken
> 
>   [bundle "base"]
>       uri = base.bundle
>       creationToken = 1
> 
>   [bundle "incr"]
>       uri = incr.bundle
>       creationToken = 2
> 
> A fresh clone with the bundle list above would give the expected
> `refs/bundles/main` pointing at `B` in new repository, in other words we
> already had everything locally from the bundles, but git would still
> download everything from server as if we got nothing.
> 
> So why the `refs/bundles/main` is not discovered? After some digging I
> found that:
> 
> 1. when unbundling a downloaded bundle, a `verify_bundle` is called to
>    check its prerequisites if any. The verify procedure would find oids
>    so `packed_git` is initialized.
> 
> 2. after unbundled all bundles, we would enter `do_fetch_pack_v2`,
>    during which `mark_complete_and_common_ref` and `mark_tips` would
>    find oids with `OBJECT_INFO_QUICK` flag set, so no new packs would be
>    enlisted if `packed_git` has already initialized in 1.

And I assume we do not want it to not use `OBJECT_INFO_QUICK`?

> Back to the example above, when unbunding `incr.bundle`, `base.pack` is
> enlisted to `packed_git` bacause of the prerequisites to verify. Then we
> can not find `B` for negotiation at a latter time bacause `B` exists in
> `incr.pack` which is not enlisted in `packed_git`.

Okay, the explanation feels sensible.

> This commit fixes this by adding a `reprepare_packed_git` call for every
> successfully unbundled bundle, which ensures to enlist all generated
> packs from bundle uri. And a set of negotiation related tests are added.

This makes me wonder though. Do we really need to call
`reprepare_packed_git()` once for every bundle, or can't we instead call
it once at the end once we have fetched all bundles? Reading on.

> Signed-off-by: Xing Xin <xingxin.xx@bytedance.com>
> ---
>     bundle-uri: refresh packed_git if unbundle succeed
> 
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1730%2Fblanet%2Fxx%2Fbundle-uri-bug-using-bundle-list-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1730/blanet/xx/bundle-uri-bug-using-bundle-list-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/1730
> 
>  bundle-uri.c                |   3 +
>  t/t5558-clone-bundle-uri.sh | 129 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 132 insertions(+)
> 
> diff --git a/bundle-uri.c b/bundle-uri.c
> index ca32050a78f..2b9d36cfd8e 100644
> --- a/bundle-uri.c
> +++ b/bundle-uri.c
> @@ -7,6 +7,7 @@
>  #include "refs.h"
>  #include "run-command.h"
>  #include "hashmap.h"
> +#include "packfile.h"
>  #include "pkt-line.h"
>  #include "config.h"
>  #include "remote.h"
> @@ -376,6 +377,8 @@ static int unbundle_from_file(struct repository *r, const char *file)
>                  VERIFY_BUNDLE_QUIET)))
>       return 1;
>  
> + reprepare_packed_git(r);
> +

So what's hidden here is that `unbundle_from_file()` will also try to
access the bundle's refs right away. Surprisingly, we do so by calling
`refs_update_ref()` with `REF_SKIP_OID_VERIFICATION`, which has the
effect that we basically accept arbitrary object IDs here even if we do
not know those. That's why we didn't have to `reprepare_packed_git()`
before this change.

Now there are two conflicting thoughts here:

  - Either we can now drop `REF_SKIP_OID_VERIFICATION` as the object IDs
    should now be accessible.

  - Or we can avoid calling `reprepare_packed_git()` inside the loop and
    instead call it once after we have fetched all bundles.

The second one feels a bit like premature optimization to me. But the
first item does feel like it could help us to catch broken bundles
because we wouldn't end up creating refs for objects that neither we nor
the bundle have.

Patrick
gitgitgadget[bot] commented 3 months ago

User Patrick Steinhardt <ps@pks.im> has been added to the cc: list.

gitgitgadget[bot] commented 3 months ago

On the Git mailing list, Karthik Nayak wrote (reply to this):

"blanet via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Xing Xin <xingxin.xx@bytedance.com>>
> When using the bundle-uri mechanism with a bundle list containing
> multiple interrelated bundles, we encountered a bug where tips from
> downloaded bundles were not being discovered, resulting in rather slow
> clones. This was particularly problematic when employing the heuristic
> `creationTokens`.
>
> And this is easy to reproduce. Suppose we have a repository with a
> single branch `main` pointing to commit `A`, firstly we create a base
> bundle with
>
>   git bundle create base.bundle main
>
> Then let's add a new commit `B` on top of `A`, so that an incremental
> bundle for `main` can be created with
>
>   git bundle create incr.bundle A..main
>
> Now we can generate a bundle list with the following content:
>
>   [bundle]
>       version = 1
>       mode = all
>       heuristic = creationToken
>
>   [bundle "base"]
>       uri = base.bundle
>       creationToken = 1
>
>   [bundle "incr"]
>       uri = incr.bundle
>       creationToken = 2
>
> A fresh clone with the bundle list above would give the expected
> `refs/bundles/main` pointing at `B` in new repository, in other words we
> already had everything locally from the bundles, but git would still
> download everything from server as if we got nothing.
>
> So why the `refs/bundles/main` is not discovered? After some digging I
> found that:
>
> 1. when unbundling a downloaded bundle, a `verify_bundle` is called to

s/a//

>    check its prerequisites if any. The verify procedure would find oids
>    so `packed_git` is initialized.
>

So the flow is:
1. `fetch_bundle_list` fetches all the bundles advertised via
`download_bundle_list` to local files.
2. It then calls `unbundle_all_bundles` to unbundle all the bundles.
3. Each bundle is then unbundled using `unbundle_from_file`.
4. Here, we first read the bundle header to get all the prerequisites
for the bundle, this is done in `read_bundle_header`.
5. Then we call `unbundle`, which calls `verify_bundle` to ensure that
the repository does indeed contain the prerequisites mentioned in the
bundle. Then it creates the index from the bundle file.

So because the objects are being checked, the `prepare_packed_git`
function is eventually called, which means that the
`raw_object_store->packed_git` data gets filled in and
`packed_git_initialized` is set.

This means consecutive calls to `prepare_packed_git` doesn't
re-initiate `raw_object_store->packed_git` since
`packed_git_initialized` already is set.

So your explanation makes sense, as a _nit_ I would perhaps add the part
about why consecutive calls to `prepare_packed_git` are ineffective.

> 2. after unbundled all bundles, we would enter `do_fetch_pack_v2`,

s/unbundled/unbundling

>    during which `mark_complete_and_common_ref` and `mark_tips` would
>    find oids with `OBJECT_INFO_QUICK` flag set, so no new packs would be
>    enlisted if `packed_git` has already initialized in 1.
> Back to the example above, when unbunding `incr.bundle`, `base.pack` is
> enlisted to `packed_git` bacause of the prerequisites to verify. Then we
> can not find `B` for negotiation at a latter time bacause `B` exists in
> `incr.pack` which is not enlisted in `packed_git`.
>
> This commit fixes this by adding a `reprepare_packed_git` call for every
> successfully unbundled bundle, which ensures to enlist all generated
> packs from bundle uri. And a set of negotiation related tests are added.
>

The solution makes sense.

> Signed-off-by: Xing Xin <xingxin.xx@bytedance.com>
> ---
>     bundle-uri: refresh packed_git if unbundle succeed
>
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1730%2Fblanet%2Fxx%2Fbundle-uri-bug-using-bundle-list-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1730/blanet/xx/bundle-uri-bug-using-bundle-list-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/1730
>
>  bundle-uri.c                |   3 +
>  t/t5558-clone-bundle-uri.sh | 129 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 132 insertions(+)
>
> diff --git a/bundle-uri.c b/bundle-uri.c
> index ca32050a78f..2b9d36cfd8e 100644
> --- a/bundle-uri.c
> +++ b/bundle-uri.c
> @@ -7,6 +7,7 @@
>  #include "refs.h"
>  #include "run-command.h"
>  #include "hashmap.h"
> +#include "packfile.h"
>  #include "pkt-line.h"
>  #include "config.h"
>  #include "remote.h"
> @@ -376,6 +377,8 @@ static int unbundle_from_file(struct repository *r, const char *file)
>                  VERIFY_BUNDLE_QUIET)))
>       return 1;
>
> + reprepare_packed_git(r);
> +
>

Would it make sense to move this to `bundle.c:unbundle()`, since that is
also where the idx is created?

[snip]
gitgitgadget[bot] commented 3 months ago

User Karthik Nayak <karthik.188@gmail.com> has been added to the cc: list.

gitgitgadget[bot] commented 3 months ago

On the Git mailing list, Junio C Hamano wrote (reply to this):

Patrick Steinhardt <ps@pks.im> writes:

> Now there are two conflicting thoughts here:
>
>   - Either we can now drop `REF_SKIP_OID_VERIFICATION` as the object IDs
>     should now be accessible.
>
>   - Or we can avoid calling `reprepare_packed_git()` inside the loop and
>     instead call it once after we have fetched all bundles.
>
> The second one feels a bit like premature optimization to me. But the
> first item does feel like it could help us to catch broken bundles
> because we wouldn't end up creating refs for objects that neither we nor
> the bundle have.

I like the way your thoughts are structured around here.

I do agree that the latter is a wrong approach---we shouldn't be
trusting what came from elsewhere over the network without first
checking.  We should probably be running the "index-pack --fix-thin"
the unbundling process runs with also the "--fsck-objects" option if
we are not doing so already, and even then, we should make sure that
the object we are making our ref point at have everything behind it.
gitgitgadget[bot] commented 3 months ago

On the Git mailing list, "Xing Xin" wrote (reply to this):

At 2024-05-17 13:00:49, "Patrick Steinhardt" <ps@pks.im> wrote:
>On Wed, May 15, 2024 at 03:01:09AM +0000, blanet via GitGitGadget wrote:
>> From: Xing Xin <xingxin.xx@bytedance.com>
>
>Long time no see :)

Glad to see you again here :)

>> 
>> So why the `refs/bundles/main` is not discovered? After some digging I
>> found that:
>> 
>> 1. when unbundling a downloaded bundle, a `verify_bundle` is called to
>>    check its prerequisites if any. The verify procedure would find oids
>>    so `packed_git` is initialized.
>> 
>> 2. after unbundled all bundles, we would enter `do_fetch_pack_v2`,
>>    during which `mark_complete_and_common_ref` and `mark_tips` would
>>    find oids with `OBJECT_INFO_QUICK` flag set, so no new packs would be
>>    enlisted if `packed_git` has already initialized in 1.
>
>And I assume we do not want it to not use `OBJECT_INFO_QUICK`?

I think so. For clones or fetches without using bundle-uri, we can hardly
encounter the case that new packs are added during the negotiation process.
So using `OBJECT_INFO_QUICK` can get some performance gain.

>
>> Back to the example above, when unbunding `incr.bundle`, `base.pack` is
>> enlisted to `packed_git` bacause of the prerequisites to verify. Then we
>> can not find `B` for negotiation at a latter time bacause `B` exists in
>> `incr.pack` which is not enlisted in `packed_git`.
>
>Okay, the explanation feels sensible.
>
>> This commit fixes this by adding a `reprepare_packed_git` call for every
>> successfully unbundled bundle, which ensures to enlist all generated
>> packs from bundle uri. And a set of negotiation related tests are added.
>
>This makes me wonder though. Do we really need to call
>`reprepare_packed_git()` once for every bundle, or can't we instead call
>it once at the end once we have fetched all bundles? Reading on.
>
>> Signed-off-by: Xing Xin <xingxin.xx@bytedance.com>
>> ---
>>     bundle-uri: refresh packed_git if unbundle succeed
>> 
>> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1730%2Fblanet%2Fxx%2Fbundle-uri-bug-using-bundle-list-v1
>> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1730/blanet/xx/bundle-uri-bug-using-bundle-list-v1
>> Pull-Request: https://github.com/gitgitgadget/git/pull/1730
>> 
>>  bundle-uri.c                |   3 +
>>  t/t5558-clone-bundle-uri.sh | 129 ++++++++++++++++++++++++++++++++++++
>>  2 files changed, 132 insertions(+)
>> 
>> diff --git a/bundle-uri.c b/bundle-uri.c
>> index ca32050a78f..2b9d36cfd8e 100644
>> --- a/bundle-uri.c
>> +++ b/bundle-uri.c
>> @@ -7,6 +7,7 @@
>>  #include "refs.h"
>>  #include "run-command.h"
>>  #include "hashmap.h"
>> +#include "packfile.h"
>>  #include "pkt-line.h"
>>  #include "config.h"
>>  #include "remote.h"
>> @@ -376,6 +377,8 @@ static int unbundle_from_file(struct repository *r, const char *file)
>>                     VERIFY_BUNDLE_QUIET)))
>>          return 1;
>>  
>> +    reprepare_packed_git(r);
>> +
>
>So what's hidden here is that `unbundle_from_file()` will also try to
>access the bundle's refs right away. Surprisingly, we do so by calling
>`refs_update_ref()` with `REF_SKIP_OID_VERIFICATION`, which has the
>effect that we basically accept arbitrary object IDs here even if we do
>not know those. That's why we didn't have to `reprepare_packed_git()`
>before this change.

You are right! I tried dropping this `REF_SKIP_OID_VERIFICATION` flag and
the negotiation works as expected.

After some further digging I find that without `REF_SKIP_OID_VERIFICATION`,
both `write_ref_to_lockfile` for files backend and `reftable_be_transaction_prepare`
for reftable backend would call `parse_object` to check the oid. `parse_object`
can help refresh `packed_git` via `reprepare_packed_git`.

>
>Now there are two conflicting thoughts here:
>
>  - Either we can now drop `REF_SKIP_OID_VERIFICATION` as the object IDs
>    should now be accessible.
>
>  - Or we can avoid calling `reprepare_packed_git()` inside the loop and
>    instead call it once after we have fetched all bundles.
>
>The second one feels a bit like premature optimization to me. But the
>first item does feel like it could help us to catch broken bundles
>because we wouldn't end up creating refs for objects that neither we nor
>the bundle have.

I favor the first approach because a validation on the object IDs we are
writing is a safe guard . And the flag itself was designed to be used in
testing scenarios.

/*
 * Blindly write an object_id. This is useful for testing data corruption
 * scenarios.
 */
#define REF_SKIP_OID_VERIFICATION (1 << 10)
gitgitgadget[bot] commented 3 months ago

On the Git mailing list, "Xing Xin" wrote (reply to this):

At 2024-05-17 15:36:53, "Karthik Nayak" <karthik.188@gmail.com> wrote:
>"blanet via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> From: Xing Xin <xingxin.xx@bytedance.com>>
>> When using the bundle-uri mechanism with a bundle list containing
>> multiple interrelated bundles, we encountered a bug where tips from
>> downloaded bundles were not being discovered, resulting in rather slow
>> clones. This was particularly problematic when employing the heuristic
>> `creationTokens`.
>>
>> And this is easy to reproduce. Suppose we have a repository with a
>> single branch `main` pointing to commit `A`, firstly we create a base
>> bundle with
>>
>>   git bundle create base.bundle main
>>
>> Then let's add a new commit `B` on top of `A`, so that an incremental
>> bundle for `main` can be created with
>>
>>   git bundle create incr.bundle A..main
>>
>> Now we can generate a bundle list with the following content:
>>
>>   [bundle]
>>       version = 1
>>       mode = all
>>       heuristic = creationToken
>>
>>   [bundle "base"]
>>       uri = base.bundle
>>       creationToken = 1
>>
>>   [bundle "incr"]
>>       uri = incr.bundle
>>       creationToken = 2
>>
>> A fresh clone with the bundle list above would give the expected
>> `refs/bundles/main` pointing at `B` in new repository, in other words we
>> already had everything locally from the bundles, but git would still
>> download everything from server as if we got nothing.
>>
>> So why the `refs/bundles/main` is not discovered? After some digging I
>> found that:
>>
>> 1. when unbundling a downloaded bundle, a `verify_bundle` is called to
>
>s/a//

Thanks!

>
>>    check its prerequisites if any. The verify procedure would find oids
>>    so `packed_git` is initialized.
>>
>
>So the flow is:
>1. `fetch_bundle_list` fetches all the bundles advertised via
>`download_bundle_list` to local files.
>2. It then calls `unbundle_all_bundles` to unbundle all the bundles.
>3. Each bundle is then unbundled using `unbundle_from_file`.
>4. Here, we first read the bundle header to get all the prerequisites
>for the bundle, this is done in `read_bundle_header`.
>5. Then we call `unbundle`, which calls `verify_bundle` to ensure that
>the repository does indeed contain the prerequisites mentioned in the
>bundle. Then it creates the index from the bundle file.
>
>So because the objects are being checked, the `prepare_packed_git`
>function is eventually called, which means that the
>`raw_object_store->packed_git` data gets filled in and
>`packed_git_initialized` is set.
>
>This means consecutive calls to `prepare_packed_git` doesn't
>re-initiate `raw_object_store->packed_git` since
>`packed_git_initialized` already is set.
>
>So your explanation makes sense, as a _nit_ I would perhaps add the part
>about why consecutive calls to `prepare_packed_git` are ineffective.

Thanks my friend, you have expressed this issue more clearly. I will
post a new description based on your explanation with the creationToken
case covered.

>
>> 2. after unbundled all bundles, we would enter `do_fetch_pack_v2`,
>
>s/unbundled/unbundling

Copy that.

>
>> +#include "packfile.h"
>>  #include "pkt-line.h"
>>  #include "config.h"
>>  #include "remote.h"
>> @@ -376,6 +377,8 @@ static int unbundle_from_file(struct repository *r, const char *file)
>>                     VERIFY_BUNDLE_QUIET)))
>>          return 1;
>>
>> +    reprepare_packed_git(r);
>> +
>>
>
>Would it make sense to move this to `bundle.c:unbundle()`, since that is
>also where the idx is created?
>

I wonder if we need a mental model that we should `reprepare_packed_git`
that when a new pack and its corresponding idx is generated? Currently
whether to call `reprepare_packed_git` is determined by the caller.

But within the scope of this bug, I tend to remove the
`REF_SKIP_OID_VERIFICATION` flag when writing refs as Patrick suggested.

Xing Xin
gitgitgadget[bot] commented 3 months ago

On the Git mailing list, "Xing Xin" wrote (reply to this):

At 2024-05-18 00:14:47, "Junio C Hamano" <gitster@pobox.com> wrote:
>Patrick Steinhardt <ps@pks.im> writes:
>
>> Now there are two conflicting thoughts here:
>>
>>   - Either we can now drop `REF_SKIP_OID_VERIFICATION` as the object IDs
>>     should now be accessible.
>>
>>   - Or we can avoid calling `reprepare_packed_git()` inside the loop and
>>     instead call it once after we have fetched all bundles.
>>
>> The second one feels a bit like premature optimization to me. But the
>> first item does feel like it could help us to catch broken bundles
>> because we wouldn't end up creating refs for objects that neither we nor
>> the bundle have.
>
>I like the way your thoughts are structured around here.
>
>I do agree that the latter is a wrong approach---we shouldn't be
>trusting what came from elsewhere over the network without first
>checking.  We should probably be running the "index-pack --fix-thin"
>the unbundling process runs with also the "--fsck-objects" option if
>we are not doing so already, and even then, we should make sure that
>the object we are making our ref point at have everything behind it.

Currently `unbundle` and all its callers are not adding "--fsck-objects".
There is a param `extra_index_pack_args` for `unbundle` but it is
mainly used for progress related options.

Personally I think data from bundles and data received via network
should be treated equally. For "fetch-pack" we now have some configs
such as  "fetch.fsckobjects" and "transfer.fsckobjects" to decide the
behavior, these configs are invisible when we are fetching bundles.

So for bundles I think we have some different ways to go:

  - Acknowledge the configs mentioned above and behave like
    "fetch-pack".

  - Add "--fsck-objects" as a default in `unbundle`.

  - In `unbundle_from_file`, pass in "--fsck-objects" as
    `extra_index_pack_args` for `unbundle` so this only affect the
    bundle-uri related process.

What do you think?

Xing Xin
blanet commented 3 months ago

/submit

gitgitgadget[bot] commented 3 months ago

Submitted as pull.1730.v2.git.1716208605926.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1730/blanet/xx/bundle-uri-bug-using-bundle-list-v2

To fetch this version to local tag pr-1730/blanet/xx/bundle-uri-bug-using-bundle-list-v2:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1730/blanet/xx/bundle-uri-bug-using-bundle-list-v2
gitgitgadget[bot] commented 3 months ago

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Xing Xin" <bupt_xingxin@163.com> writes:

> Personally I think data from bundles and data received via network
> should be treated equally.

Yup, that is not personal ;-) but is universally accepted as a good
discipline.  In the case of bundle-uri, the bundle came over the
network so it is even more true that they should be treated the
same.

> For "fetch-pack" we now have some configs
> such as  "fetch.fsckobjects" and "transfer.fsckobjects" to decide the
> behavior, these configs are invisible when we are fetching bundles.

When fetching over network, transport.c:fetch_refs_via_pack() calls
fetch_pack.c:fetch_pack(), which eventually calls get_pack() and the
configuration variables are honored there.  It appears that the
transport layer is unaware of the .fsckobjects configuration knobs.

When fetching from a bundle, transport.c:fetch_refs_from_bundle()
calls bundle.c:unbundle().  This function has three callers, i.e.
"git bundle unbundle", normal fetching from a bundle, and more
recently added bundle-uri codepaths.  

I think one reasonable approach to take is to add an extra parameter
that takes one of three values: (never, use-config, always), and
conditionally add "--fsck-objects" to the command line of the
index-pack.  Teach "git bundle unbundle" the "--fsck-objects" option
so that it can pass 'never' or 'always' from the command line, and
pass 'use-config' from the code paths for normal fetching from a
budnle and bundle-uri.

To implement use-config, you'd probably need to refactor a small
part of fetch-pack.c:get_pack()

    if (fetch_fsck_objects >= 0
        ? fetch_fsck_objects
        : transfer_fsck_objects >= 0
        ? transfer_fsck_objects
        : 0)
        fsck_objects = 1;

into a public function (to support a caller like unbundle() that
comes from sideways, the new function may also need to call
fetch_pack_setup() to prime them).

A patch series may take a structure like so:

 * define enum { UNBUNDLE_FSCK_NEVER, UNBUNDLE_FSCK_ALWAYS } in
   bundle.h, have bundle.c:unbundle() accept a new parameter of that
   type, and conditionally add "--fsck-objects" to its call to
   "index-pack".  "git bundle unbundle" can pass 'never' to its
   invocation to unbundle() as an easy way to test it.  For the
   other two callers, we can start by passing 'always'.

 * (optional) teach "git bundle unbundle" a new "--fsck-objects"
   option to allow passing 'always' to its call to unbundle().  With
   that, add tests to feed it a bundle with questionable objects in
   it and make sure that unbundling notices.

 * refactor fetch-pack.c:get_pack() to make the fetch-then-transfer
   configuration logic available to external callers.

 * Add UNBUNDLE_FSCK_USE_CONFIG to the enum, enhance unbundle() to
   react to the value by calling the helper function you introduced
   in the previous step.

Thanks.
gitgitgadget[bot] commented 3 months ago

On the Git mailing list, Karthik Nayak wrote (reply to this):

"blanet via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Xing Xin <xingxin.xx@bytedance.com>
>
> When using the bundle-uri mechanism with a bundle list containing
> multiple interrelated bundles, we encountered a bug where tips from
> downloaded bundles were not discovered, thus resulting in rather slow
> clones. This was particularly problematic when employing the heuristic
> `creationTokens`.
>
> And this is easy to reproduce. Suppose we have a repository with a
> single branch `main` pointing to commit `A`, firstly we create a base
> bundle with
>
>   git bundle create base.bundle main
>
> Then let's add a new commit `B` on top of `A`, so that an incremental
> bundle for `main` can be created with
>
>   git bundle create incr.bundle A..main
>
> Now we can generate a bundle list with the following content:
>
>   [bundle]
>       version = 1
>       mode = all
>       heuristic = creationToken
>
>   [bundle "base"]
>       uri = base.bundle
>       creationToken = 1
>
>   [bundle "incr"]
>       uri = incr.bundle
>       creationToken = 2
>
> A fresh clone with the bundle list above would give the expected
> `refs/bundles/main` pointing at `B` in new repository, in other words we
> already had everything locally from the bundles, but git would still
> download everything from server as if we got nothing.
>
> So why the `refs/bundles/main` is not discovered? After some digging I
> found that:
>
> 1. Bundles in bundle list are downloaded to local files via
>    `download_bundle_list` or via `fetch_bundles_by_token` for the
>    creationToken heuristic case.
> 2. Then it tries to unbundle each bundle via `unbundle_from_file`, which
>    is called by `unbundle_all_bundles` or called within
>    `fetch_bundles_by_token` for the creationToken heuristic case.
> 3. Here, we first read the bundle header to get all the prerequisites
>    for the bundle, this is done in `read_bundle_header`.
> 4. Then we call `unbundle`, which calls `verify_bundle` to ensure that
>    the repository does indeed contain the prerequisites mentioned in the
>    bundle.
> 5. The `verify_bundle` will call `parse_object`, within which the
>    `prepare_packed_git` or `reprepare_packed_git` is eventually called,
>    which means that the `raw_object_store->packed_git` data gets filled
>    in and ``packed_git_initialized` is set. This also means consecutive
>    calls to `prepare_packed_git` doesn't re-initiate
>    `raw_object_store->packed_git` since `packed_git_initialized` already
>    is set.
> 6. If `unbundle` succeeds, it writes some refs via `refs_update_ref`
>    with `REF_SKIP_OID_VERIFICATION` set. So the bundle refs which can
>    target arbitrary objects are written to the repository.
> 7. Finally in `do_fetch_pack_v2`, `mark_complete_and_common_ref` and
>    `mark_tips` are called with `OBJECT_INFO_QUICK` set to find local
>    tips. Here it won't call `reprepare_packed_git` anymore so it would
>    fail to parse oids that only reside in the last bundle.
>
> Back to the example above, when unbunding `incr.bundle`, `base.pack` is
> enlisted to `packed_git` bacause of the prerequisites to verify. While

s/bacause/because

[snip]

> diff --git a/bundle-uri.c b/bundle-uri.c
> index 91b3319a5c1..65666a11d9c 100644
> --- a/bundle-uri.c
> +++ b/bundle-uri.c
> @@ -400,8 +400,7 @@ static int unbundle_from_file(struct repository *r, const char *file)
>       refs_update_ref(get_main_ref_store(the_repository),
>               "fetched bundle", bundle_ref.buf, oid,
>               has_old ? &old_oid : NULL,
> -             REF_SKIP_OID_VERIFICATION,
> -             UPDATE_REFS_MSG_ON_ERR);
> +             0, UPDATE_REFS_MSG_ON_ERR);
>   }
>
>   bundle_header_release(&header);
> diff --git a/t/t5558-clone-bundle-uri.sh b/t/t5558-clone-bundle-uri.sh
> index 1ca5f745e73..a5b04d6f187 100755
> --- a/t/t5558-clone-bundle-uri.sh
> +++ b/t/t5558-clone-bundle-uri.sh
> @@ -20,7 +20,10 @@ test_expect_success 'fail to clone from non-bundle file' '
>  test_expect_success 'create bundle' '
>   git init clone-from &&
>   git -C clone-from checkout -b topic &&
> +
>   test_commit -C clone-from A &&
> + git -C clone-from bundle create A.bundle topic &&
> +
>   test_commit -C clone-from B &&
>   git -C clone-from bundle create B.bundle topic
>  '
> @@ -259,6 +262,132 @@ test_expect_success 'clone bundle list (file, any mode, all failures)' '
>   ! grep "refs/bundles/" refs
>  '
>
> +#########################################################################
> +# Clone negotiation related tests begin here
> +
> +test_expect_success 'negotiation: bundle with part of wanted commits' '
> + test_when_finished rm -rf trace*.txt &&
> + GIT_TRACE_PACKET="$(pwd)/trace-packet.txt" \
> + git clone --no-local --bundle-uri="clone-from/A.bundle" \
> +     clone-from nego-bundle-part &&
> + git -C nego-bundle-part for-each-ref --format="%(refname)" >refs &&
> + grep "refs/bundles/" refs >actual &&
> + cat >expect <<-\EOF &&
> + refs/bundles/topic
> + EOF
> + test_cmp expect actual &&
> + # Ensure that refs/bundles/topic are sent as "have".
> + grep "clone> have $(git -C clone-from rev-parse A)" trace-packet.txt
> +'
> +
> +test_expect_success 'negotiation: bundle with all wanted commits' '
> + test_when_finished rm -rf trace*.txt &&
> + GIT_TRACE_PACKET="$(pwd)/trace-packet.txt" \
> + git clone --no-local --single-branch --branch=topic --no-tags \
> +     --bundle-uri="clone-from/B.bundle" \
> +     clone-from nego-bundle-all &&
> + git -C nego-bundle-all for-each-ref --format="%(refname)" >refs &&
> + grep "refs/bundles/" refs >actual &&
> + cat >expect <<-\EOF &&
> + refs/bundles/topic
> + EOF
> + test_cmp expect actual &&
> + # We already have all needed commits so no "want" needed.
> + ! grep "clone> want " trace-packet.txt
> +'
> +
> +test_expect_success 'negotiation: bundle list (no heuristic)' '
> + test_when_finished rm -f trace*.txt &&
> + cat >bundle-list <<-EOF &&
> + [bundle]
> +     version = 1
> +     mode = all
> +
> + [bundle "bundle-1"]
> +     uri = file://$(pwd)/clone-from/bundle-1.bundle
> +
> + [bundle "bundle-2"]
> +     uri = file://$(pwd)/clone-from/bundle-2.bundle
> + EOF
> +
> + GIT_TRACE_PACKET="$(pwd)/trace-packet.txt" \
> + git clone --no-local --bundle-uri="file://$(pwd)/bundle-list" \
> +     clone-from nego-bundle-list-no-heuristic &&
> +
> + git -C nego-bundle-list-no-heuristic for-each-ref --format="%(refname)" >refs &&
> + grep "refs/bundles/" refs >actual &&
> + cat >expect <<-\EOF &&
> + refs/bundles/base
> + refs/bundles/left
> + EOF
> + test_cmp expect actual &&
> + grep "clone> have $(git -C nego-bundle-list-no-heuristic rev-parse refs/bundles/left)" trace-packet.txt
> +'
> +
> +test_expect_success 'negotiation: bundle list (creationToken)' '
> + test_when_finished rm -f trace*.txt &&
> + cat >bundle-list <<-EOF &&
> + [bundle]
> +     version = 1
> +     mode = all
> +     heuristic = creationToken
> +
> + [bundle "bundle-1"]
> +     uri = file://$(pwd)/clone-from/bundle-1.bundle
> +     creationToken = 1
> +
> + [bundle "bundle-2"]
> +     uri = file://$(pwd)/clone-from/bundle-2.bundle
> +     creationToken = 2
> + EOF
> +
> + GIT_TRACE_PACKET="$(pwd)/trace-packet.txt" \
> + git clone --no-local --bundle-uri="file://$(pwd)/bundle-list" \
> +     clone-from nego-bundle-list-heuristic &&
> +
> + git -C nego-bundle-list-heuristic for-each-ref --format="%(refname)" >refs &&
> + grep "refs/bundles/" refs >actual &&
> + cat >expect <<-\EOF &&
> + refs/bundles/base
> + refs/bundles/left
> + EOF
> + test_cmp expect actual &&
> + grep "clone> have $(git -C nego-bundle-list-heuristic rev-parse refs/bundles/left)" trace-packet.txt
> +'
> +
> +test_expect_success 'negotiation: bundle list with all wanted commits' '
> + test_when_finished rm -f trace*.txt &&
> + cat >bundle-list <<-EOF &&
> + [bundle]
> +     version = 1
> +     mode = all
> +     heuristic = creationToken
> +
> + [bundle "bundle-1"]
> +     uri = file://$(pwd)/clone-from/bundle-1.bundle
> +     creationToken = 1
> +
> + [bundle "bundle-2"]
> +     uri = file://$(pwd)/clone-from/bundle-2.bundle
> +     creationToken = 2
> + EOF
> +
> + GIT_TRACE_PACKET="$(pwd)/trace-packet.txt" \
> + git clone --no-local --single-branch --branch=left --no-tags \
> +     --bundle-uri="file://$(pwd)/bundle-list" \
> +     clone-from nego-bundle-list-all &&
> +
> + git -C nego-bundle-list-all for-each-ref --format="%(refname)" >refs &&
> + grep "refs/bundles/" refs >actual &&
> + cat >expect <<-\EOF &&
> + refs/bundles/base
> + refs/bundles/left
> + EOF
> + test_cmp expect actual &&
> + # We already have all needed commits so no "want" needed.
> + ! grep "clone> want " trace-packet.txt
> +'
> +
>  #########################################################################
>  # HTTP tests begin here
>
>
> base-commit: d8ab1d464d07baa30e5a180eb33b3f9aa5c93adf
> --
> gitgitgadget

This update looks good to me, Thanks.
blanet commented 3 months ago

/preview

gitgitgadget[bot] commented 3 months ago

Preview email sent as pull.1730.v3.git.1716816589.gitgitgadget@gmail.com

blanet commented 3 months ago

/preview

gitgitgadget[bot] commented 3 months ago

Preview email sent as pull.1730.v3.git.1716824184.gitgitgadget@gmail.com

blanet commented 3 months ago

/preview

gitgitgadget[bot] commented 3 months ago

Preview email sent as pull.1730.v3.git.1716824408.gitgitgadget@gmail.com

blanet commented 3 months ago

/submit

gitgitgadget[bot] commented 3 months ago

Submitted as pull.1730.v3.git.1716824518.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1730/blanet/xx/bundle-uri-bug-using-bundle-list-v3

To fetch this version to local tag pr-1730/blanet/xx/bundle-uri-bug-using-bundle-list-v3:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1730/blanet/xx/bundle-uri-bug-using-bundle-list-v3
gitgitgadget[bot] commented 3 months ago

On the Git mailing list, "Xing Xin" wrote (reply to this):

At 2024-05-21 01:19:02, "Junio C Hamano" <gitster@pobox.com> wrote:
>"Xing Xin" <bupt_xingxin@163.com> writes:
>
>> Personally I think data from bundles and data received via network
>> should be treated equally.
>
>Yup, that is not personal ;-) but is universally accepted as a good
>discipline.  In the case of bundle-uri, the bundle came over the
>network so it is even more true that they should be treated the
>same.
>
>> For "fetch-pack" we now have some configs
>> such as  "fetch.fsckobjects" and "transfer.fsckobjects" to decide the
>> behavior, these configs are invisible when we are fetching bundles.
>
>When fetching over network, transport.c:fetch_refs_via_pack() calls
>fetch_pack.c:fetch_pack(), which eventually calls get_pack() and the
>configuration variables are honored there.  It appears that the
>transport layer is unaware of the .fsckobjects configuration knobs.
>
>When fetching from a bundle, transport.c:fetch_refs_from_bundle()
>calls bundle.c:unbundle().  This function has three callers, i.e.
>"git bundle unbundle", normal fetching from a bundle, and more
>recently added bundle-uri codepaths.  
>
>I think one reasonable approach to take is to add an extra parameter
>that takes one of three values: (never, use-config, always), and
>conditionally add "--fsck-objects" to the command line of the
>index-pack.  Teach "git bundle unbundle" the "--fsck-objects" option
>so that it can pass 'never' or 'always' from the command line, and
>pass 'use-config' from the code paths for normal fetching from a
>budnle and bundle-uri.
>
>To implement use-config, you'd probably need to refactor a small
>part of fetch-pack.c:get_pack()
>
>   if (fetch_fsck_objects >= 0
>       ? fetch_fsck_objects
>       : transfer_fsck_objects >= 0
>       ? transfer_fsck_objects
>       : 0)
>       fsck_objects = 1;
>
>into a public function (to support a caller like unbundle() that
>comes from sideways, the new function may also need to call
>fetch_pack_setup() to prime them).
>
>A patch series may take a structure like so:
>
> * define enum { UNBUNDLE_FSCK_NEVER, UNBUNDLE_FSCK_ALWAYS } in
>   bundle.h, have bundle.c:unbundle() accept a new parameter of that
>   type, and conditionally add "--fsck-objects" to its call to
>   "index-pack".  "git bundle unbundle" can pass 'never' to its
>   invocation to unbundle() as an easy way to test it.  For the
>   other two callers, we can start by passing 'always'.
>
> * (optional) teach "git bundle unbundle" a new "--fsck-objects"
>   option to allow passing 'always' to its call to unbundle().  With
>   that, add tests to feed it a bundle with questionable objects in
>   it and make sure that unbundling notices.

I just submitted a new series mainly focusing on the unbundle handling during
fetches. I would like to submit a new one for teaching "git bundle unbundle" a
"--fsck-objects" option after this to make changes more targeted.

> * refactor fetch-pack.c:get_pack() to make the fetch-then-transfer
>   configuration logic available to external callers.
>
> * Add UNBUNDLE_FSCK_USE_CONFIG to the enum, enhance unbundle() to

I tend to use `UNBUNDLE_FSCK_FOLLOW_FETCH` because this option is only
used in fetches, though the current implementation is indeed reading configs.

>   react to the value by calling the helper function you introduced
>   in the previous step.

The new patch series is constructed right as you suggested, thanks a lot for
your help. 

Xing Xin
blanet commented 3 months ago

/preview

gitgitgadget[bot] commented 3 months ago

Preview email sent as pull.1730.v4.git.1717057012.gitgitgadget@gmail.com

blanet commented 3 months ago

/submit

gitgitgadget[bot] commented 3 months ago

Submitted as pull.1730.v4.git.1717057290.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1730/blanet/xx/bundle-uri-bug-using-bundle-list-v4

To fetch this version to local tag pr-1730/blanet/xx/bundle-uri-bug-using-bundle-list-v4:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1730/blanet/xx/bundle-uri-bug-using-bundle-list-v4
blanet commented 3 months ago

/preview

gitgitgadget[bot] commented 3 months ago

Preview email sent as pull.1730.v5.git.1718085353.gitgitgadget@gmail.com

blanet commented 3 months ago

/submit

gitgitgadget[bot] commented 3 months ago

Submitted as pull.1730.v5.git.1718088126.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1730/blanet/xx/bundle-uri-bug-using-bundle-list-v5

To fetch this version to local tag pr-1730/blanet/xx/bundle-uri-bug-using-bundle-list-v5:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1730/blanet/xx/bundle-uri-bug-using-bundle-list-v5
blanet commented 3 months ago

/submit

gitgitgadget[bot] commented 3 months ago

Submitted as pull.1730.v6.git.1718109943.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1730/blanet/xx/bundle-uri-bug-using-bundle-list-v6

To fetch this version to local tag pr-1730/blanet/xx/bundle-uri-bug-using-bundle-list-v6:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1730/blanet/xx/bundle-uri-bug-using-bundle-list-v6
gitgitgadget[bot] commented 3 months ago

On the Git mailing list, Patrick Steinhardt wrote (reply to this):

On Tue, Jun 11, 2024 at 12:45:40PM +0000, blanet via GitGitGadget wrote:
> While attempting to fix a reference negotiation bug in bundle-uri, we
> identified that the fetch process lacks some crucial object validation
> checks when processing bundles. The primary issues are:
> 
>  1. In the bundle-uri scenario, object IDs were not validated before writing
>     bundle references. This was the root cause of the original negotiation
>     bug in bundle-uri and could lead to potential repository corruption.
>  2. The existing "fetch.fsckObjects" and "transfer.fsckObjects"
>     configurations were not applied when directly fetching bundles or
>     fetching with bundle-uri enabled. In fact, there were no object
>     validation supports for unbundle.
> 
> The first patch addresses the bundle-uri negotiation issue by removing the
> REF_SKIP_OID_VERIFICATION flag when writing bundle references.
> 
> Patches 2 through 3 extend verify_bundle_flags for bundle.c:unbundle to add
> support for object validation (fsck) in fetch scenarios, mainly following
> the suggestions from Junio and Patrick on the mailing list.

Thanks, this version looks good to me.

Patrick
gitgitgadget[bot] commented 2 months ago

This branch is now known as xx/bundie-uri-fixes.

gitgitgadget[bot] commented 2 months ago

This patch series was integrated into seen via https://github.com/git/git/commit/7ae269d454c227db7b531943fadd2d79cafdedde.

gitgitgadget[bot] commented 2 months ago

There was a status update in the "New Topics" section about the branch xx/bundie-uri-fixes on the Git mailing list:

When bundleURI interface fetches multiple bundles, Git failed to
take full advantage of all bundles and ended up slurping duplicated
objects.

Will merge to 'next'?
cf. <xmqqa5jruwkr.fsf@gitster.g>
cf. <xmqqsexjtfcg.fsf@gitster.g>
source: <pull.1730.v6.git.1718109943.gitgitgadget@gmail.com>
gitgitgadget[bot] commented 2 months ago

This patch series was integrated into seen via https://github.com/git/git/commit/3739a48881f0bfaf17256dd1af3c71b31c0115b5.

gitgitgadget[bot] commented 2 months ago

There was a status update in the "Cooking" section about the branch xx/bundie-uri-fixes on the Git mailing list:

When bundleURI interface fetches multiple bundles, Git failed to
take full advantage of all bundles and ended up slurping duplicated
objects.

Expecting a reroll.
cf. <xmqqa5jruwkr.fsf@gitster.g>
cf. <xmqqsexjtfcg.fsf@gitster.g>
source: <pull.1730.v6.git.1718109943.gitgitgadget@gmail.com>
gitgitgadget[bot] commented 2 months ago

This patch series was integrated into seen via https://github.com/git/git/commit/384bfa3daa16a5ff875c8302e80b499d3cbcc6b6.

gitgitgadget[bot] commented 2 months ago

This patch series was integrated into seen via https://github.com/git/git/commit/7338dd8d449bd01d2cb428713f06b1c25324ecd5.

gitgitgadget[bot] commented 2 months ago

This patch series was integrated into seen via https://github.com/git/git/commit/85f24287497c67e1a3e2f5b0e13ac1325de9a9e7.

gitgitgadget[bot] commented 2 months ago

This patch series was integrated into seen via https://github.com/git/git/commit/2065c0b6ca4f66527706b2942de44232abb9ec64.

gitgitgadget[bot] commented 2 months ago

There was a status update in the "Cooking" section about the branch xx/bundie-uri-fixes on the Git mailing list:

When bundleURI interface fetches multiple bundles, Git failed to
take full advantage of all bundles and ended up slurping duplicated
objects.

Expecting a reroll.
cf. <xmqqa5jruwkr.fsf@gitster.g>
cf. <xmqqsexjtfcg.fsf@gitster.g>
source: <pull.1730.v6.git.1718109943.gitgitgadget@gmail.com>
blanet commented 2 months ago

/preview

gitgitgadget[bot] commented 2 months ago

Preview email sent as pull.1730.v7.git.1718627713.gitgitgadget@gmail.com

blanet commented 2 months ago

/submit

gitgitgadget[bot] commented 2 months ago

Submitted as pull.1730.v7.git.1718632535.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1730/blanet/xx/bundle-uri-bug-using-bundle-list-v7

To fetch this version to local tag pr-1730/blanet/xx/bundle-uri-bug-using-bundle-list-v7:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1730/blanet/xx/bundle-uri-bug-using-bundle-list-v7