gitgitgadget / git

GitGitGadget's Git fork. Open Pull Requests here to submit them to the Git mailing list
https://gitgitgadget.github.io/
Other
208 stars 135 forks source link

sparse-index: improve clear_skip_worktree_from_present_files() #1754

Closed derrickstolee closed 2 months ago

derrickstolee commented 2 months ago

While doing some investigation in a private monorepo with sparse-checkout and a sparse index, I accidentally left a modified file outside of my sparse-checkout cone. This caused my Git commands to slow to a crawl, so I reran with GIT_TRACE2_PERF=1.

While I was able to identify clear_skip_worktree_from_present_files() as the culprit, it took longer than desired to figure out what was going on. This series intends to both fix the performance issue (as much as possible) and do some refactoring to make it easier to understand what is happening.

In the end, I was able to reduce the number of lstat() calls in my case from over 1.1 million to about 4,400, improving the time from 13.4s to 81ms on a warm disk cache. (These numbers are from a test after v2, which somehow hit the old caching algorithm even worse than my test in v1.)

Updates in v3

Thanks, Stolee

cc: gitster@pobox.com cc: newren@gmail.com cc: anh@canva.com

gitgitgadget[bot] commented 2 months ago

There are issues in commit 15d23deb6b0da0b70390da441dc78f2d7e0e8495: sparse-index: refactor skip worktree retry logic Lines in the body of the commit messages should be wrapped between 60 and 76 characters. Indented lines, and lines without whitespace, are exempt

derrickstolee commented 2 months ago

There are issues in commit 15d23de: sparse-index: refactor skip worktree retry logic Lines in the body of the commit messages should be wrapped between 60 and 76 characters. Indented lines, and lines without whitespace, are exempt

I'm ignoring this because the lines are from trace2 output, so wrapping would not be the right thing to do in this case.

gitgitgadget[bot] commented 2 months ago

There are issues in commit bbd70d3245def510696114a20c420b360182175f: sparse-index: refactor skip worktree retry logic Lines in the body of the commit messages should be wrapped between 60 and 76 characters. Indented lines, and lines without whitespace, are exempt

gitgitgadget[bot] commented 2 months ago

There are issues in commit 1247aeb9b4ce1a6bbaba40bac38327429bd150e5: sparse-index: refactor skip worktree retry logic Lines in the body of the commit messages should be wrapped between 60 and 76 characters. Indented lines, and lines without whitespace, are exempt

derrickstolee commented 2 months ago

/submit

gitgitgadget[bot] commented 2 months ago

There are issues in commit 1247aeb9b4ce1a6bbaba40bac38327429bd150e5: sparse-index: refactor skip worktree retry logic Lines in the body of the commit messages should be wrapped between 60 and 76 characters. Indented lines, and lines without whitespace, are exempt

derrickstolee commented 2 months ago

/submit

gitgitgadget[bot] commented 2 months ago

Submitted as pull.1754.git.1718899877.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1754/derrickstolee/clear-skip-speed-v1

To fetch this version to local tag pr-1754/derrickstolee/clear-skip-speed-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1754/derrickstolee/clear-skip-speed-v1
gitgitgadget[bot] commented 2 months ago

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> While doing some investigation in a private monorepo with sparse-checkout
> and a sparse index, I accidentally left a modified file outside of my
> sparse-checkout cone. This caused my Git commands to slow to a crawl, so I
> reran with GIT_TRACE2_PERF=1.
>
> While I was able to identify clear_skip_worktree_from_present_files() as the
> culprit, it took longer than desired to figure out what was going on. This
> series intends to both fix the performance issue (as much as possible) and
> do some refactoring to make it easier to understand what is happening.
>
> In the end, I was able to reduce the number of lstat() calls in my case from
> over 170,000 to about 6,500, improving the time from 2.5s to 71ms on a warm
> disk cache.   Thanks, Stolee

That's impressive but I cannot offhand tell how big 170k (or 6.5k
for that matter) is relative to the size of the tree.  How many
paths are there in the entire tree (i.e. "git ls-tree -r HEAD | wc
-l") vs the number of the in-cone paths in the working tree?  

If 6.5k is in the same ballpark as the latter, it would be really
good.

Thanks.
gitgitgadget[bot] commented 2 months ago

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 6/20/24 3:16 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> >> While doing some investigation in a private monorepo with sparse-checkout
>> and a sparse index, I accidentally left a modified file outside of my
>> sparse-checkout cone. This caused my Git commands to slow to a crawl, so I
>> reran with GIT_TRACE2_PERF=1.
>>
>> While I was able to identify clear_skip_worktree_from_present_files() as the
>> culprit, it took longer than desired to figure out what was going on. This
>> series intends to both fix the performance issue (as much as possible) and
>> do some refactoring to make it easier to understand what is happening.
>>
>> In the end, I was able to reduce the number of lstat() calls in my case from
>> over 170,000 to about 6,500, improving the time from 2.5s to 71ms on a warm
>> disk cache.   Thanks, Stolee
> > That's impressive but I cannot offhand tell how big 170k (or 6.5k
> for that matter) is relative to the size of the tree.  How many
> paths are there in the entire tree (i.e. "git ls-tree -r HEAD | wc
> -l") vs the number of the in-cone paths in the working tree?
> > If 6.5k is in the same ballpark as the latter, it would be really
> good.

You're right, I didn't include the full context here. The repo has
about 2.1 million paths at HEAD, but most of them are sparse.

In Patch 5, I detail that there are 1,841,997 total sparse files in
the expanded index. Thus, the previous caching algorithm was already
doing decent work and calling lstat() 11x fewer times than the naive
implementation.

The new caching algorithm improves this to 6,521, which is a 282x
improvement over naive and and 26x improvement over the previous
caching algorithm.

But what you are really asking is how close this is to the optimal.
I didn't include that in Patch 5 details, but I was able to look at
my notes and see that the sparse_path_count data point was 1,962,
meaning there are that many sparse trees in the sparse index before
expanding. Thus, the 6,521 lstat() calls are 3.3x more than the
absolute minimum required.

Does that help answer the questions you had? I'm happy to provide
more information.

Thanks,
-Stolee
gitgitgadget[bot] commented 2 months ago

On the Git mailing list, Junio C Hamano wrote (reply to this):

Derrick Stolee <stolee@gmail.com> writes:

> But what you are really asking is how close this is to the optimal.
> I didn't include that in Patch 5 details, but I was able to look at
> my notes and see that the sparse_path_count data point was 1,962,
> meaning there are that many sparse trees in the sparse index before
> expanding. Thus, the 6,521 lstat() calls are 3.3x more than the
> absolute minimum required.

Nice, and still impressive.  Thanks.
gitgitgadget[bot] commented 2 months ago

This branch is now known as ds/sparse-lstat-caching.

gitgitgadget[bot] commented 2 months ago

This patch series was integrated into seen via https://github.com/git/git/commit/f368eabff65193bc82f153cf160ff1ae7cd7e1cc.

gitgitgadget[bot] commented 2 months ago

There was a status update in the "New Topics" section about the branch ds/sparse-lstat-caching on the Git mailing list:

The code to deal with modified paths that are out-of-cone in a
sparsely checked out working tree has been optimized.

Needs review.
source: <pull.1754.git.1718899877.gitgitgadget@gmail.com>
gitgitgadget[bot] commented 2 months ago

This patch series was integrated into seen via https://github.com/git/git/commit/52a439864c2a2d499925242692ec0854b8bff783.

gitgitgadget[bot] commented 2 months ago

This patch series was integrated into seen via https://github.com/git/git/commit/bc3e4f348c9a5cb510f70378dc9fe09467906287.

gitgitgadget[bot] commented 2 months ago

There was a status update in the "Cooking" section about the branch ds/sparse-lstat-caching on the Git mailing list:

The code to deal with modified paths that are out-of-cone in a
sparsely checked out working tree has been optimized.

Needs review.
source: <pull.1754.git.1718899877.gitgitgadget@gmail.com>
gitgitgadget[bot] commented 2 months ago

This patch series was integrated into seen via https://github.com/git/git/commit/ee4dc1f744e78456fefcf359964fe22c58730df7.

gitgitgadget[bot] commented 2 months ago

This patch series was integrated into seen via https://github.com/git/git/commit/73eb58ebf64bd5c9147c6518de8013b4de112d6f.

derrickstolee commented 2 months ago

/submit

gitgitgadget[bot] commented 2 months ago

Submitted as pull.1754.v2.git.1719412192.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1754/derrickstolee/clear-skip-speed-v2

To fetch this version to local tag pr-1754/derrickstolee/clear-skip-speed-v2:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1754/derrickstolee/clear-skip-speed-v2
gitgitgadget[bot] commented 2 months ago

This patch series was integrated into seen via https://github.com/git/git/commit/a4bc0c5769022a4711150a1bd25e8c533399d92a.

gitgitgadget[bot] commented 2 months ago

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Updates in v2
> =============
>
> Thanks to Elijah for a thorough review, leading to valuable improvements.
>
>  * I was mistaken that the sparse index was required for this logic to
>    happen. This has changed several descriptions across the commit messages.
>  * The final lstat() in path_found() was not needed, so is removed in v2.
>    This saves even more time and lstat() calls, updating the stats.
>  * Elijah created a particularly nasty example for testing, which I include
>    in my final patch. He gets a "Helped-by" credit for this.
>  * Several comments, variables, and other improvements based on Elijah's
>    recommendations.
>
> Thanks, Stolee

Thanks, both.  This round was a pleasant read.

Let's mark it for 'next' soonish.
gitgitgadget[bot] commented 2 months ago

This patch series was integrated into seen via https://github.com/git/git/commit/7ab61c5d560897e390cb46186a44944b0fd0508e.

gitgitgadget[bot] commented 2 months ago

On the Git mailing list, Elijah Newren wrote (reply to this):

On Thu, Jun 27, 2024 at 2:46 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > Updates in v2
> > =============
> >
> > Thanks to Elijah for a thorough review, leading to valuable improvements.
> >
> >  * I was mistaken that the sparse index was required for this logic to
> >    happen. This has changed several descriptions across the commit messages.
> >  * The final lstat() in path_found() was not needed, so is removed in v2.
> >    This saves even more time and lstat() calls, updating the stats.
> >  * Elijah created a particularly nasty example for testing, which I include
> >    in my final patch. He gets a "Helped-by" credit for this.
> >  * Several comments, variables, and other improvements based on Elijah's
> >    recommendations.
> >
> > Thanks, Stolee
>
> Thanks, both.  This round was a pleasant read.
>
> Let's mark it for 'next' soonish.

Yeah, this version looks pretty good.  There was a possible further
improvement you suggested, but I think that wouldn't need to hold up
this series.

However, there is one paragraph from the commit message of 1/5 that
I'd like to see stricken first (or rewritten).  Once that's done, I
think this series is ready for next.
gitgitgadget[bot] commented 2 months ago

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 6/27/24 8:59 PM, Elijah Newren wrote:
> On Thu, Jun 27, 2024 at 2:46 PM Junio C Hamano <gitster@pobox.com> wrote:
>>
>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>
>>> Updates in v2
>>> =============
>>>
>>> Thanks to Elijah for a thorough review, leading to valuable improvements.
>>>
>>>   * I was mistaken that the sparse index was required for this logic to
>>>     happen. This has changed several descriptions across the commit messages.
>>>   * The final lstat() in path_found() was not needed, so is removed in v2.
>>>     This saves even more time and lstat() calls, updating the stats.
>>>   * Elijah created a particularly nasty example for testing, which I include
>>>     in my final patch. He gets a "Helped-by" credit for this.
>>>   * Several comments, variables, and other improvements based on Elijah's
>>>     recommendations.
>>>
>>> Thanks, Stolee
>>
>> Thanks, both.  This round was a pleasant read.
>>
>> Let's mark it for 'next' soonish.
> > Yeah, this version looks pretty good.  There was a possible further
> improvement you suggested, but I think that wouldn't need to hold up
> this series.
> > However, there is one paragraph from the commit message of 1/5 that
> I'd like to see stricken first (or rewritten).  Once that's done, I
> think this series is ready for next.

I agree about the removal of that paragraph, but also there is a
"largest" to "longest" replacement in a few comments that makes sense
to change.

I can send a v3 with these changes.

Thanks,
-Stolee
derrickstolee commented 2 months ago

/submit

gitgitgadget[bot] commented 2 months ago

Submitted as pull.1754.v3.git.1719578605.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1754/derrickstolee/clear-skip-speed-v3

To fetch this version to local tag pr-1754/derrickstolee/clear-skip-speed-v3:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1754/derrickstolee/clear-skip-speed-v3
gitgitgadget[bot] commented 2 months ago

On the Git mailing list, Elijah Newren wrote (reply to this):

On Fri, Jun 28, 2024 at 5:43 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> While doing some investigation in a private monorepo with sparse-checkout
> and a sparse index, I accidentally left a modified file outside of my
> sparse-checkout cone. This caused my Git commands to slow to a crawl, so I
> reran with GIT_TRACE2_PERF=1.
>
> While I was able to identify clear_skip_worktree_from_present_files() as the
> culprit, it took longer than desired to figure out what was going on. This
> series intends to both fix the performance issue (as much as possible) and
> do some refactoring to make it easier to understand what is happening.
>
> In the end, I was able to reduce the number of lstat() calls in my case from
> over 1.1 million to about 4,400, improving the time from 13.4s to 81ms on a
> warm disk cache. (These numbers are from a test after v2, which somehow hit
> the old caching algorithm even worse than my test in v1.)
>
>
> Updates in v3
> =============
>
>  * Removed the incorrect paragraph in the commit message of patch 1.
>  * Replaced "largest" with "longest" in the final patch.
>
> Thanks, Stolee
>
> Derrick Stolee (5):
>   sparse-checkout: refactor skip worktree retry logic
>   sparse-index: refactor path_found()
>   sparse-index: use strbuf in path_found()
>   sparse-index: count lstat() calls
>   sparse-index: improve lstat caching of sparse paths
>
>  sparse-index.c | 216 +++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 164 insertions(+), 52 deletions(-)
>
>
> base-commit: 66ac6e4bcd111be3fa9c2a6b3fafea718d00678d
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1754%2Fderrickstolee%2Fclear-skip-speed-v3
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1754/derrickstolee/clear-skip-speed-v3
> Pull-Request: https://github.com/gitgitgadget/git/pull/1754
>
> Range-diff vs v2:
>
>  1:  93d0baed0b0 ! 1:  0844cda94cf sparse-checkout: refactor skip worktree retry logic
>      @@ Commit message
>           stored in the index, so caching was introduced in d79d299352 (Accelerate
>           clear_skip_worktree_from_present_files() by caching, 2022-01-14).
>
>      -    If users are having trouble with the performance of this operation and
>      -    don't care about paths outside of the sparse-checkout, they can disable
>      -    them using the sparse.expectFilesOutsideOfPatterns config option
>      -    introduced in ecc7c8841d (repo_read_index: add config to expect files
>      -    outside sparse patterns, 2022-02-25).
>      -
>           This check is particularly confusing in the presence of a sparse index,
>           as a sparse tree entry corresponding to an existing directory must first
>           be expanded to a full index before examining the paths within. This is
>  2:  69c3beaabf7 = 2:  c242e2c9168 sparse-index: refactor path_found()
>  3:  0a82e6b4183 = 3:  ad63bf746ca sparse-index: use strbuf in path_found()
>  4:  9549f5b8062 = 4:  db6ded0df0d sparse-index: count lstat() calls
>  5:  0cb344ac14f ! 5:  1f58e19691f sparse-index: improve lstat caching of sparse paths
>      @@ sparse-index.c: static void clear_path_found_data(struct path_found_data *data)
>        }
>
>       +/**
>      -+ * Return the length of the largest common substring that ends in a
>      -+ * slash ('/') to indicate the largest common parent directory. Returns
>      ++ * Return the length of the longest common substring that ends in a
>      ++ * slash ('/') to indicate the longest common parent directory. Returns
>       + * zero if no common directory exists.
>       + */
>       +static size_t max_common_dir_prefix(const char *path1, const char *path2)
>
> --
> gitgitgadget

This version covers the last two outstanding items.

Reviewed-by: Elijah Newren <newren@gmail.com>
gitgitgadget[bot] commented 2 months ago

On the Git mailing list, Junio C Hamano wrote (reply to this):

Elijah Newren <newren@gmail.com> writes:

> This version covers the last two outstanding items.
>
> Reviewed-by: Elijah Newren <newren@gmail.com>

Thanks both.  Will mark it for 'next'.
gitgitgadget[bot] commented 2 months ago

This patch series was integrated into seen via https://github.com/git/git/commit/2039cf1924f5f06acfbfc380cabffed78620dc30.

gitgitgadget[bot] commented 2 months ago

This patch series was integrated into next via https://github.com/git/git/commit/9efaea181d5b688a5e3755bd130c146af73a1e61.

gitgitgadget[bot] commented 2 months ago

There was a status update in the "Cooking" section about the branch ds/sparse-lstat-caching on the Git mailing list:

The code to deal with modified paths that are out-of-cone in a
sparsely checked out working tree has been optimized.

Will merge to 'master'.
cf. <CABPp-BFd7Bk68Omdao5LS0sP5bK1WQ7V6dodB5x8EsncNARxNA@mail.gmail.com>
source: <pull.1754.v3.git.1719578605.gitgitgadget@gmail.com>
gitgitgadget[bot] commented 2 months ago

This patch series was integrated into seen via https://github.com/git/git/commit/042c457560220cf1ac102664044c251ebbac800e.

gitgitgadget[bot] commented 2 months ago

There was a status update in the "Cooking" section about the branch ds/sparse-lstat-caching on the Git mailing list:

The code to deal with modified paths that are out-of-cone in a
sparsely checked out working tree has been optimized.

Will merge to 'master'.
cf. <CABPp-BFd7Bk68Omdao5LS0sP5bK1WQ7V6dodB5x8EsncNARxNA@mail.gmail.com>
source: <pull.1754.v3.git.1719578605.gitgitgadget@gmail.com>
gitgitgadget[bot] commented 2 months ago

This patch series was integrated into seen via https://github.com/git/git/commit/a43b001cce61161f838387e18ee5acdb73b17493.

gitgitgadget[bot] commented 2 months ago

This patch series was integrated into master via https://github.com/git/git/commit/a43b001cce61161f838387e18ee5acdb73b17493.

gitgitgadget[bot] commented 2 months ago

This patch series was integrated into next via https://github.com/git/git/commit/a43b001cce61161f838387e18ee5acdb73b17493.

gitgitgadget[bot] commented 2 months ago

Closed via a43b001cce61161f838387e18ee5acdb73b17493.