git-for-windows / git

A fork of Git containing Windows-specific patches.
http://gitforwindows.org/

Slow checkout when repository contains a large number of symlinks #4059

Open lin-xianming opened 2 years ago

lin-xianming commented 2 years ago

Setup

$ git --version --build-options

git version 2.38.0.windows.1
cpu: x86_64
built from commit: 0355083fbe5582f6d3f819afc130ed2f2375e0bd
sizeof-long: 4
sizeof-size_t: 8
shell-path: /bin/sh
feature: fsmonitor--daemon
$ cmd.exe /c ver

Microsoft Windows [Version 10.0.19043.2006]
# One of the following:
> type "C:\Program Files\Git\etc\install-options.txt"
> type "C:\Program Files (x86)\Git\etc\install-options.txt"
> type "%USERPROFILE%\AppData\Local\Programs\Git\etc\install-options.txt"
> type "$env:USERPROFILE\AppData\Local\Programs\Git\etc\install-options.txt"
$ cat /etc/install-options.txt

Editor Option: VIM
Custom Editor Path:
Default Branch Option: master
Path Option: BashOnly
SSH Option: OpenSSH
Tortoise Option: false
CURL Option: OpenSSL
CRLF Option: CRLFCommitAsIs
Bash Terminal Option: MinTTY
Git Pull Behavior Option: Merge
Use Credential Manager: Disabled
Performance Tweaks FSCache: Enabled
Enable Symlinks: Enabled
Enable Pseudo Console Support: Disabled
Enable FSMonitor: Disabled

Windows developer mode is enabled so symlinks can be created without privilege elevation.

Details

Bash

git clone http://psydata.ovgu.de/forrest_gump/.git

Expected: the repository is cloned and the working tree checked out in a reasonable amount of time.

Actual: the repository was cloned and checkout began normally, but after a few seconds it slowed to 5-20 files per second with one CPU core fully loaded. There are 14693 files to check out in total, and about 7000 remained when the slowdown began. At 20 files per second, the rest would have taken about 5.8 more minutes. The same clone and checkout on Linux took less than 5 seconds.

For example, http://psydata.ovgu.de/forrest_gump/.git. This seems to affect any git-annex repository with a large number of symlinks.

dscho commented 2 years ago

This behavior is most likely due to the extra checks we have to go through (see 6ad3d3db7372717de578088ce65f6262c37ec20c) to determine the type of the symlink. On Linux/Unix, symlinks do not distinguish between file targets and directory targets, but on Windows they do. Since too much of Git still assumes Linux semantics, we have to work extra hard around that.

Is there an easy way to figure out the types of the symlinks contained in your repository? If so, it might make sense to declare them (either in a .gitattributes file that is contained in the repository, or in a .git/info/attributes file that is local to your checkout; in the latter case you will want to clone with --no-checkout, then initialize that file, then call git checkout <branch>).
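
The second option above (a local .git/info/attributes file written before the first checkout) might look like the sketch below. It uses a throwaway local repository as a hypothetical stand-in for the real one so that it can be run anywhere. The `* symlink=file` pattern assumes that every symlink in the repository points at a file, which would need to be verified for a real repository. The `symlink` attribute is only interpreted by Git for Windows; other builds record it but ignore it.

```shell
set -eu
work=$(mktemp -d)

# Hypothetical stand-in for the real repository: one commit with a dangling symlink.
git init -q "$work/origin"
(
  cd "$work/origin"
  ln -s ../missing-target link
  git add link
  git -c user.name=demo -c user.email=demo@example.com commit -q -m "add symlink"
)

# Clone without populating the working tree.
git clone -q --no-checkout "$work/origin" "$work/clone"
cd "$work/clone"

# Declare the symlink type locally, before anything is checked out.
mkdir -p .git/info
printf '* symlink=file\n' >> .git/info/attributes

# Populate the working tree on the branch that HEAD already points at.
git checkout -q "$(git symbolic-ref --short HEAD)"

# Show which attribute value Git resolves for the symlink.
git check-attr symlink -- link
```

For the repository from this report, the clone URL and the branch name would replace the stand-in repository and the `git symbolic-ref` lookup.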

An alternative, if that is not a viable approach, would be to perform a parallelized checkout that uses all of your CPU's cores (or uses an even higher number if the operation is I/O bound).
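
A parallelized checkout can be requested through the `checkout.workers` setting; the sketch below passes it at clone time. The local repository is a hypothetical stand-in so the commands can be run without network access. One caveat, stated as my understanding rather than something established in this thread: Git's parallel checkout appears to hand only regular files to the workers and to write symlinks sequentially, so it may not help much with a symlink-heavy repository.

```shell
set -eu
work=$(mktemp -d)

# Hypothetical stand-in repository with a handful of regular files.
git init -q "$work/origin"
(
  cd "$work/origin"
  i=1
  while [ "$i" -le 20 ]; do
    echo "content $i" > "file$i.txt"
    i=$((i + 1))
  done
  git add .
  git -c user.name=demo -c user.email=demo@example.com commit -q -m "add files"
)

# A checkout.workers value below 1 asks Git to use one worker per logical core.
# Passing -c to clone stores the setting in the new repository before files are checked out.
git clone -q -c checkout.workers=0 "$work/origin" "$work/clone"

# Confirm the setting that the clone recorded.
git -C "$work/clone" config checkout.workers
```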

lin-xianming commented 2 years ago

Extra checks for each symlink do not explain why checkout speed slows down over time. I'm testing on an SSD and there are no disk bottlenecks. With symlink=file, there are no differences in checkout speed. With checkout.workers=-1, a second core was briefly loaded before checkout speed slowed down to the same as before and only a single core was loaded.

dscho commented 1 year ago

> With symlink=file, there are no differences in checkout speed.

Hmm. That's funny. Could you investigate further, e.g. by instrumenting the code with Trace2 statements?
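
Before adding new Trace2 statements to the C code, it may be worth capturing the Trace2 output that Git already emits. The sketch below is one way to do that, assuming the `GIT_TRACE2_PERF` environment variable is set to an absolute file path. The local repository and the log file name are hypothetical stand-ins.

```shell
set -eu
work=$(mktemp -d)

# Hypothetical stand-in for the real repository: one commit with a dangling symlink.
git init -q "$work/origin"
(
  cd "$work/origin"
  ln -s ../missing-target link
  git add link
  git -c user.name=demo -c user.email=demo@example.com commit -q -m "add symlink"
)

# With an absolute path as its value, GIT_TRACE2_PERF makes Git append
# performance-format trace events (with timing columns) to that file.
trace="$work/trace2-perf.log"
GIT_TRACE2_PERF="$trace" git clone -q "$work/origin" "$work/clone"

# Show the first few events as a quick sanity check.
head -n 5 "$trace"
```

Comparing the timing columns between the early and late parts of a slow checkout could show which phase grows as more symlinks are written.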

lin-xianming commented 1 year ago

Git is repeatedly accessing symlinks that are already checked out, so checkout gets slower as more symlinks are checked out. One thing of note is that for a repository like the example given in the bug report, none of the symlink targets exist when the repository is cloned.

[image attachment]

dscho commented 1 year ago

That should not happen when configuring the symlink=file Git attribute.

lin-xianming commented 1 year ago

This problem seems to be specific to git-annex repositories with a large number of symlinks, like the one linked in the bug report. I created a repository with 10000 symlinks to non-existent targets (for i in {1..10000}; do ln -s ../$i $i; done) and did not experience any problems with cloning and checkout. I also deleted most of the symlinks from the example repository and no longer saw any repeated access with procmon.
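
The synthetic test described above can be turned into a repeatable script along these lines. This is a sketch: the `COUNT` variable, the temporary directory layout, and the coarse one-second timing are my own choices, and the default is lowered to 1000 so it finishes quickly. git-annex symlinks usually point at long paths under .git/annex/objects, so changing the link target in the loop might be the next variable to try; that is a guess, not something established in this thread.

```shell
set -eu
count=${COUNT:-1000}   # set COUNT=10000 to match the test described above
work=$(mktemp -d)

# Build a repository that contains only symlinks to non-existent targets.
git init -q "$work/origin"
(
  cd "$work/origin"
  i=1
  while [ "$i" -le "$count" ]; do
    ln -s "../$i" "$i"
    i=$((i + 1))
  done
  git add .
  git -c user.name=demo -c user.email=demo@example.com commit -q -m "add $count symlinks"
)

# Time the clone, which includes the checkout of every symlink.
start=$(date +%s)
git clone -q "$work/origin" "$work/clone"
end=$(date +%s)
echo "checked out $count symlinks in $((end - start)) s"
```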

dscho commented 1 year ago

@lin-xianming it would be really interesting to learn what you find when investigating this further.