gitpython-developers / GitPython

GitPython is a python library used to interact with Git repositories.
http://gitpython.readthedocs.org
BSD 3-Clause "New" or "Revised" License
4.65k stars 906 forks source link

is_dirty() is very slow when using diff.astextplain.textconv #1962

Open idbrii opened 2 months ago

idbrii commented 2 months ago

Running is_dirty() on my repo takes 5 minutes because it's a large repo, has text conversion enabled for diffs, and is_dirty() is outputting a full diff. is_dirty() should be a relatively simple operation, but since it uses git diff instead of a plumbing command like git diff-files it incurs the cost of displaying nice output for users.

The diff.astextplain.textconv git option converts pdf files to text before diffing. This option appears to come with msys git. It's useful when diffing interactively, but a lot of overhead when just checking for dirty state.

GitPython doesn't look at the output of the diff, it just checks that it's not empty:

https://github.com/gitpython-developers/GitPython/blob/3470fb3e5ff7f77e5bd19bc264163cd31db4a5df/git/repo/base.py#L957-L977

If we switch from the diff to diff-index, we can see that it's comparable in speed to turning off the text conversions:

$ time (git diff --abbrev=40 --full-index --raw | cat)
...snip...
:100644 100644 66130ffa9aa8b4bc98a1918a946919f94c9a819d 0000000000000000000000000000000000000000 M      src/thread.cpp

real    5m14.842s
user    1m6.922s
sys     4m20.000s

$ time (git diff-index --quiet HEAD -- && echo "clean" || echo "dirty")

real    0m4.517s
user    0m0.203s
sys     0m32.281s

$ echo "*.pdf   -diff=astextplain" > .gitattributes

$ time (git diff --abbrev=40 --full-index --raw | cat)
...snip...
:100644 100644 66130ffa9aa8b4bc98a1918a946919f94c9a819d 0000000000000000000000000000000000000000 M      src/thread.cpp

real    0m4.394s
user    0m0.109s
sys     0m32.250s

(These timings are all after running these command several times. When I first ran git diff it took 26 minutes!)

Workaround

Add a .gitattributes that disables the text conversion:

*.pdf   -diff=astextplain

Solution

I think is_dirty should instead use diff-files and diff-index. This answer looks like a good explanation of how they work.

Here's my rough replacement for is_dirty:

import git

def is_dirty(
    repo,
    index: bool = True,
    working_tree: bool = True,
    untracked_files: bool = False,
    submodules: bool = True,
    path=None,
):
    default_args = []
    if submodules:
        default_args.append("--ignore-submodules")

    if index:
        try:
            # Always want to end with -- (even without path).
            args = default_args + ["--quiet", "--cached", "HEAD", "--", path]
            repo.git.diff_index(*args)
        except git.exc.GitCommandError:
            return True
    if working_tree:
        try:
            args = default_args + ["--quiet", "--", path]
            repo.git.diff_files(*args)
        except git.exc.GitCommandError:
            return True
    if untracked_files:
        if len(repo._get_untracked_files(path, ignore_submodules=not submodules)):
            return True
    return False

repo = git.Repo.init("c:/code/project")
print("is_dirty", is_dirty(repo))

I'll try to find time to make a proper patch if that sounds good.

Byron commented 2 months ago

Thanks so much for this wonderful issue, and for figuring out a solution as well! Yes, please do post a PR for this, the approach sounds perfect.