gitpython-developers / GitPython

GitPython is a python library used to interact with Git repositories.
http://gitpython.readthedocs.org
BSD 3-Clause "New" or "Revised" License
4.62k stars 907 forks source link

stacktrace; `pump stream` fails when commit diff is on a file with a colon in the name #1210

Open tommyjcarpenter opened 3 years ago

tommyjcarpenter commented 3 years ago

Hi

We use this code to check whether a git repo contains only "new" files or whether a file was modified:

def git_check_added_only(repo: Repo):
    prev_commit = repo.commit("HEAD~1")
    head_commit = repo.head.commit
    modified_files = list(prev_commit.diff(head_commit).iter_change_type("M"))
    if len(modified_files) > 0:
        ...

When running this on repos with only csv data, it's fine.

However when running this on repos with binary files in it, like .xlxs, we blow in a stack trace like this; but because it's threaded I am not sure which exact line above is internally causing this "pump"

b"ERROR:git.cmd:Pumping 'stdout' of cmd(['git', 'diff-tree', '8fd339d63f824428d215be363817d637ffae3430', 'ca2fecd4f638c81af31ef7e1a22d48b68a55d746', '-r', '--abbrev=40', '--full-index', '-M', '--raw', '-z', '--no-color']) failed due to: ValueError('not enough values to unpack (expected 5, got 4)')\n"
b'Exception in thread Thread-25:\n'
b'Traceback (most recent call last):\n'
b'  File "/home/mypackage/.local/lib/python3.7/site-packages/git/cmd.py", line 83, in pump_stream\n'
b'    handler(line)\n'
b'  File "/home/mypackage/.local/lib/python3.7/site-packages/git/diff.py", line 488, in handle_diff_line\n'
b'    old_mode, new_mode, a_blob_id, b_blob_id, _change_type = meta.split(None, 4)\n'
b'ValueError: not enough values to unpack (expected 5, got 4)\n'
b'\n'
b'The above exception was the direct cause of the following exception:\n'
b'\n'
b'Traceback (most recent call last):\n'
b'  File "/usr/local/lib/python3.7/threading.py", line 926, in _bootstrap_inner\n'
b'    self.run()\n'
b'  File "/usr/local/lib/python3.7/threading.py", line 870, in run\n'
b'    self._target(*self._args, **self._kwargs)\n'
b'  File "/home/mypackage/.local/lib/python3.7/site-packages/git/cmd.py", line 86, in pump_stream\n'
b"    raise CommandError(['<%s-pump>' % name] + cmdline, ex) from ex\n"
b"git.exc.CommandError: Cmd('<stdout-pump>') failed due to: ValueError('not enough values to unpack (expected 5, got 4)')\n"
b'  cmdline: <stdout-pump> git diff-tree 8fd339d63f824428d215be363817d637ffae3430 ca2fecd4f638c81af31ef7e1a22d48b68a55d746 -r --abbrev=40 --full-index -M --raw -z --no-color\n'
b'\n'

Near that line of code I see https://github.com/gitpython-developers/GitPython/blob/main/git/diff.py#L498

        # handles
        # :100644 100644 687099101... 37c5e30c8... M    .gitignore

I do not know how to get to this output on my git terminal. I was going to try to make a test case here seeing if this looks different for binary files (basically make a PDF or an excel file, then edit it, then commit it; but reading the log doesnt show such a format)

We do not pin the version of gitpython and we rebuild this docker container fairly often so this should be running with the latest version in pypi (488 on your master branch does not coorespond to that line of code; however you've had commits to master since the pypi release)

Byron commented 3 years ago

Thanks for the report.

Would you be able to run git diff-tree 8fd339d63f824428d215be363817d637ffae3430 ca2fecd4f638c81af31ef7e1a22d48b68a55d746 -r --abbrev=40 --full-index -M --raw -z --no-color in the terminal and check the output? With that along with the parsing code here…

https://github.com/gitpython-developers/GitPython/blob/ea43defd777a9c0751fc44a9c6a622fc2dbd18a0/git/diff.py#L509

…you should be able to make out the line that would break such an assumption.

It would be interesting to see if the issue could have a workaround by avoiding to force parsing all output, and instead stopping at the first encountered modification, maybe like so:

if next(prev_commit.diff(head_commit).iter_change_type("M")) is not None: …
tommyjcarpenter commented 3 years ago

@Byron I can add some logging in between each of the three lines in:

def git_check_added_only(repo: Repo):
    prev_commit = repo.commit("HEAD~1")
    head_commit = repo.head.commit
    modified_files = list(prev_commit.diff(head_commit).iter_change_type("M"))
        ...

to help narrow this down.

We want a list of all modified files in our code, so we don't want to stop at the first one, but I'll log what happens with the code you suggested too.

However this runs as part of an automated job in kubernetes that does a git clone and runs some checks on a git repo, so it's not easy to add a terminal intervention

Byron commented 3 years ago

Thanks. I would hope there is a way for you to clone this repo yourself and execute the suggested command-line by hand, it should be reproducible locally. Logging between the lines might be useful if it reveals exactly what GitPython tries to parse - something I wouldn't know how to do in that case without patching GitPython itself.

Ultimately, GitPython encounters a format it doesn't know, and knowing the input should allow to fix it.

As you are indeed interested in the entire list, there shouldn't be a need to try the alternative one for a workaround.

tommyjcarpenter commented 3 years ago

@Byron ok here is a completely reproducible codesample; except I can't give you the git repo that contains the files. Im going to try running your git diff command next

To trigger a stack trace, I go into /tmp/deadlyrepo and edit a .xlxs file, and NOT commit it manually, but rather with this code:

from git import Repo, Actor

def git_check_added_only(repo: Repo):
    prev_commit = repo.commit("HEAD~1")
    head_commit = repo.head.commit
    modified_files = list(prev_commit.diff(head_commit).iter_change_type("M"))
    if len(modified_files) > 0:
        print(modified_files)
        assert 1 == 2

local_repo = Repo("/tmp/deadlyrepo")

local_repo.git.add(A=True)  # https://github.com/gitpython-developers/GitPython/issues/292
if not local_repo.index.diff("HEAD"):
    print("Nothing new to commit/push")

local_repo.index.commit("asdf", author=Actor("foo", "foo@t.com"))

git_check_added_only(local_repo)

this will blow in the stack trace:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/Users/tommycarpenter/.pyenv/versions/gitpython/lib/python3.7/site-packages/git/cmd.py", line 83, in pump_stream
    handler(line)
  File "/Users/tommycarpenter/.pyenv/versions/gitpython/lib/python3.7/site-packages/git/diff.py", line 488, in handle_diff_line
    old_mode, new_mode, a_blob_id, b_blob_id, _change_type = meta.split(None, 4)
ValueError: not enough values to unpack (expected 5, got 4)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/tommycarpenter/.pyenv/versions/3.7.9/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/Users/tommycarpenter/.pyenv/versions/3.7.9/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/tommycarpenter/.pyenv/versions/gitpython/lib/python3.7/site-packages/git/cmd.py", line 86, in pump_stream
    raise CommandError(['<%s-pump>' % name] + cmdline, ex) from ex
git.exc.CommandError: Cmd('<stdout-pump>') failed due to: ValueError('not enough values to unpack (expected 5, got 4)')
  cmdline: <stdout-pump> git diff -R 0ea0a2ab4e0f0e97eac3078abfa64a68a03d15b1 --cached --abbrev=40 --full-index -M --raw -z --no-color

Exception in thread Thread-3:
Traceback (most recent call last):
  File "/Users/tommycarpenter/.pyenv/versions/gitpython/lib/python3.7/site-packages/git/cmd.py", line 83, in pump_stream
    handler(line)
  File "/Users/tommycarpenter/.pyenv/versions/gitpython/lib/python3.7/site-packages/git/diff.py", line 488, in handle_diff_line
    old_mode, new_mode, a_blob_id, b_blob_id, _change_type = meta.split(None, 4)
ValueError: not enough values to unpack (expected 5, got 4)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/tommycarpenter/.pyenv/versions/3.7.9/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/Users/tommycarpenter/.pyenv/versions/3.7.9/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/tommycarpenter/.pyenv/versions/gitpython/lib/python3.7/site-packages/git/cmd.py", line 86, in pump_stream
    raise CommandError(['<%s-pump>' % name] + cmdline, ex) from ex
git.exc.CommandError: Cmd('<stdout-pump>') failed due to: ValueError('not enough values to unpack (expected 5, got 4)')
  cmdline: <stdout-pump> git diff-tree 0ea0a2ab4e0f0e97eac3078abfa64a68a03d15b1 730f97cb0419d1aa2a82c4c31763e4a7f4e8fa4c -r --abbrev=40 --full-index -M --raw -z --no-color

[<git.diff.Diff object at 0x10e77cd70>]
Traceback (most recent call last):
  File "git_test.py", line 23, in <module>
    git_check_added_only(local_repo)
  File "git_test.py", line 10, in git_check_added_only
    assert 1 == 2
AssertionError
tommyjcarpenter commented 3 years ago

@Byron in terms of your other request; ive only edited a sensitive foldername to yyy and a sensitive customer name of the filenames to be XXX

git diff-tree 8fd339d63f824428d215be363817d637ffae3430 ca2fecd4f638c81af31ef7e1a22d48b68a55d746 -r --abbrev=40 --full-index -M --raw -z --no-color                                                                                                          23.2s  Tue Apr 13 08:55:08 2021

PRODUCES

:100644 100644 7fbeb30009ba78ebd3631d706ec70938b4406968 a8e504c39d6d7bec6d21c1df14b3dc4774693306 Myyy/20210406 18:00 XXX Report.xlsx:100644 100644 8660b3fc9cdb936dce715f63e674a8f24bf74291 104c263f955de0569d9fcb9575316cc15abc17d7 Myyy/20210407 18:00 EOD XXX Report.xlsx:100644 100644 c17fe3a7472e9d95cb052d08781920c336d7de49 f5653ac1ecdb7c300074cee36e70e89d9b1b9487 Myyy/20210408 18:00 EOD XXX Report.xlsx⏎

I actually bet this has to do with the fact that these filenames contain spaces (20210406 18:00 XXX Report.xlsx), and nothing to do with binary

Byron commented 3 years ago

Thanks a lot. It turns out that colons : cause the parsing to fail as it also happens to be a separation character. Usually these are escaped, something the parser also couldn't deal with, but thanks to -z which allows spaces in paths there is no escaping anymore.

With a failing test added, it should be possible to implement a fix by improving the parser.

https://github.com/gitpython-developers/GitPython/blob/651a81ded00eb993977bcdc6d65f157c751edb02/git/diff.py#L497

To my mind, a more sane way is to rely on the presence of NULL bytes and split across them instead. There are subtleties with renames showing multiple paths, each of which separated by NULL bytes.

On top of that, the parser implementation is actually a streaming parser which will break if there are newlines in paths or if the line read is truncated (maybe due to a file path being very, very long.

https://github.com/gitpython-developers/GitPython/blob/651a81ded00eb993977bcdc6d65f157c751edb02/git/cmd.py#L80

tommyjcarpenter commented 3 years ago

thanks for your investigation!

btw, if the thought of colons and spaces in filenames makes you cringe.. us too - but we have to sync the customer files exactly as we receive them, so we aren't in a position to simply say "don't do that".

Byron commented 3 years ago

You are welcome! That’s a great motivation to submit a PR I hope, making the new test pass should be all it takes.

In that moment it would be beneficial to also avoid streaming entirely and just store gits output in a string at once, parsing it afterwards, eliminating another error source in the process.

onthebridgetonowhere commented 2 years ago

Thanks @tommyjcarpenter - I found a similar problem in another project. I hope it's ok to re-use this issue for more "test" cases. A similar problem happens when working with the Airavata open source project.

Traceback (most recent call last):
  File "\Programs\Python\Python310\lib\threading.py", line 1009, in _bootstrap_inner
    self.run()
  File "\Programs\Python\Python310\lib\threading.py", line 946, in run
    self._target(*self._args, **self._kwargs)
  File "\Programs\Python\Python310\lib\site-packages\git\cmd.py", line 112, in pump_stream
    raise CommandError(['<%s-pump>' % name] + remove_password_if_present(cmdline), ex) from ex
git.exc.CommandError: Cmd('<stdout-pump>') failed due to: ValueError('not enough values to unpack (expected 5, got 1)')
  cmdline: <stdout-pump> git diff-tree 81ad6eadf2df50a5f9abfbbe8f18fe6f98fc63d1 00b8aaf0521ecfc17b0dc4ef53fc30376d4716be -r --abbrev=40 --full-index -M --raw -z --no-color 

If I run the command manually, I get the following

:000000 100644 0000000000000000000000000000000000000000 24bf1af74e56848c6ba20f2e5a973115890a58b4 A airavata-api/airavata-client-sdks/airavata-cpp-sdk/src/main/resources/client samples/test.cpp~ :000000 100644 0000000000000000000000000000000000000000 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 A airavata-api/airavata-client-sdks/airavata-cpp-sdk/src/main/resources/client samples/test:c~

If I run the command without -z, it seems to be fine:

:000000 100644 0000000000000000000000000000000000000000 24bf1af74e56848c6ba20f2e5a973115890a58b4 A      airavata-api/airavata-client-sdks/airavata-cpp-sdk/src/main/resources/client samples/test.cpp~
:000000 100644 0000000000000000000000000000000000000000 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 A      airavata-api/airavata-client-sdks/airavata-cpp-sdk/src/main/resources/client samples/test:c~

Any idea how to fix this?