Crash with unicode filename and Python3

petertodd commented 8 years ago

Here's a repo that reproduces the error: https://github.com/petertodd/gitpython-unicode-error

The filename in question is 'v\udcbb', and print('v\udcbb') raises "UnicodeEncodeError: 'utf-8' codec can't encode character '\udcbb' in position 1: surrogates not allowed", so likely there's some kind of unicode-related problem here.

ankostis commented 8 years ago

Thanks @petertodd. What platform and what gitpython version?

petertodd commented 8 years ago

Oh sorry, all those details are in the commit message for the test repo: https://github.com/petertodd/gitpython-unicode-error/commit/e8a5d25632b13c15bbc33e0f6323f86a673b220f

Byron commented 8 years ago

Thanks a lot for the easy way reproduce the issue ! The error I get on OSX seems to be a different one, but in any case, it should be fixed:

Traceback (most recent call last):
  File "./test.py", line 9, in <module>
    tuple(tree)
  File "/Users/byron/dev/py/GitPython/git/objects/tree.py", line 207, in _iter_convert_to_object
    path = join_path(self.path, name)
  File "/Users/byron/dev/py/GitPython/git/util.py", line 125, in join_path
    if b.startswith('/'):
TypeError: startswith first arg must be bytes or a tuple of bytes, not str

Byron commented 8 years ago

I have had a look, and just want to leave my findings here in case someone else wants to fix it for good. The problem GitPython generally has is its attempt to convert something it thinks could be paths to unicode. It does so with the best intentions, but causes more issues than it solves.

The current tree implementation tries to decode bytes as unicode, but simply returns bytes if that fails. Thus the type returned is bytes|unicode. Code using that later, like join_path would assume a unicode object, and fails in otherwise.

What I tried to do is to remove the conditional behaviour in tree and make join_path work with bytes. My problem ended up to be that in python, there is no way to just 'reinterpretunicode as bytes - one always has to go through an encoding step, which requires an encoding GitPython doesn't even know. Probably there are encodings out there which don't encode/`or`` as we know it in ascii, but as far as GitPython is concerned, it's either utf-8 or ascii. It's a bit of a corner GitPython seems to be in, which might require more work to get out of.

ankostis commented 8 years ago

The situation is further complicated by the transition from Py2-->PY3, I know. But for the specific problem you describe, surrogate escaping (PEP 383) seems like a perfect fit to this case. So instead of using plain utf-8, just use decode('utf-8', 'surrogateescape').

[edit] Note that in Windows there are 2 kinds of errors still remaining: resource leaks and unicode, so if you fix that, code-coverage will increase significantly also in this platform.

Byron commented 8 years ago

@ankostis Thanks for the tip! That seems to be exactly what I was looking for ! Now I am curious whether the barrage of CI nodes still like the project, my local testing was certainly spotty.

ankostis commented 8 years ago

Wow, amazingly fast (finding also a backported PEP 383).

Byron commented 8 years ago

Fast, but not good :)! I broke the build - am on it now, let's hope I can do it in the remaining 30m I have for today.

petertodd commented 8 years ago

Awesome! Thanks for the quick response; looking forward to the next release.

petertodd commented 8 years ago

FWIW, this is the software I'm using GitPython for:

https://petertodd.org/2016/opentimestamps-announcement

specifically Git commit timestamping:

https://petertodd.org/2016/opentimestamps-git-integration

gitpython-developers / GitPython

Crash with unicode filename and Python3 #532