Closed petertodd closed 8 years ago
Thanks @petertodd. What platform and what gitpython version?
Oh sorry, all those details are in the commit message for the test repo: https://github.com/petertodd/gitpython-unicode-error/commit/e8a5d25632b13c15bbc33e0f6323f86a673b220f
Thanks a lot for the easy way reproduce the issue ! The error I get on OSX seems to be a different one, but in any case, it should be fixed:
Traceback (most recent call last):
File "./test.py", line 9, in <module>
tuple(tree)
File "/Users/byron/dev/py/GitPython/git/objects/tree.py", line 207, in _iter_convert_to_object
path = join_path(self.path, name)
File "/Users/byron/dev/py/GitPython/git/util.py", line 125, in join_path
if b.startswith('/'):
TypeError: startswith first arg must be bytes or a tuple of bytes, not str
I have had a look, and just want to leave my findings here in case someone else wants to fix it for good. The problem GitPython generally has is its attempt to convert something it thinks could be paths to unicode. It does so with the best intentions, but causes more issues than it solves.
The current tree implementation tries to decode bytes as unicode, but simply returns bytes if that fails. Thus the type returned is bytes|unicode
. Code using that later, like join_path
would assume a unicode
object, and fails in otherwise.
What I tried to do is to remove the conditional behaviour in tree
and make join_path
work with bytes. My problem ended up to be that in python, there is no way to just 'reinterpretunicode as bytes - one always has to go through an encoding step, which requires an encoding GitPython doesn't even know. Probably there are encodings out there which don't encode
/`or`` as we know it in ascii, but as far as GitPython is concerned, it's either utf-8 or ascii.
It's a bit of a corner GitPython seems to be in, which might require more work to get out of.
The situation is further complicated by the transition from Py2-->PY3, I know.
But for the specific problem you describe, surrogate escaping (PEP 383) seems like a perfect fit to this case.
So instead of using plain utf-8
, just use decode('utf-8', 'surrogateescape')
.
[edit] Note that in Windows there are 2 kinds of errors still remaining: resource leaks and unicode, so if you fix that, code-coverage will increase significantly also in this platform.
@ankostis Thanks for the tip! That seems to be exactly what I was looking for ! Now I am curious whether the barrage of CI nodes still like the project, my local testing was certainly spotty.
Wow, amazingly fast (finding also a backported PEP 383).
Fast, but not good :)! I broke the build - am on it now, let's hope I can do it in the remaining 30m I have for today.
Awesome! Thanks for the quick response; looking forward to the next release.
FWIW, this is the software I'm using GitPython for:
https://petertodd.org/2016/opentimestamps-announcement
specifically Git commit timestamping:
Here's a repo that reproduces the error: https://github.com/petertodd/gitpython-unicode-error
The filename in question is 'v\udcbb', and print('v\udcbb') raises "UnicodeEncodeError: 'utf-8' codec can't encode character '\udcbb' in position 1: surrogates not allowed", so likely there's some kind of unicode-related problem here.