Open StyXman opened 9 years ago
Thanks for posting this issue !
It seems that the TagObject's information can't be decoded as it contains a non-utf-8 encoding which is unexpected. Maybe it is safer to not attempt to decode anything, and leave that to the client, who could read the bytes of the associated tag-object and parse them with a suitable encoding in mind.
Even though you have already discovered a workaround, the original problem remains. A proper fix would re-evaluate the current code and prefer to work on bytes instead of a decoded string.
In fact, no, tag.object.hexsha is not what I'm looking for.
More data: technically this is an encoding error in the data itself:
In [11]: stream= pricing.odb.stream(tag.object.binsha)
In [12]: stream.read()
Out[12]: 'object 4b50858c4debda3ad5d6ea5b7a485cd4eb5ecc73\ntype commit\ntag PROMOTED_1501131729_MKT15_01_12_QU_1\ntagger \xa8John Doe <jdoe@megacorp.com> 1421167136 +0100\n\nMerged CIU_MKT1501_28 to remote master\n'
You can see the offending character just before the tagger's name (technically it is part of the name). On the other hand, I don't even know whether git handles this, but what happens when different objects are encoded with different encodings? I'm pretty sure git objects do not store this kind of info...
Using the tag.object it should be straightforward to obtain the raw bytes stored in the tag object, in case this is what you are actually looking for. Those represent a few formatted lines of information, which could be parsed with code similar to the one currently in use.
Parsing can only safely operate on bytes though, as the encoding seems not to be UTF-8 at all times.
Even if parsing is made to work at some point, right now the tagger name is expected to be a str/unicode instance, which can't be obtained if the encoding of the underlying bytes is unknown.
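To illustrate what parsing on bytes could look like, here is a minimal sketch that works on the exact bytes from the stream.read() output earlier in this thread. The helper name parse_tag_bytes is made up for this example and is not part of GitPython; it only assumes the layout visible in that output (header lines, a blank line, then the message).

```python
# Raw tag-object bytes, copied from the stream.read() output in this thread.
RAW_TAG = (
    b"object 4b50858c4debda3ad5d6ea5b7a485cd4eb5ecc73\n"
    b"type commit\n"
    b"tag PROMOTED_1501131729_MKT15_01_12_QU_1\n"
    b"tagger \xa8John Doe <jdoe@megacorp.com> 1421167136 +0100\n"
    b"\n"
    b"Merged CIU_MKT1501_28 to remote master\n"
)


def parse_tag_bytes(raw):
    """Hypothetical helper: split a tag object into header fields and message.

    Everything stays as bytes, so no assumption about the encoding is made.
    """
    header, _, message = raw.partition(b"\n\n")
    fields = {}
    for line in header.split(b"\n"):
        key, _, value = line.partition(b" ")
        fields[key] = value
    return fields, message


fields, message = parse_tag_bytes(RAW_TAG)
print(fields[b"tagger"])  # the stray 0xa8 byte is preserved, not decoded
print(message)
```

The caller can then decode fields[b"tagger"] with whatever encoding and error policy suits the repository.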
What about using decode(defenc, 'ignore')? I hope it doesn't break anything else. I'll try that locally.
Great idea ! Of course it's questionable whether the program should silently drop information, instead of loudly abort operation as it currently does. It seems that it's generally unwise to make assumptions about the encoding in TagObjects, so the implementation should leave it to the client to deal with that and provide byte-strings only.
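For reference, this is how the three standard error handlers behave on the tagger bytes from this issue. It is only a sketch and assumes defenc resolves to UTF-8; decode_with is a throwaway helper for this comparison.

```python
# Tagger name bytes from the tag object in this issue (leading 0xa8 is invalid UTF-8).
TAGGER = b"\xa8John Doe <jdoe@megacorp.com>"


def decode_with(handler):
    """Decode TAGGER as UTF-8 with the given error handler, reporting failures."""
    try:
        return TAGGER.decode("utf-8", handler)
    except UnicodeDecodeError as exc:
        return "error: " + exc.reason


strict = decode_with("strict")      # aborts loudly, like the current behaviour
ignored = decode_with("ignore")     # silently drops the bad byte
replaced = decode_with("replace")   # keeps a visible U+FFFD marker instead

for name, value in [("strict", strict), ("ignore", ignored), ("replace", replaced)]:
    print(name, repr(value))
```

With 'ignore' the caller has no way to tell that a byte was dropped, whereas 'replace' leaves a marker where the data was damaged.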
But that would be against your policy of handling as much as possible as unicode (if I correctly understood #312)...
BTW, that fixed my particular problem, but I guess you don't want the PR just yet...
But how would you want to produce proper unicode strings if the encoding is unclear ? It's unsafe to try it, which is showing in this example. The truth is that I am not entirely sure how git itself handles encodings, and it might be that GitPython actually went down a wrong path by trying to just decode textual data as UTF-8. The latter works most of the time, but that's not really good enough.
Maybe a suitable solution would be to allow the client to set the decode behaviour on a per-repository basis, to control whether .decode(defenc, 'ignore') is acceptable.
Doing this sounds like quite some work - and as it stands, the unicode handling in GitPython seems flawed by design :(.
I think git just doesn't handle encoding at all. In any case, any free form byte sequences (strings) are strings for user consumption: tag names, logs comments, etc. Even filenames are, I'm sure, not converted in any way. In fact, most (Unix/Linux) filesystems know nothing about encoding: it's possible to handle filenames encoded in one encoding in a system using another encoding, simply because filenames are treated as byte sequences with no specific meaning or encoding.
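As a side note on treating such strings as opaque bytes: Python itself deals with undecodable filenames on POSIX through the surrogateescape error handler (PEP 383), which is what os.fsdecode relies on there. Nobody in this thread proposed it for GitPython; the sketch below only shows that it round-trips the bytes from this issue without losing anything.

```python
RAW_NAME = b"\xa8John Doe"

# Undecodable bytes are mapped to lone surrogates (0xa8 becomes U+DCA8)
# instead of raising or being dropped.
text = RAW_NAME.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("utf-8", "surrogateescape")

print(repr(text))
print(round_tripped == RAW_NAME)
```

The catch is that such a string cannot be encoded with the default strict handler, so printing it or writing it to a file may still fail later.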
I have encountered a similar problem: when invoking diff on a file that contains a byte sequence that is invalid UTF-8 in this locale, GitPython fails with UnicodeDecodeError. Backtrace follows:
File "/usr/lib/python2.7/site-packages/gitupstream/gitupstream.py", line 175, in update
diff = self._repo.git.diff('--full-index', self._mainline, self._rebased)
File "/usr/lib/python2.7/site-packages/git/cmd.py", line 431, in
Will this issue cover my error, or do I need to create another one? Maybe you could help me with a solution?
@StyXman You are totally right. As stated previously, fixing this in GitPython may be a breaking change for some, as bytes would be returned instead of unicode. This makes me somewhat reluctant to attempt such a change, but I should check how much is actually affected.
@CepGamer You can pass the stdout_as_string=False keyword argument when executing .git.diff (i.e. .git.diff(..., stdout_as_string=False)), or use GitPython's own diffing facilities.
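A sketch of how that workaround could be wrapped, assuming repo is a git.Repo instance. The function name diff_as_text and the choice of errors="replace" are my own assumptions for this example, not something GitPython provides.

```python
def diff_as_text(repo, rev_a, rev_b, encoding="utf-8"):
    """Hypothetical wrapper: fetch the diff as raw bytes, then decode leniently.

    `repo` is expected to be a git.Repo; stdout_as_string=False makes
    repo.git.diff return bytes instead of attempting to decode them.
    """
    raw = repo.git.diff("--full-index", rev_a, rev_b, stdout_as_string=False)
    # Undecodable bytes become U+FFFD instead of raising UnicodeDecodeError.
    return raw.decode(encoding, errors="replace")


# Usage, with a placeholder path and revisions:
#   from git import Repo
#   repo = Repo("path/to/repo")
#   print(diff_as_text(repo, "master", "feature"))
```

If the exact bytes matter (for example when writing a patch file), it is probably better to skip the decode step and keep the bytes as they are.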
I believe I ran into a similar issue. When querying the commit message for a commit, the following exception is thrown:
ERROR:git.objects.commit:Failed to decode message '...' using encoding UTF-8
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/git/objects/commit.py", line 500, in _deserialize
self.message = self.message.decode(self.encoding)
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf8 in position 126: invalid start byte
Unfortunately, I cannot share the exact commit message or the repository, and I did not succeed in reproducing it in a test repository. Perhaps a solution would be to provide an option to disable decoding?
A new release was just made to pypi 😁 (see #298) !
Git does store the encoding, if I understand this correctly: https://git-scm.com/docs/git-commit (Discussion section pretty much at the bottom). The relevant statements from that section are:
- The way to say this [i.e. the encoding] is to have i18n.commitencoding in .git/config file, like this:

    [i18n]
        commitencoding = ISO-8859-1

- Commit objects created with the above setting record the value of i18n.commitencoding in its encoding header. This is to help other people who look at them later. Lack of this header implies that the commit log message is encoded in UTF-8.

The last statement in the list above is the key here. Assuming the GitPython code can access the encoding header (sorry, I'm new to GitPython development), it can safely determine the encoding, because lack of the header specifically means UTF-8. That could then be specified as the encoding in the decode() call that failed in this issue here. I think that would be superior to an approach that treats the commit message as bytes.
That still does not address the original issue of illegal characters in the encoding that was used. That could be addressed by using errors='replace' in the decode() call.
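Roughly, the two ideas combined (use the encoding header if present, fall back to UTF-8, and decode with errors='replace') could look like the sketch below. It works on raw commit-object bytes; decode_commit_message and the sample commits are made up for illustration, and this is not how GitPython is actually structured.

```python
def decode_commit_message(raw):
    """Hypothetical helper: decode a commit message using its encoding header.

    A missing header means UTF-8, as the git-commit documentation states.
    """
    header, _, message = raw.partition(b"\n\n")
    encoding = "utf-8"
    for line in header.split(b"\n"):
        if line.startswith(b"encoding "):
            encoding = line[len(b"encoding "):].decode("ascii", "replace")
            break
    try:
        return message.decode(encoding, errors="replace")
    except LookupError:
        # Unknown codec name in the header: fall back to UTF-8.
        return message.decode("utf-8", errors="replace")


# Made-up commit objects; author and committer are placeholders.
HEADER = (
    b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    b"author A U Thor <author@example.com> 1421167136 +0100\n"
    b"committer A U Thor <author@example.com> 1421167136 +0100\n"
)
with_header = HEADER + b"encoding ISO-8859-1\n\ncaf\xe9\n"
without_header = HEADER + b"\ncaf\xe9\n"

print(repr(decode_commit_message(with_header)))
print(repr(decode_commit_message(without_header)))
```

In the second case the byte 0xe9 is not valid UTF-8, so it shows up as U+FFFD rather than aborting the whole operation.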
I fully support what @andy-maier just said here.
:-( I stumbled upon this one too. I was waiting for the v2.0.8 release in hope of the fix. Please take this one seriously, if possible, for the v2.0.9.
The .decode() call on git/objects/tag.py:56 should get a 'replace' arg to fix this issue. @ppietrasa can you try out that fix and report if that works? If so, can you make a PR for it?
I have run into this problem.
This script, which tries to loop through the tags of the nodejs/node repository, exposes this bug:
https://gist.github.com/sbenthall/14c4d14c00876440ba6d0ae62efa432f
Using version 2.1.11
I have the same issue when reading the branches property. How can I solve it?
repo = Repo(r'')
print(repo.branches)
I have a similar question.
Traceback (most recent call last):
File "D:/Python/src/post.py", line 16, in <module>
print(repo.branches)
File "D:\Programs\Python37\lib\site-packages\git\repo\base.py", line 289, in heads
return Head.list_items(self)
File "D:\Programs\Python37\lib\site-packages\git\util.py", line 922, in list_items
out_list.extend(cls.iter_items(repo, *args, **kwargs))
File "D:\Programs\Python37\lib\site-packages\git\refs\symbolic.py", line 616, in _iter_items
for _sha, rela_path in cls._iter_packed_refs(repo):
File "D:\Programs\Python37\lib\site-packages\git\refs\symbolic.py", line 91, in _iter_packed_refs
for line in fp:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xae in position 538: illegal multibyte sequence
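Judging from the traceback, the packed-refs file seems to be read in text mode with the locale's default codec (gbk here), so any ref name that is not valid in that codec fails. A sketch of reading the file as bytes and decoding explicitly follows; iter_packed_refs is a standalone function written for this example, not GitPython's own method, and the sample file is made up.

```python
import os
import tempfile


def iter_packed_refs(path):
    """Yield (sha, refname) pairs from a packed-refs file, independent of locale."""
    with open(path, "rb") as fp:  # binary mode: no implicit locale decoding
        for raw_line in fp:
            line = raw_line.rstrip(b"\r\n")
            # Skip blank lines, the "# pack-refs with:" header, and "^" peeled lines.
            if not line or line.startswith((b"#", b"^")):
                continue
            sha, _, refname = line.partition(b" ")
            yield sha.decode("ascii"), refname.decode("utf-8", errors="replace")


# Demo with a made-up packed-refs file containing a non-ASCII branch name.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "packed-refs")
    with open(path, "wb") as fp:
        fp.write(b"# pack-refs with: peeled fully-peeled sorted\n")
        fp.write(b"4b50858c4debda3ad5d6ea5b7a485cd4eb5ecc73 refs/heads/caf\xc3\xa9\n")
    refs = list(iter_packed_refs(path))

print(refs)
```

Whether the ref names in the affected repository are actually UTF-8 is an assumption; if they are in some other encoding, errors="replace" at least avoids the crash.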
I have a similar question too🥲
pr_repo = g.get_repo(repo_name)
"/Users/mac/.anyenv/envs/pyenv/versions/3.9.9/lib/python3.9/http/client.py", line 1258, in putheader
values[i] = one_value.encode('latin-1')
UnicodeEncodeError: 'latin-1' codec can't encode character '\u201c' in position 6: ordinal not in range(256)
I'm not sure this is a proper way to use TagReferences, but it's definitely unexpected. This time I'm using GitPython installed from PyPI.
I have this nice tag:
I can get a lot of info out of it:
But this fails:
Unluckily, this is happening with an internal repo and I don't know how to even try to reproduce it with a public one. Meanwhile I can work around it by using tag.object.hexsha, which is what I wanted.