Encoding problem - Githubissues

StyXman commented 9 years ago

I'm not sure this is the proper way to use TagReferences, but the behaviour is definitely unexpected. This time I'm using GitPython installed from PyPI.

I have this nice tag:

In [8]: tag
Out[8]: <git.TagReference "refs/tags/PROMOTED_1501131729_MKT15_01_12_QU_1">

I can get a lot of info out of it:

In [9]: tag.object.hexsha
Out[9]: u'dca63c5c7e6aab3cd4934e60230ec3419ab87071'

In [12]: tag.name
Out[12]: 'PROMOTED_1501131729_MKT15_01_12_QU_1'

In [13]: tag.object
Out[13]: <git.TagObject "dca63c5c7e6aab3cd4934e60230ec3419ab87071">

In [14]: tag.ref
TypeError: PROMOTED_1501131729_MKT15_01_12_QU_1 is a detached symbolic reference as it points to 'dca63c5c7e6aab3cd4934e60230ec3419ab87071'

But this fails:

In [15]: tag.commit
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-15-2431a6e80cf9> in <module>()
----> 1 tag.commit

/home/mdione/local/lib/python2.7/site-packages/git/refs/tag.pyc in commit(self)
     29         elif obj.type == "tag":
     30             # it is a tag object which carries the commit as an object - we can point to anything
---> 31             return obj.object
     32         else:
     33             raise ValueError("Tag %s points to a Blob or Tree - have never seen that before" % self)

/home/mdione/local/lib/python2.7/site-packages/gitdb/util.pyc in __getattr__(self, attr)
--> 237         self._set_cache_(attr)
    238         # will raise in case the cache was not created
    239         return object.__getattribute__(self, attr)

/home/mdione/local/lib/python2.7/site-packages/git/objects/tag.pyc in _set_cache_(self, attr)
     54         if attr in TagObject.__slots__:
     55             ostream = self.repo.odb.stream(self.binsha)
---> 56             lines = ostream.read().decode(defenc).splitlines()
     57
     58             obj, hexsha = lines[0].split(" ")       # object <hexsha>

/usr/lib/python2.7/encodings/utf_8.pyc in decode(input, errors)
     14
     15 def decode(input, errors='strict'):
---> 16     return codecs.utf_8_decode(input, errors, True)
     17
     18 class IncrementalEncoder(codecs.IncrementalEncoder):

UnicodeDecodeError: 'utf8' codec can't decode byte 0xa8 in position 108: invalid start byte

Unfortunately, this is happening with an internal repo, and I don't know how to even try to reproduce it with a public one. Meanwhile, I can work around it by using tag.object.hexsha, which is what I wanted.

Byron commented 9 years ago

Thanks for posting this issue!

It seems that the TagObject's information can't be decoded because it contains bytes that are not valid UTF-8, which is unexpected. Maybe it is safer not to attempt to decode anything and to leave that to the client, who could read the bytes of the associated tag object and parse them with a suitable encoding in mind.

Even though you have already discovered a workaround, the original problem remains. A proper fix would re-evaluate the current code and prefer to work on bytes instead of a decoded string.

StyXman commented 9 years ago

In fact, no, tag.object.hexsha is not what I'm looking for.

StyXman commented 9 years ago

More data: technically this is an encoding error in the data itself:

In [11]: stream= pricing.odb.stream(tag.object.binsha)

In [12]: stream.read()
Out[12]: 'object 4b50858c4debda3ad5d6ea5b7a485cd4eb5ecc73\ntype commit\ntag PROMOTED_1501131729_MKT15_01_12_QU_1\ntagger \xa8John Doe <jdoe@megacorp.com> 1421167136 +0100\n\nMerged CIU_MKT1501_28 to remote master\n'

You can see the offending character just before the tagger's name (technically it is part of the name). On the other hand, I don't even know whether git handles this, but what happens when different objects are encoded with different encodings? I'm pretty sure git objects do not store this kind of info...

Byron commented 9 years ago

Using the tag.object it should be straightforward to obtain the raw-bytes stored in the tag-object, in case this is what you are actually looking for. Those represent a few formatted lines of information, which could be parsed with code similar to the one currently in use. Parsing can only safely operate on bytes though, as the encoding seems not to be UTF-8 at all times.

Even if parsing is made to work at some point, right now the tagger name is expected to be a str/unicode instance, which can't be obtained if the encoding of the underlying bytes is unknown.
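To make the bytes-only idea concrete, here is a minimal sketch (Python 3, not GitPython's actual code) that splits the raw tag object quoted earlier in this thread into header fields and message without decoding any values. The name `parse_tag_object` is made up for illustration, and the sketch assumes the simple `key value` header layout seen in that output.

```python
def parse_tag_object(raw: bytes) -> dict:
    """Split a raw tag object into header fields and message, keeping values as bytes."""
    # The header block and the message are separated by the first blank line.
    header, _, message = raw.partition(b"\n\n")
    fields = {}
    for line in header.split(b"\n"):
        key, _, value = line.partition(b" ")
        # Header keys are plain ASCII keywords; values are left undecoded.
        fields[key.decode("ascii")] = value
    fields["message"] = message
    return fields


# Raw bytes as printed by stream.read() earlier in this thread.
raw = (
    b"object 4b50858c4debda3ad5d6ea5b7a485cd4eb5ecc73\n"
    b"type commit\n"
    b"tag PROMOTED_1501131729_MKT15_01_12_QU_1\n"
    b"tagger \xa8John Doe <jdoe@megacorp.com> 1421167136 +0100\n\n"
    b"Merged CIU_MKT1501_28 to remote master\n"
)
tag_fields = parse_tag_object(raw)
```

The tagger value stays a byte string here, so the stray `\xa8` does no harm until a caller decides how, or whether, to decode it.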

StyXman commented 9 years ago

What about using decode(defenc, 'ignore')? I hope it doesn't break anything else. I'll try that locally.

Byron commented 9 years ago

Great idea! Of course, it's questionable whether the program should silently drop information instead of loudly aborting the operation, as it currently does. It seems generally unwise to make assumptions about the encoding in TagObjects, so the implementation should leave that to the client and provide byte strings only.
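For reference, a small sketch of the trade-off, written for Python 3 and using the tagger line from this thread. It compares the standard error handlers of `bytes.decode`; nothing here is GitPython-specific, and `surrogateescape` is included only as an illustration of a handler that keeps the original bytes recoverable.

```python
raw = b"tagger \xa8John Doe <jdoe@megacorp.com> 1421167136 +0100"

# Default ("strict") handling: the decode aborts loudly.
try:
    raw.decode("utf-8")
    strict_result = "decoded"
except UnicodeDecodeError as exc:
    strict_result = f"failed at byte {exc.start}"

# "ignore" silently drops the undecodable byte.
ignored = raw.decode("utf-8", "ignore")

# "replace" substitutes U+FFFD, so the loss is at least visible.
replaced = raw.decode("utf-8", "replace")

# "surrogateescape" maps the byte to a lone surrogate, so the exact
# original bytes can be recovered by encoding with the same handler.
escaped = raw.decode("utf-8", "surrogateescape")
round_trip = escaped.encode("utf-8", "surrogateescape")
```

With `ignore`, the result no longer says that anything was wrong with the input, which is the silent information loss in question.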

StyXman commented 9 years ago

But that would be against your policy of handling as much as possible as unicode (if I correctly understood #312)...

StyXman commented 9 years ago

BTW, that fixed my particular problem, but I guess you don't want the PR just yet...

Byron commented 9 years ago

But how would you produce proper unicode strings if the encoding is unclear? It's unsafe to try, as this example shows. The truth is that I am not entirely sure how git itself handles encodings, and it may be that GitPython went down the wrong path by simply decoding textual data as UTF-8. That works most of the time, but that's not really good enough.

Maybe a suitable solution would be to allow the client to set the decode-behaviour on a per-repository basis to control whether .decode(defenc, 'ignore') is acceptable.
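A rough sketch of what such a per-repository setting could look like, in Python 3. Everything here is hypothetical: `DecodePolicy`, `policies`, and `policy_for` are invented names for illustration and do not exist in GitPython.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodePolicy:
    """Hypothetical per-repository decode settings (not a GitPython API)."""
    encoding: str = "utf-8"
    errors: str = "strict"  # could also be "ignore" or "replace"

    def decode(self, raw: bytes) -> str:
        return raw.decode(self.encoding, self.errors)


# Hypothetical registry: repository path -> policy chosen by the client.
policies = {"/path/to/lenient/repo": DecodePolicy(errors="ignore")}


def policy_for(repo_path: str) -> DecodePolicy:
    # Repositories without an explicit policy keep the strict behaviour.
    return policies.get(repo_path, DecodePolicy())


lenient_name = policy_for("/path/to/lenient/repo").decode(b"\xa8John Doe")
```

Keeping `strict` as the default would preserve the current loud failure unless a client opts in to lossy decoding.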

Doing this sounds like quite some work - and as it stands, the unicode handling in GitPython seems flawed by design :(.

StyXman commented 9 years ago

I think git just doesn't handle encoding at all. In any case, any free-form byte sequences (strings) are strings for user consumption: tag names, log comments, etc. Even filenames are, I'm sure, not converted in any way. In fact, most (Unix/Linux) filesystems know nothing about encoding: it's possible to handle filenames encoded in one encoding on a system using another encoding, simply because filenames are treated as byte sequences with no specific meaning or encoding.

CepGamer commented 9 years ago

I have encountered a similar problem: when invoking diff on a file that contains a byte sequence that is not valid UTF-8 in this locale, GitPython fails with a UnicodeDecodeError. The backtrace follows:

  File "/usr/lib/python2.7/site-packages/gitupstream/gitupstream.py", line 175, in update
    diff = self._repo.git.diff('--full-index', self._mainline, self._rebased)
  File "/usr/lib/python2.7/site-packages/git/cmd.py", line 431, in <lambda>
    return lambda *args, **kwargs: self._call_process(name, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/git/cmd.py", line 802, in _call_process
    return self.execute(make_call(), **_kwargs)
  File "/usr/lib/python2.7/site-packages/git/cmd.py", line 610, in execute
    stdout_value = stdout_value.decode(defenc)
  File "/usr/lib64/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 225: invalid continuation byte

Does this issue cover my error, or do I need to create another one? Maybe you could help me with a solution?

Byron commented 9 years ago

@StyXman You are totally right. As stated previously, fixing this in GitPython may be a breaking change for some, as bytes would be returned instead of unicode. This makes me somewhat reluctant to attempt such a change, but I should check how much is actually affected.

@CepGamer You can pass the stdout_as_string=False keyword argument when executing .git.diff (i.e. .git.diff(..., stdout_as_string=False)), or use GitPython's own diffing facilities.
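Once the output arrives as bytes, the caller has to pick a decoding strategy. Below is a minimal Python 3 sketch of one possible choice: try UTF-8 first and fall back to a single-byte encoding. The helper name `decode_git_output` and the Latin-1 fallback are assumptions for illustration, not something GitPython provides, and the sample bytes are made up.

```python
def decode_git_output(raw: bytes, fallback: str = "latin-1") -> str:
    """Decode command output as UTF-8, falling back to another encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this cannot fail,
        # but the result is only meaningful if the guess is right.
        return raw.decode(fallback)


# With GitPython the bytes would come from a call such as
#   raw = repo.git.diff("--full-index", a, b, stdout_as_string=False)
# Here a made-up diff fragment stands in for that output.
raw = b"-caf\xe0\n+cafe\n"
text = decode_git_output(raw)
```

The byte `0xe0` followed by a newline is the same kind of "invalid continuation byte" reported in the backtrace above.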

maikelsteneker commented 8 years ago

I believe I ran into a similar issue. When querying the commit message for a commit, the following exception is thrown:

ERROR:git.objects.commit:Failed to decode message '...' using encoding UTF-8
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/git/objects/commit.py", line 500, in _deserialize
    self.message = self.message.decode(self.encoding)
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf8 in position 126: invalid start byte

Unfortunately, I cannot share the exact commit message or the repository, and I did not succeed in reproducing it in a test repository. Perhaps one solution would be to provide an option to disable decoding?

Byron commented 8 years ago

A new release was just published to PyPI 😁 (see #298)!

andy-maier commented 8 years ago

Git does store the encoding, if I understand this correctly: https://git-scm.com/docs/git-commit (Discussion section pretty much at the bottom). The relevant statements from that section are:

The contents of the blob objects are uninterpreted sequences of bytes. There is no encoding translation at the core level.
Commit log messages are typically encoded in UTF-8, but other extended ASCII encodings are also supported. This includes ISO-8859-x, CP125x and many others, but not UTF-16/32, EBCDIC and CJK multi-byte encodings (GBK, Shift-JIS, Big5, EUC-x, CP9xx etc.).
The way to say this [i.e. the encoding] is to have i18n.commitencoding in .git/config file, like this:
```
[i18n]
  commitencoding = ISO-8859-1
```
Commit objects created with the above setting record the value of i18n.commitencoding in its encoding header. This is to help other people who look at them later. Lack of this header implies that the commit log message is encoded in UTF-8.

The last statement in the list above is the key here. Assuming the GitPython code can access the encoding header (sorry, I'm new to GitPython development), it can safely determine the encoding, because lack of the header specifically means UTF-8. That could then be specified as the encoding in the decode() call that failed in this issue here. I think that would be superior to an approach that treats the commit message as bytes.

That still does not address the original issue of illegal characters in the encoding that was used. That could be addressed by using errors='replace' in the decode() call.
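A minimal sketch of that idea in Python 3, working directly on the raw bytes of a commit object. The function name `decode_commit_message` and the sample commits are made up for illustration; this is not how GitPython implements it, and the header layout is assumed to be the plain `key value` lines that git writes.

```python
def decode_commit_message(raw: bytes) -> str:
    """Decode a commit message using the commit's own encoding header."""
    header, _, message = raw.partition(b"\n\n")
    encoding = "utf-8"  # a missing header implies UTF-8, per the git-commit docs
    for line in header.split(b"\n"):
        if line.startswith(b"encoding "):
            encoding = line[len(b"encoding "):].decode("ascii", "replace")
    try:
        return message.decode(encoding, "replace")
    except LookupError:
        # Unknown encoding name: fall back to UTF-8 with replacement.
        return message.decode("utf-8", "replace")


# Made-up sample commits; the author and committer are placeholders.
with_header = (
    b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    b"author A U Thor <author@example.com> 1421167136 +0100\n"
    b"committer A U Thor <author@example.com> 1421167136 +0100\n"
    b"encoding ISO-8859-1\n\n"
    b"Fix caf\xe9\n"
)
without_header = (
    b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    b"author A U Thor <author@example.com> 1421167136 +0100\n"
    b"committer A U Thor <author@example.com> 1421167136 +0100\n\n"
    b"bad \xf8 byte\n"
)
message_a = decode_commit_message(with_header)
message_b = decode_commit_message(without_header)
```

The second sample shows the remaining case: with no header and a byte that is not valid UTF-8, `errors='replace'` yields U+FFFD instead of raising.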

nvie commented 8 years ago

I fully support what @andy-maier just said here.

ppietrasa commented 8 years ago

:-( I stumbled upon this one too. I was waiting for the v2.0.8 release in hope of the fix. Please take this one seriously, if possible, for the v2.0.9.

nvie commented 8 years ago

The .decode() call on git/objects/tag.py:56 should get a 'replace' arg to fix this issue. @ppietrasa can you try out that fix and report if that works? If so, can you make a PR for it?

sbenthall commented 5 years ago

I have run into this problem.

This script, which tries to loop through the tags of the nodejs/node repository, exposes this bug:

https://gist.github.com/sbenthall/14c4d14c00876440ba6d0ae62efa432f

Using version 2.1.11

brizjin commented 5 years ago

I have the same issue when reading the branches property. How can I solve it?

ViCrack commented 4 years ago

I have a similar question.

repo = Repo(r'')
print(repo.branches)

Traceback (most recent call last):
  File "D:/Python/src/post.py", line 16, in <module>
    print(repo.branches)
  File "D:\Programs\Python37\lib\site-packages\git\repo\base.py", line 289, in heads
    return Head.list_items(self)
  File "D:\Programs\Python37\lib\site-packages\git\util.py", line 922, in list_items
    out_list.extend(cls.iter_items(repo, *args, **kwargs))
  File "D:\Programs\Python37\lib\site-packages\git\refs\symbolic.py", line 616, in _iter_items
    for _sha, rela_path in cls._iter_packed_refs(repo):
  File "D:\Programs\Python37\lib\site-packages\git\refs\symbolic.py", line 91, in _iter_packed_refs
    for line in fp:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xae in position 538: illegal multibyte sequence

zeze1004 commented 2 years ago

I have a similar question too🥲

pr_repo = g.get_repo(repo_name)

"/Users/mac/.anyenv/envs/pyenv/versions/3.9.9/lib/python3.9/http/client.py", line 1258, in putheader
    values[i] = one_value.encode('latin-1')
UnicodeEncodeError: 'latin-1' codec can't encode character '\u201c' in position 6: ordinal not in range(256)
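One possible reading of this error, offered as a guess: `http.client` encodes header values as Latin-1, and `'\u201c'` is a left curly quote, so a header value (for example an access token pasted with "smart" quotes) may contain typographic quotes. The Python 3 sketch below reproduces that failure mode with made-up values; `check_header_value`, the `"token "` prefix, and the sample token are all hypothetical.

```python
def check_header_value(value: str) -> str:
    """Return value unchanged if it is Latin-1 encodable, else raise ValueError."""
    try:
        value.encode("latin-1")
    except UnicodeEncodeError as exc:
        bad = value[exc.start]
        raise ValueError(
            f"non-latin-1 character {bad!r} at position {exc.start}"
        ) from exc
    return value


# Hypothetical token that was copied together with curly quotes.
token = "\u201cmy-token\u201d"
cleaned = token.strip("\u201c\u201d")

try:
    check_header_value("token " + token)
    problem = None
except ValueError as err:
    problem = str(err)

header_value = check_header_value("token " + cleaned)
```

If this guess is right, the failure comes from the value handed to the HTTP client rather than from the repository data.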

Details

  File "/Users/mac/project/kerraform/./auto_git_api.py", line 80, in pull_request
    pr_repo = g.get_repo(repo_name)
  File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/github/MainClass.py", line 330, in get_repo
    headers, data = self.__requester.requestJsonAndCheck("GET", url)
  File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/github/Requester.py", line 354, in requestJsonAndCheck
    *self.requestJson(
  File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/github/Requester.py", line 454, in requestJson
    return self.__requestEncode(cnx, verb, url, parameters, headers, input, encode)
  File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/github/Requester.py", line 528, in __requestEncode
    status, responseHeaders, output = self.__requestRaw(
  File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/github/Requester.py", line 555, in __requestRaw
    response = cnx.getresponse()
  File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/github/Requester.py", line 127, in getresponse
    r = verb(
  File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/requests/sessions.py", line 542, in get
    return self.request('GET', url, **kwargs)
  File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/requests/sessions.py", line 529, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/requests/sessions.py", line 645, in send
    r = adapter.send(request, **kwargs)
  File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/requests/adapters.py", line 440, in send
    resp = conn.urlopen(
  File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 398, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/urllib3/connection.py", line 239, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/Users/mac/.anyenv/envs/pyenv/versions/3.9.9/lib/python3.9/http/client.py", line 1285, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/Users/mac/.anyenv/envs/pyenv/versions/3.9.9/lib/python3.9/http/client.py", line 1326, in _send_request
    self.putheader(hdr, value)
  File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/urllib3/connection.py", line 224, in putheader
    _HTTPConnection.putheader(self, header, *values)
  File "/Users/mac/.anyenv/envs/pyenv/versions/3.9.9/lib/python3.9/http/client.py", line 1258, in putheader
    values[i] = one_value.encode('latin-1')
UnicodeEncodeError: 'latin-1' codec can't encode character '\u201c' in position 6: ordinal not in range(256)

gitpython-developers / GitPython

Encoding problem #332