Closed jorio closed 1 year ago
If instead of calling str/repr we access the name, then we get an exception:
>>> signature.name
[...]
UnicodeDecodeError: 'gbk' codec can't decode byte 0xad in position 2: illegal multibyte sequence
A similar error is raised if trying to get the message:
>>> commit.message
[...]
UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 22: illegal multibyte sequence
For coherence should not the same error be raised when calling str/repr?
For reference I wrote this test script:
from pathlib import Path
import pygit2
def test_commit(commit_id):
commit = repo.get(commit_id)
print(f'{commit_id} encoding={commit.message_encoding}')
signature = commit.author
try:
print('mail', signature.email)
except UnicodeDecodeError:
print('raw ', signature.raw_mail)
print('mail', signature.raw_mail.decode('utf-8'))
try:
print('name', signature.name)
except UnicodeDecodeError:
print('raw ', signature.raw_name)
print('name', signature.raw_name.decode('utf-8'))
try:
print(commit.message)
except UnicodeDecodeError:
print(commit.raw_message)
print(commit.raw_message.decode('utf-8'))
print('STR ', str(signature))
print('REPR', repr(signature))
print()
if __name__ == '__main__':
name = 'javaWeb-bookManagementSystem'
if Path('javaWeb-bookManagementSystem').exists():
repo = pygit2.Repository(name)
else:
print('Clone repo...')
url = f'https://github.com/LJF2402901363/{name}.git'
repo = pygit2.clone_repository(url, name)
print('OK')
test_commit('7dc18ea6a765f778f136efa87c28eadef583ad60') # no encoding, defaults to utf-8, is utf-8 (OK)
test_commit('3e9d6b6f06d5abc25dd2a5b1b0f9fae10b09c20d') # claims GBK but it's UTF8
test_commit('4d47b509cdb3237cc5ea6841cfd707d7e55f8522') # claims GBK, msg is GBK, sigs are UTF-8
For coherence should not the same error be raised when calling str/repr?
Good point.
My thinking was that repr can be automatically invoked by debuggers and interpreters. In that context, a 'permissive' repr lets you know at a glance if there's anything salvageable in a signature (e.g. email, time) rather than erroring out completely.
Either way, I don't feel strongly about raising an error or not, as long as we stop segfaulting ;)
Back...
Ok, agree with you. There is only a small regression, with the test script above (that I've updated).
In master the output is:
$ python test_issue1205.py
7dc18ea6a765f778f136efa87c28eadef583ad60 encoding=None
mail 58659173+LJF2402901363@users.noreply.github.com
name 陌意随影
Update pom.xml
修复了fastjson版本过低存在的漏洞问题。
STR 陌意随影 <58659173+LJF2402901363@users.noreply.github.com>
REPR pygit2.Signature('陌意随影', '58659173+LJF2402901363@users.noreply.github.com', 1641446748, 480, None)
3e9d6b6f06d5abc25dd2a5b1b0f9fae10b09c20d encoding=GBK
mail 2402901363@qq.com
raw b'\xe4\xbf\xad\xe9\x94\x8b \xe6\xa2\x81'
name 俭锋 梁
b'\xe8\xa7\xa3\xe5\x86\xb3\xe4\xba\x86\xe5\x9c\xa8\xe6\xb7\xbb\xe5\x8a\xa0\xe8\xaf\xbb\xe8\x80\x85\xe5\x92\x8c\xe6\x9b\xb4\xe6\x96\xb0\xe7\x94\xa8\xe6\x88\xb7\xe6\x97\xb6\xe5\x8f\x91\xe7\x94\x9f\xe7\xa9\xba\xe6\x8c\x87\xe9\x92\x88\xe5\xbc\x82\xe5\xb8\xb8\xe7\x9a\x84\xe6\x83\x85\xe5\x86\xb5\n'
解决了在添加读者和更新用户时发生空指针异常的情况
Segmentation fault (core dumped)
With this PR:
$ python test_issue1205.py
7dc18ea6a765f778f136efa87c28eadef583ad60 encoding=None
mail 58659173+LJF2402901363@users.noreply.github.com
name 陌意随影
Update pom.xml
修复了fastjson版本过低存在的漏洞问题。
STR 陌意随影 <58659173+LJF2402901363@users.noreply.github.com>
REPR pygit2.Signature('陌意随影', '58659173+LJF2402901363@users.noreply.github.com', 1641446748, 480, 'None')
3e9d6b6f06d5abc25dd2a5b1b0f9fae10b09c20d encoding=GBK
mail 2402901363@qq.com
raw b'\xe4\xbf\xad\xe9\x94\x8b \xe6\xa2\x81'
name 俭锋 梁
b'\xe8\xa7\xa3\xe5\x86\xb3\xe4\xba\x86\xe5\x9c\xa8\xe6\xb7\xbb\xe5\x8a\xa0\xe8\xaf\xbb\xe8\x80\x85\xe5\x92\x8c\xe6\x9b\xb4\xe6\x96\xb0\xe7\x94\xa8\xe6\x88\xb7\xe6\x97\xb6\xe5\x8f\x91\xe7\x94\x9f\xe7\xa9\xba\xe6\x8c\x87\xe9\x92\x88\xe5\xbc\x82\xe5\xb8\xb8\xe7\x9a\x84\xe6\x83\x85\xe5\x86\xb5\n'
解决了在添加读者和更新用户时发生空指针异常的情况
STR 淇�閿� 姊� <2402901363@qq.com>
REPR pygit2.Signature('淇�閿� 姊�', '2402901363@qq.com', 1608521800, 480, 'GBK')
4d47b509cdb3237cc5ea6841cfd707d7e55f8522 encoding=GBK
mail 2402901363@qq.com
raw b'\xe4\xbf\xad\xe9\x94\x8b \xe6\xa2\x81'
name 俭锋 梁
更新了启动项目页面直接跳转到首页不用再自动输入对应的网址。
STR 淇�閿� 姊� <2402901363@qq.com>
REPR pygit2.Signature('淇�閿� 姊�', '2402901363@qq.com', 1603540603, 480, 'GBK')
The regression is in the representation for the first commit, the one that is correct.
The encoding appears as 'None'
instead of None
.
Thanks @jorio for another contribution
Sorry @jdavid, I had been meaning to take a look at 'None'
but I haven't been able to make time for it before you fixed it up. Thank you!
Please see issue #1205 for details about this bug.
The proposed fix avoids undefined behavior by never passing NULL to PyUnicode_FromFormat.