Fix segfault in Signature.__repr__ & __str__ when the signature's encoding is incorrect or unknown (#1205)

jorio commented 1 year ago

Please see issue #1205 for details about this bug.

The proposed fix avoids undefined behavior by never passing NULL to PyUnicode_FromFormat.

jdavid commented 1 year ago

If instead of calling str/repr we access the name, then we get an exception:

>>> signature.name
[...]
UnicodeDecodeError: 'gbk' codec can't decode byte 0xad in position 2: illegal multibyte sequence

A similar error is raised if trying to get the message:

>>> commit.message
[...]
UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 22: illegal multibyte sequence

For coherence should not the same error be raised when calling str/repr?

jdavid commented 1 year ago

For reference I wrote this test script:

from pathlib import Path
import pygit2

def test_commit(commit_id):
    commit = repo.get(commit_id)
    print(f'{commit_id} encoding={commit.message_encoding}')

    signature = commit.author

    try:
        print('mail', signature.email)
    except UnicodeDecodeError:
        print('raw ', signature.raw_mail)
        print('mail', signature.raw_mail.decode('utf-8'))

    try:
        print('name', signature.name)
    except UnicodeDecodeError:
        print('raw ', signature.raw_name)
        print('name', signature.raw_name.decode('utf-8'))

    try:
        print(commit.message)
    except UnicodeDecodeError:
        print(commit.raw_message)
        print(commit.raw_message.decode('utf-8'))

    print('STR ', str(signature))
    print('REPR', repr(signature))
    print()

if __name__ == '__main__':
    name = 'javaWeb-bookManagementSystem'
    if Path('javaWeb-bookManagementSystem').exists():
        repo = pygit2.Repository(name)
    else:
        print('Clone repo...')
        url = f'https://github.com/LJF2402901363/{name}.git'
        repo = pygit2.clone_repository(url, name)
        print('OK')

    test_commit('7dc18ea6a765f778f136efa87c28eadef583ad60')  # no encoding, defaults to utf-8, is utf-8 (OK)
    test_commit('3e9d6b6f06d5abc25dd2a5b1b0f9fae10b09c20d')  # claims GBK but it's UTF8
    test_commit('4d47b509cdb3237cc5ea6841cfd707d7e55f8522')  # claims GBK, msg is GBK, sigs are UTF-8

jorio commented 1 year ago

For coherence should not the same error be raised when calling str/repr?

Good point.

My thinking was that repr can be automatically invoked by debuggers and interpreters. In that context, a 'permissive' repr lets you know at a glance if there's anything salvageable in a signature (e.g. email, time) rather than erroring out completely.

Either way, I don't feel strongly about raising an error or not, as long as we stop segfaulting ;)

jdavid commented 1 year ago

Back...

Ok, agree with you. There is only a small regression, with the test script above (that I've updated).

In master the output is:

$ python test_issue1205.py 
7dc18ea6a765f778f136efa87c28eadef583ad60 encoding=None
mail 58659173+LJF2402901363@users.noreply.github.com
name 陌意随影
Update pom.xml

修复了fastjson版本过低存在的漏洞问题。
STR  陌意随影 <58659173+LJF2402901363@users.noreply.github.com>
REPR pygit2.Signature('陌意随影', '58659173+LJF2402901363@users.noreply.github.com', 1641446748, 480, None)

3e9d6b6f06d5abc25dd2a5b1b0f9fae10b09c20d encoding=GBK
mail 2402901363@qq.com
raw  b'\xe4\xbf\xad\xe9\x94\x8b \xe6\xa2\x81'
name 俭锋 梁
b'\xe8\xa7\xa3\xe5\x86\xb3\xe4\xba\x86\xe5\x9c\xa8\xe6\xb7\xbb\xe5\x8a\xa0\xe8\xaf\xbb\xe8\x80\x85\xe5\x92\x8c\xe6\x9b\xb4\xe6\x96\xb0\xe7\x94\xa8\xe6\x88\xb7\xe6\x97\xb6\xe5\x8f\x91\xe7\x94\x9f\xe7\xa9\xba\xe6\x8c\x87\xe9\x92\x88\xe5\xbc\x82\xe5\xb8\xb8\xe7\x9a\x84\xe6\x83\x85\xe5\x86\xb5\n'
解决了在添加读者和更新用户时发生空指针异常的情况

Segmentation fault (core dumped)

With this PR:

$ python test_issue1205.py
7dc18ea6a765f778f136efa87c28eadef583ad60 encoding=None
mail 58659173+LJF2402901363@users.noreply.github.com
name 陌意随影
Update pom.xml

修复了fastjson版本过低存在的漏洞问题。
STR  陌意随影 <58659173+LJF2402901363@users.noreply.github.com>
REPR pygit2.Signature('陌意随影', '58659173+LJF2402901363@users.noreply.github.com', 1641446748, 480, 'None')

3e9d6b6f06d5abc25dd2a5b1b0f9fae10b09c20d encoding=GBK
mail 2402901363@qq.com
raw  b'\xe4\xbf\xad\xe9\x94\x8b \xe6\xa2\x81'
name 俭锋 梁
b'\xe8\xa7\xa3\xe5\x86\xb3\xe4\xba\x86\xe5\x9c\xa8\xe6\xb7\xbb\xe5\x8a\xa0\xe8\xaf\xbb\xe8\x80\x85\xe5\x92\x8c\xe6\x9b\xb4\xe6\x96\xb0\xe7\x94\xa8\xe6\x88\xb7\xe6\x97\xb6\xe5\x8f\x91\xe7\x94\x9f\xe7\xa9\xba\xe6\x8c\x87\xe9\x92\x88\xe5\xbc\x82\xe5\xb8\xb8\xe7\x9a\x84\xe6\x83\x85\xe5\x86\xb5\n'
解决了在添加读者和更新用户时发生空指针异常的情况

STR  淇�閿� 姊� <2402901363@qq.com>
REPR pygit2.Signature('淇�閿� 姊�', '2402901363@qq.com', 1608521800, 480, 'GBK')

4d47b509cdb3237cc5ea6841cfd707d7e55f8522 encoding=GBK
mail 2402901363@qq.com
raw  b'\xe4\xbf\xad\xe9\x94\x8b \xe6\xa2\x81'
name 俭锋 梁
更新了启动项目页面直接跳转到首页不用再自动输入对应的网址。

STR  淇�閿� 姊� <2402901363@qq.com>
REPR pygit2.Signature('淇�閿� 姊�', '2402901363@qq.com', 1603540603, 480, 'GBK')

The regression is in the representation for the first commit, the one that is correct. The encoding appears as 'None' instead of None.

jdavid commented 1 year ago

Thanks @jorio for another contribution

jorio commented 1 year ago

Sorry @jdavid, I had been meaning to take a look at 'None' but I haven't been able to make time for it before you fixed it up. Thank you!

libgit2 / pygit2

Fix segfault in Signature.repr & str when the signature's encoding is incorrect or unknown (#1205) #1210

libgit2 / pygit2

Fix segfault in Signature.__repr__ & __str__ when the signature's encoding is incorrect or unknown (#1205) #1210

Fix segfault in Signature.repr & str when the signature's encoding is incorrect or unknown (#1205) #1210