python / cpython


socket.getfqdn() UnicodeDecodeError depending on LANG variable #93251

Open cpina opened 2 years ago

cpina commented 2 years ago

Bug report

This code:

import locale
import socket

locale.setlocale(locale.LC_ALL, '')

socket.getfqdn()

raises an exception when run like this:

LANG=ru_RU.CP1251 /opt/Python-3.9.2/bin/python3 bug.py

Note the LANG value. I haven't checked which other LANG values trigger the failure and which do not.

Warning: to exercise the problematic code path (see the comments for details), the hostname must not be resolvable: not listed in /etc/hosts, and not resolvable via DNS or any other method configured in the hosts setting of /etc/nsswitch.conf. To reproduce the problem on Linux, the hostname can be changed with sudo hostname something-that-does-not-exist.

Traceback (most recent call last):
  File "/root/t/prova.py", line 7, in <module>
    socket.getfqdn()
  File "/opt/Python-3.9.2/lib/python3.9/socket.py", line 791, in getfqdn
    hostname, aliases, ipaddrs = gethostbyaddr(name)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcd in position 0: invalid continuation byte

Your environment

Tested this on Debian 11 (bullseye) with the following Python interpreters:

I've encountered this bug in two independent Debian installations (with different locale settings) and in a CI system (also Debian-based, but with unrelated settings).

Only tested on x64 systems.

cpina commented 2 years ago

In case it helps, this is the stack trace just before execution hits the line:

errmsg = "invalid continuation byte";

in the function unicode_decode_utf8 in Objects/unicodeobject.c.

Backtrace:

#0  unicode_decode_utf8 (s=0x555555a2e8e0 "����������� ��� ��� ������", size=26, error_handler=_Py_ERROR_UNKNOWN, errors=0x0, consumed=0x0)
    at Objects/unicodeobject.c:5069
#1  0x00005555556348c4 in PyUnicode_DecodeUTF8Stateful (s=0x555555a2e8e0 "����������� ��� ��� ������", size=26, errors=0x0, consumed=0x0)
    at Objects/unicodeobject.c:5141
#2  0x0000555555629dae in PyUnicode_FromStringAndSize (u=0x555555a2e8e0 "����������� ��� ��� ������", size=26) at Objects/unicodeobject.c:2267
#3  0x00005555556a0064 in do_mkvalue (p_format=0x7fffffff73b8, p_va=0x7fffffff73a0, flags=1) at Python/modsupport.c:423
#4  0x000055555569f5cd in do_mktuple (p_format=0x7fffffff73b8, p_va=0x7fffffff73a0, endchar=41 ')', n=2, flags=1) at Python/modsupport.c:264
#5  0x000055555569f737 in do_mkvalue (p_format=0x7fffffff73b8, p_va=0x7fffffff73a0, flags=1) at Python/modsupport.c:289
#6  0x00005555556a06ac in va_build_value (format=0x7ffff79bf942 "(is)", va=0x7fffffff73f0, flags=1) at Python/modsupport.c:562
#7  0x00005555556a05b0 in _Py_BuildValue_SizeT (format=0x7ffff79bf942 "(is)") at Python/modsupport.c:530
#8  0x00007ffff79b3a91 in set_gaierror (error=-2) at /root/python/Python-3.9.2/Modules/socketmodule.c:680
#9  0x00007ffff79b43b2 in setipaddr (name=0x7ffff7b6bb90 "reprotest-capture-hostname", addr_ret=0x7fffffffb600, addr_ret_size=128, af=0)
    at /root/python/Python-3.9.2/Modules/socketmodule.c:1211
#10 0x00007ffff79bada7 in socket_gethostbyaddr (self=0x7ffff79de220, args=0x7ffff7b64940) at /root/python/Python-3.9.2/Modules/socketmodule.c:5822

Ignore the line numbers: I had added debug output to some of the files.

I wonder (though I cannot reproduce it outside Python) whether the way set_gaierror builds its result is what makes the error depend on the locale settings.
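For reference, the "(is)" format in frame #6 of the backtrace builds an (int, str) tuple, which becomes the arguments of socket.gaierror. A minimal sketch of the exception a caller would normally see, using only values from this report (error=-2 from the backtrace and the unlocalised message text):

```python
import socket

# Values taken from this report: error=-2 in the backtrace and the
# unlocalised message that gai_strerror prints.
exc = socket.gaierror(-2, "Name or service not known")

# gaierror is an OSError subclass, so the two arguments are exposed
# as errno and strerror.
print(exc.args)
print(exc.errno)
print(exc.strerror)
```

When the message cannot be turned into a str, this tuple is never built, so the caller sees the UnicodeDecodeError instead of a socket error.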

cpina commented 2 years ago

If it helps, gai_strerror is called (in set_gaierror) and might return a localised error:

root@reprotest-capture-hostname:~/t# cat bug.py 

import locale
import socket

locale.setlocale(locale.LC_ALL, '')

print('test')

socket.getfqdn()
root@reprotest-capture-hostname:~/t# ./a.out 
test
gai_strerror: Name or service not known
root@reprotest-capture-hostname:~/t# LANG=ru_RU.CP1251 ./a.out 
test
gai_strerror: ����������� ��� ��� ������
root@reprotest-capture-hostname:~/t# 

In set_gaierror there is:

    v = Py_BuildValue("(is)", error, gai_strerror(error));

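With the Russian locale (and, I suspect, other locales), it seems that Py_BuildValue, which goes through PyUnicode_FromStringAndSize (see the backtrace above), cannot create the PyUnicode object from the localised message, so the call fails with the UnicodeDecodeError shown in the original post.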
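The decoding failure can be sketched in pure Python, without touching the hostname. This assumes the Russian glibc message for this error is the text below; the assumption is consistent with size=26 and the first byte 0xcd in the backtrace, but the message itself is not quoted anywhere in this thread.

```python
# Assumption: this is the localised message for error -2. It is 26
# characters long, matching size=26 in the backtrace.
message = "Неизвестное имя или служба"

# Under LANG=ru_RU.CP1251 the C library returns the text as CP1251 bytes.
raw = message.encode("cp1251")

try:
    # The "s" format unit decodes the C string as UTF-8, like this call.
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    error_text = str(exc)
    print(error_text)

# Decoding with the encoding the bytes were written in works.
print(raw.decode("cp1251"))
```

The first printed line should match the last line of the traceback in the original post.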

Hopefully this helps to find the error.

goeranu commented 4 weeks ago

The problem is still present in the Python 3.13 prerelease that currently ships with Fedora 41 beta.

We just hit this using the Latin-1 Swedish locale sv_SE and a call to socket.gethostbyaddr with an argument for which the lookup returns an error code. That results in a UnicodeDecodeError exception rather than the expected socket.herror exception.
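Until this is fixed, a caller that only needs to know that the lookup failed could catch both exception types. A minimal workaround sketch; gethostbyaddr_or_none is a hypothetical helper name, not part of the socket module:

```python
import socket


def gethostbyaddr_or_none(host):
    """Hypothetical helper: return the lookup result, or None on failure."""
    try:
        return socket.gethostbyaddr(host)
    except (OSError, UnicodeDecodeError):
        # OSError covers socket.herror and socket.gaierror.
        # UnicodeDecodeError covers the case from this report, where the
        # localised error message cannot be decoded as UTF-8.
        return None
```

This hides the error code and message, so it only suits code that treats every failed lookup the same way.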

Our analysis came to the same conclusion as above, with one added detail: according to the documentation, the s format unit does indeed interpret the C string as UTF-8, and that is not what gai_strerror returns under these locales.
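One possible direction is to decode the message with the locale's encoding rather than UTF-8. The sketch below shows the idea in Python; decode_locale_message is a hypothetical helper, and this is not a patch for set_gaierror. On the C side, CPython's API has PyUnicode_DecodeLocale for this kind of input, but whether that is the right fix is not settled in this thread.

```python
import locale


def decode_locale_message(raw, encoding=None):
    """Hypothetical helper: decode a C library message without raising."""
    if encoding is None:
        # Encoding of the current locale. Passing False means the call
        # does not invoke setlocale() as a side effect.
        encoding = locale.getpreferredencoding(False)
    # surrogateescape maps undecodable bytes to lone surrogates instead
    # of raising UnicodeDecodeError, so the caller still gets a str.
    return raw.decode(encoding, errors="surrogateescape")
```

With the right encoding the text is recovered exactly. With the wrong one the result is unreadable, but the original bytes can still be recovered by encoding again with the same error handler.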