python / cpython

socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names #53623

Open 50cef828-0e94-47a2-843d-cb13c9fb9120 opened 14 years ago

50cef828-0e94-47a2-843d-cb13c9fb9120 commented 14 years ago
BPO 9377
Nosy @malemburg, @loewis, @amauryfa, @vstinner, @ezio-melotti, @bitdancer, @zooba
Files
  • ascii-surrogateescape.diff: Decode hostnames as ASCII/surrogateescape rather than UTF-8
  • try-surrogateescape-first.diff: Accept ASCII/surrogateescape strings as hostname arguments
  • uname-surrogateescape.diff: In posix.uname(), decode nodename as ASCII/surrogateescape
  • ascii-surrogateescape-2.diff: Renamed unicode_from_hostname -> decode_hostname
  • try-surrogateescape-first-2.diff: Made various small changes
  • try-surrogateescape-first-3.diff: Fixed a couple of mistakes
  • try-surrogateescape-first-4.diff
  • try-surrogateescape-first-getnameinfo-4.diff
  • decode-strict-ascii.diff: Decode hostnames strictly as ASCII
  • hostname-bytes-apis.diff: Add name resolution APIs that return names as bytes (applies on top of decode-strict-ascii.diff)
  • return-ascii-surrogateescape-2015-06-25.diff
  • accept-ascii-surrogateescape-2015-06-25.diff
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:
    ```python
    assignee = None
    closed_at = None
    created_at =
    labels = ['extension-modules', 'type-bug']
    title = 'socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names'
    updated_at =
    user = 'https://bugs.python.org/baikie'
    ```
    bugs.python.org fields:
    ```python
    activity =
    actor = 'vstinner'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Extension Modules']
    creation =
    creator = 'baikie'
    dependencies = []
    files = ['18195', '18196', '18259', '18272', '18273', '18609', '18616', '18617', '18674', '18676', '39812', '39813']
    hgrepos = []
    issue_num = 9377
    keywords = ['patch']
    message_count = 52.0
    messages = ['111550', '111766', '111985', '112094', '114688', '114710', '114754', '114756', '114847', '114882', '115014', '115030', '115116', '115119', '115185', '115186', '115187', '118582', '118602', '118617', '118694', '118709', '118816', '118952', '119051', '119076', '119177', '119230', '119231', '119245', '119260', '119271', '119346', '119837', '119918', '119925', '119927', '119928', '119929', '119935', '119941', '119943', '119946', '120081', '158118', '158165', '158175', '158178', '159776', '243311', '245826', '259079']
    nosy_count = 11.0
    nosy_names = ['lemburg', 'loewis', 'amaury.forgeotdarc', 'vstinner', 'baikie', 'ezio.melotti', 'r.david.murray', 'jesterKing', 'spaun2002', 'steve.dower', 'Almad']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = None
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue9377'
    versions = ['Python 3.2']
    ```

    50cef828-0e94-47a2-843d-cb13c9fb9120 commented 9 years ago

    I've updated the ASCII/surrogateescape patches in line with various changes to Python since I posted them.

    return-ascii-surrogateescape-2015-06-25.diff incorporates the ascii-surrogateescape and uname-surrogateescape patches, and accept-ascii-surrogateescape-2015-06-25.diff corresponds to the try-surrogateescape-first patch. Neither patch touches gethostname() on Windows.
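For readers unfamiliar with the ASCII/surrogateescape scheme from PEP 383 that these patches rely on, here is a minimal pure-Python sketch of the round trip. The real patches work in C inside socketmodule.c; the helper names `decode_hostname_sketch` and `encode_hostname_sketch` and the sample bytes are hypothetical, chosen only for illustration.

```python
# Illustrative only: the patches do this in C; these helper names are hypothetical.

def decode_hostname_sketch(raw: bytes) -> str:
    # ASCII bytes decode normally; each non-ASCII byte 0xNN becomes the
    # lone surrogate U+DCNN instead of raising UnicodeDecodeError.
    return raw.decode("ascii", "surrogateescape")


def encode_hostname_sketch(name: str) -> bytes:
    # Reverses the mapping: U+DCNN turns back into the original byte 0xNN.
    return name.encode("ascii", "surrogateescape")


# A made-up hostname containing one Latin-1 byte (0xE9).
raw = b"caf\xe9.example"
name = decode_hostname_sketch(raw)
round_tripped = encode_hostname_sketch(name)
```

The point of the scheme is that `round_tripped` equals `raw` byte for byte, so a name returned by one resolver call can be passed back to another without loss, whatever encoding the bytes were really in.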

    Python's existing code now has a fast path for ASCII-only strings which passes them through unchanged (Unicode -> ASCII), so in order not to slow down processing of valid IDNs, the latter patch now effectively tries encodings in the order

    ASCII/strict (existing code, fast path)
    IDNA/strict (existing code)
    ASCII/surrogateescape (added by patch)

    rather than the previous

    ASCII/surrogateescape
    IDNA/strict

    This doesn't change the behaviour of the patch, since IDNA always rejects strings containing surrogate codes, and either rejects ASCII-only strings (e.g. when a label is longer than 63 characters) or passes them through unchanged.
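The encoding order described above can be approximated in pure Python as follows. This is a sketch under stated assumptions, not the patch itself: the actual logic lives in C, and the function name `encode_hostname_arg` and the sample hostnames are hypothetical.

```python
# Sketch of the three-step order; the function name is hypothetical.

def encode_hostname_arg(name: str) -> bytes:
    # 1. ASCII/strict fast path: ASCII-only strings pass through unchanged.
    try:
        return name.encode("ascii")
    except UnicodeEncodeError:
        pass
    # 2. IDNA/strict: valid internationalized names become "xn--" labels.
    #    Strings containing surrogate codes are rejected with a UnicodeError.
    try:
        return name.encode("idna")
    except UnicodeError:
        pass
    # 3. ASCII/surrogateescape: recover the raw bytes of a name that was
    #    earlier decoded with the same error handler.
    return name.encode("ascii", "surrogateescape")


plain = encode_hostname_arg("example.com")
idn = encode_hostname_arg("b\u00fccher.example")
escaped = encode_hostname_arg("host\udcff.example")
```

Because step 2 never accepts a string containing surrogates, moving ASCII/surrogateescape from first to last place does not change which bytes are produced for any given input; it only avoids an extra encoding attempt for valid IDNs.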

    These patches would at least allow getfqdn() to work in Almad's example. In that case, however, the host also appears to be addressable by the IDNA equivalent ("xn--didejo-noas-1ic") of its Unicode hostname. (I haven't checked, as I'm not a Windows user, but I presume the UnicodeDecodeError came from gethost_common() in socketmodule.c and hence that the name lookup was successful.) It would therefore be more helpful to return Unicode for non-ASCII gethostbyaddr() results there, provided they were guaranteed to map to real IDNA hostnames in Windows environments.

    (That isn't guaranteed in Unix environments of course, which is why I'm still suggesting ASCII/surrogateescape for the general case.)
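To make the "IDNA equivalent" idea concrete, the sketch below uses Python's built-in `idna` codec to map between a Unicode hostname and its ASCII form. The hostname `bücher.example` is an illustrative stand-in, not the host from Almad's report.

```python
# Illustrative hostname only; not the one from the report discussed above.
unicode_name = "b\u00fccher.example"

# Unicode -> ASCII-compatible form: each non-ASCII label gets an "xn--" prefix.
ascii_form = unicode_name.encode("idna")

# ASCII-compatible form -> Unicode, as a resolver result could be presented
# if the name were guaranteed to be a real IDNA hostname.
decoded_again = ascii_form.decode("idna")
```

Returning Unicode in this way is only safe when the bytes from the resolver really are an IDNA name or map to one; arbitrary non-ASCII bytes have no such mapping, which is the case ASCII/surrogateescape is meant to cover.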

    vstinner commented 8 years ago

    FYI, I created issue bpo-26227 to change the encoding used to decode hostnames on Windows. UTF-8 doesn't seem to be the right encoding: it fails on non-ASCII hostnames. I propose using the ANSI code page instead.

    Sorry, I didn't read this issue, but it looks like IDNA isn't the right encoding to decode hostnames *on Windows*.
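A small sketch of why UTF-8 can fail on such hostnames while an ANSI code page succeeds. It assumes, purely for illustration, that the ANSI code page is cp1252 and that the resolver returned the made-up bytes below; on a real Windows system the active code page varies by locale, and the helper `try_decode` is hypothetical.

```python
# Assumption: cp1252 stands in for "the ANSI code page"; the bytes are made up.
raw_hostname = b"caf\xe9-pc"


def try_decode(raw: bytes, encoding: str):
    # Return the decoded string, or None if the bytes are invalid in `encoding`.
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# 0xE9 followed by "-" is not a valid UTF-8 sequence, so strict decoding fails.
utf8_result = try_decode(raw_hostname, "utf-8")

# The same byte is "é" in cp1252, so decoding succeeds.
ansi_result = try_decode(raw_hostname, "cp1252")
```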