Open 50cef828-0e94-47a2-843d-cb13c9fb9120 opened 14 years ago
I've updated the ASCII/surrogateescape patches in line with various changes to Python since I posted them.
return-ascii-surrogateescape-2015-06-25.diff incorporates the ascii-surrogateescape and uname-surrogateescape patches, and accept-ascii-surrogateescape-2015-06-25.diff corresponds to the try-surrogateescape-first patch. Neither patch touches gethostname() on Windows.
Python's existing code now has a fast path for ASCII-only strings which passes them through unchanged (Unicode -> ASCII), so in order not to slow down processing of valid IDNs, the latter patch now effectively tries encodings in the order
1. ASCII/strict (existing code, fast path)
2. IDNA/strict (existing code)
3. ASCII/surrogateescape (added by patch)
rather than the previous
1. ASCII/surrogateescape
2. IDNA/strict
This doesn't change the behaviour of the patch, since IDNA always rejects strings containing surrogate codes, and either rejects ASCII-only strings (e.g. when a label is longer than 63 characters) or passes them through unchanged.
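As a rough illustration of the ordering described above, here is a minimal pure-Python sketch. The real logic lives in C in socketmodule.c; the function name `encode_hostname` is hypothetical and is not part of either patch.

```python
def encode_hostname(name: str) -> bytes:
    """Hypothetical helper mirroring the order the patch tries encodings in."""
    # 1. ASCII/strict: fast path, ASCII-only names pass through unchanged.
    try:
        return name.encode("ascii")
    except UnicodeEncodeError:
        pass
    # 2. IDNA/strict: valid internationalized names become "xn--" labels.
    #    IDNA rejects strings containing surrogate codes with a UnicodeError.
    try:
        return name.encode("idna")
    except UnicodeError:
        pass
    # 3. ASCII/surrogateescape: lone surrogates U+DC80..U+DCFF map back to
    #    the original bytes 0x80..0xFF (PEP 383).
    return name.encode("ascii", "surrogateescape")


ascii_name = encode_hostname("example.com")
idna_name = encode_hostname("b\u00fccher.example")
escaped_name = encode_hostname("host\udcff")
```

Because a string containing surrogate codes fails both of the first two steps, it only ever reaches the third, which is why moving ASCII/surrogateescape to the end should not change the result.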
These patches would at least allow getfqdn() to work in Almad's example. In that case, though, the host also appears to be addressable by the IDNA equivalent ("xn--didejo-noas-1ic") of its Unicode hostname. (I haven't checked, as I'm not a Windows user, but I presume the UnicodeDecodeError came from gethost_common() in socketmodule.c, and hence that the name lookup was successful.) So it would certainly be more helpful to return Unicode for non-ASCII gethostbyaddr() results there, if they were guaranteed to map to real IDNA hostnames in Windows environments.
(That isn't guaranteed in Unix environments of course, which is why I'm still suggesting ASCII/surrogateescape for the general case.)
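For the returning direction, the ASCII/surrogateescape decoding suggested for the general case can be sketched in Python as follows. The helper name `decode_hostname` is hypothetical; the patches do this in C inside the socket module.

```python
def decode_hostname(raw: bytes) -> str:
    """Hypothetical helper: decode a hostname as ASCII with surrogateescape."""
    # Bytes 0x00..0x7F decode as ASCII; each byte 0x80..0xFF becomes a lone
    # surrogate U+DC80..U+DCFF, so no byte is lost or misinterpreted.
    return raw.decode("ascii", "surrogateescape")


raw_name = b"host\xff.example"
text_name = decode_hostname(raw_name)
# Encoding with the same error handler recovers the original bytes exactly.
round_tripped = text_name.encode("ascii", "surrogateescape")
```

The point of this choice is that the round trip is lossless whatever encoding the resolver actually used, so a name returned by one call can be passed back to another unchanged.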
FYI I created the issue bpo-26227 to change the encoding used to decode hostnames on Windows. UTF-8 doesn't seem to be the right encoding: it fails on non-ASCII hostnames. I propose to use the ANSI code page.
Sorry, I didn't read this issue, but it looks like IDNA isn't the right encoding to decode hostnames *on Windows*.
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['extension-modules', 'type-bug']
title = 'socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names'
updated_at =
user = 'https://bugs.python.org/baikie'
```
bugs.python.org fields:
```python
activity =
actor = 'vstinner'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Extension Modules']
creation =
creator = 'baikie'
dependencies = []
files = ['18195', '18196', '18259', '18272', '18273', '18609', '18616', '18617', '18674', '18676', '39812', '39813']
hgrepos = []
issue_num = 9377
keywords = ['patch']
message_count = 52.0
messages = ['111550', '111766', '111985', '112094', '114688', '114710', '114754', '114756', '114847', '114882', '115014', '115030', '115116', '115119', '115185', '115186', '115187', '118582', '118602', '118617', '118694', '118709', '118816', '118952', '119051', '119076', '119177', '119230', '119231', '119245', '119260', '119271', '119346', '119837', '119918', '119925', '119927', '119928', '119929', '119935', '119941', '119943', '119946', '120081', '158118', '158165', '158175', '158178', '159776', '243311', '245826', '259079']
nosy_count = 11.0
nosy_names = ['lemburg', 'loewis', 'amaury.forgeotdarc', 'vstinner', 'baikie', 'ezio.melotti', 'r.david.murray', 'jesterKing', 'spaun2002', 'steve.dower', 'Almad']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue9377'
versions = ['Python 3.2']
```