Open 50cef828-0e94-47a2-843d-cb13c9fb9120 opened 14 years ago
The functions in the socket module which return host/domain names, such as gethostbyaddr() and getnameinfo(), are wrappers around byte-oriented interfaces but return Unicode strings in 3.x, and have not been updated to deal with undecodable byte sequences in the results, as discussed in PEP-383.
Some DNS resolvers do discard hostnames not matching the ASCII-only RFC 1123 syntax, but checks for this may be absent or turned off, and non-ASCII bytes can be returned via other lookup facilities such as /etc/hosts.
Currently, names are converted to str objects using PyUnicode_FromString(), i.e. by attempting to decode them as UTF-8. This can fail with UnicodeError of course, but even if it succeeds, any non-ASCII names returned will fail to round-trip correctly because most socket functions encode string arguments into IDNA ASCII-compatible form before using them. For example, with UTF-8 encoded entries
127.0.0.2 €
127.0.0.3 xn--lzg
in /etc/hosts, I get:
Python 3.1.2 (r312:79147, Mar 23 2010, 19:02:21)
[GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from socket import *
>>> getnameinfo(("127.0.0.2", 0), 0)
('€', '0')
>>> getaddrinfo(*_)
[(2, 1, 6, '', ('127.0.0.3', 0)), (2, 2, 17, '', ('127.0.0.3', 0)), (2, 3, 0, '', ('127.0.0.3', 0))]
Here, getaddrinfo() has encoded "€" to its corresponding ACE label "xn--lzg", which maps to a different address.
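The mismatch can be reproduced without touching the resolver. The sketch below assumes only the standard "utf-8" and "idna" codecs and mirrors what happens when a name is decoded one way on return and encoded another way on the next call; the variable names are illustrative.

```python
# Bytes as they would appear in /etc/hosts for the name "€" (UTF-8 encoded).
raw = "€".encode("utf-8")

# What the lookup functions currently do on return: decode as UTF-8.
name = raw.decode("utf-8")

# What most socket functions do with a str argument: encode with IDNA.
sent = name.encode("idna")

print(raw)   # the bytes that came from the host table
print(sent)  # the bytes that would be passed to the next lookup
```

The two byte strings differ, so the second lookup asks for a different name than the one that was returned.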
PEP-383 can't be applied as-is here, since if the name happened to be decodable in the file system encoding (and thus was returned as valid non-ASCII Unicode), the result would fail to round-trip correctly as shown above, but I think there is a solution which follows the general idea of PEP-383.
Surrogate characters are not allowed in IDNs, since they are prohibited by Nameprep[1][2], so if names were instead decoded as ASCII with the surrogateescape error handler, strings representing non-ASCII names would always contain surrogate characters representing the non-ASCII bytes, and would therefore fail to encode with the IDNA codec. Thus there would be no ambiguity between these strings and valid IDNs. The attached ascii-surrogateescape.diff does this.
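As a rough Python-level illustration of the decoding step (the patch itself is C code; the helper name below is hypothetical), ASCII names come back unchanged while non-ASCII bytes turn into lone surrogates that the IDNA codec rejects:

```python
def decode_hostname(raw: bytes) -> str:
    """Decode a resolver result; non-ASCII bytes become U+DC80..U+DCFF."""
    return raw.decode("ascii", "surrogateescape")


plain = decode_hostname(b"xn--lzg")          # ASCII name, returned as-is
escaped = decode_hostname(b"\xe2\x82\xac")   # UTF-8 bytes of "€"

# Lone surrogates are prohibited by Nameprep, so IDNA encoding must fail.
try:
    escaped.encode("idna")
    idna_accepts = True
except UnicodeError:
    idna_accepts = False

print(repr(plain), repr(escaped), idna_accepts)
```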
The returned strings could then be made to round-trip as arguments, by having functions that take hostname arguments attempt to encode them using ASCII/surrogateescape first before trying IDNA encoding. Since IDNA leaves ASCII names unchanged and surrogate characters are not allowed in IDNs, this would not change the interpretation of any string hostnames that are currently accepted. It would remove the 63-octet limit on label length currently imposed due to the IDNA encoding, for ASCII names only, but since this is imposed due to the 63-octet limit of the DNS, and non-IDN names may be intended for other resolution mechanisms, I think this is a feature, not a bug :)
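A minimal sketch of that argument conversion, written in Python rather than C and using a hypothetical helper name, might look like this. It assumes only the built-in "ascii" and "idna" codecs:

```python
def hostname_to_bytes(name):
    """Convert a hostname argument to the bytes handed to the C API."""
    if isinstance(name, bytes):
        # Bytes arguments are passed through untouched.
        return name
    try:
        # Succeeds for pure ASCII and for surrogate-escaped bytes.
        return name.encode("ascii", "surrogateescape")
    except UnicodeEncodeError:
        # Genuine non-ASCII Unicode: fall back to IDNA as before.
        return name.encode("idna")


print(hostname_to_bytes("example.com"))
print(hostname_to_bytes("€"))
print(hostname_to_bytes("\udce2\udc82\udcac"))
```

Because the ASCII step comes first, an ASCII label longer than 63 octets is passed through instead of being rejected by the IDNA codec, which is the behaviour change described above.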
The patch try-surrogateescape-first.diff implements the above for all relevant interfaces, including gethostbyaddr() and getnameinfo(), which do currently accept hostnames, even if the documentation is vague (in the standard library, socket.getfqdn() calls gethostbyaddr() with a hostname, and the "os" module docs suggest calling socket.gethostbyaddr(socket.gethostname()) to get the fully-qualified hostname).
The patch still allows hostnames to be passed as bytes objects, but to simplify the implementation, it removes support for bytearray (as has been done for pathnames in 3.2). Bytearrays are currently only accepted by the socket object methods (.connect(), etc.), and this is undocumented and perhaps unintentional - the get*() functions have never accepted them.
One problem with the surrogateescape scheme would be with existing code that looks up an address and then tries to write the hostname to a log file or use it as part of the wire protocol, since the surrogate characters would fail to encode as ASCII or UTF-8, but the code would appear to work normally until it encountered a non-ASCII hostname, allowing the problem to go undetected.
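To make the failure mode concrete, here is a small sketch of what such logging code would run into, together with one possible way to produce a loggable form. The use of "backslashreplace" is my own suggestion for illustration, not something the patches do:

```python
# A name as it would be returned under the ASCII/surrogateescape scheme.
name = b"\xe2\x82\xac".decode("ascii", "surrogateescape")

# Encoding for a log file or wire protocol fails on the lone surrogates.
try:
    name.encode("utf-8")
    utf8_ok = True
except UnicodeEncodeError:
    utf8_ok = False

# One possible workaround: escape the surrogates before writing.
loggable = name.encode("ascii", "backslashreplace").decode("ascii")

print(utf8_ok, loggable)
```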
On the other hand, such code is probably broken as things stand, given that the address lookup functions can undocumentedly raise UnicodeError in the same situation. Also, protocol definitions often specify some variant of the RFC 1123 syntax for hostnames (thus making non-ASCII bytes illegal), so code that checked for this prior to encoding the name would probably be OK, but it's more likely the exception than the rule.
An alternative approach might be to return all hostnames as bytes objects, thus breaking everything immediately and obviously...
[1] http://tools.ietf.org/html/rfc3491#section-5
[2] http://tools.ietf.org/html/rfc3454#appendix-C.5
I like the idea of using the PEP-383 for hostnames, but I don't understand the relation with IDNA (maybe because I don't know this encoding).
+this leaves IDNA ASCII-compatible encodings in ASCII
+form, but converts any non-ASCII bytes in the hostname to the Unicode
+lone surrogate codes U+DC80...U+DCFF.
What is an "IDNA ASCII-compatible encoding"?
--
ascii-surrogateescape.diff:
try-surrogateescape-first.diff:
"Leaving IDNA ASCII-compatible encodings in ASCII form" is just preserving the existing behaviour (not doing IDNA decoding). See
http://tools.ietf.org/html/rfc3490
and the docs for codecs -> encodings.idna ("xn--lzg" in the example is the ASCII-compatible encoding of "€", so if you look up that IP address, "xn--lzg" is returned with or without the patch).
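For anyone unfamiliar with the codec, a quick sketch of what "ASCII-compatible encoding" means in practice, using only the built-in "idna" codec:

```python
# Encoding a non-ASCII label yields a plain ASCII "xn--" label (the ACE form).
ace = "€".encode("idna")

# Decoding the ACE form with the same codec recovers the Unicode label.
back = ace.decode("idna")

# Names that are already ASCII pass through the codec unchanged.
ascii_only = "example.com".encode("idna")

print(ace, back, ascii_only)
```

The socket functions never apply the decoding step to results, which is why "xn--lzg" is returned as-is.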
I'll look into your other comments. In the meantime, I've got one more patch, as the decoding of the nodename field in os.uname() also needs to be changed to match the other hostname-returning functions. This patch changes it to ASCII/surrogateescape, with the usual PEP-383 decoding for the other fields.
OK, here are new versions of the original patches.
I've tweaked the docs to make clear that ASCII-compatible encodings actually *are* ASCII, and point to an explanation as soon as they're mentioned.
You're right that PyUnicode_AsEncodedString() is the preferable interface for the argument converter (I think I got PyUnicode_AsEncodedObject() from an old version of PyUnicode_FSConverter() :/), but for the ASCII step I've just short-circuited it and used PyUnicode_EncodeASCII() directly, since the converter has already checked that the object is of Unicode type. For the IDNA step, PyUnicode_AsEncodedString() should result in a less confusing error message if the codec returns some non-bytes object one day.
However, the PyBytes_Check isn't to check up on the codec, but to check for a bytes argument, which the converter also supports. For that reason, I think encode_hostname would be a misleading name, but I've renamed it hostname_converter after the example of PyUnicode_FSConverter, and renamed unicode_from_hostname to decode_hostname.
I've also made the converter check for UnicodeEncodeError in the ASCII step, but the end result really is UnicodeError if the IDNA step fails, because the "idna" codec does not use UnicodeEncodeError or UnicodeDecodeError. Complain about that if you wish :)
I think the example I gave in the previous comment was also confusing, so just to be clear...
In /etc/hosts (in UTF-8 encoding):
127.0.0.2 €
127.0.0.3 xn--lzg
Without patches:
>>> from socket import *
>>> getnameinfo(("127.0.0.3", 0), 0)
('xn--lzg', '0')
>>> getnameinfo(("127.0.0.2", 0), 0)
('€', '0')
>>> getaddrinfo(*_)
[(2, 1, 6, '', ('127.0.0.3', 0)), (2, 2, 17, '', ('127.0.0.3', 0)), (2, 3, 0, '', ('127.0.0.3', 0))]
>>> '€'.encode("idna")
b'xn--lzg'
With patches:
>>> from socket import *
>>> getnameinfo(("127.0.0.3", 0), 0)
('xn--lzg', '0')
>>> getnameinfo(("127.0.0.2", 0), 0)
('\udce2\udc82\udcac', '0')
>>> getaddrinfo(*_)
[(2, 1, 6, '', ('127.0.0.2', 0)), (2, 2, 17, '', ('127.0.0.2', 0)), (2, 3, 0, '', ('127.0.0.2', 0))]
>>> '\udce2\udc82\udcac'.encode("idna")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/david/python-patches/python-3/Lib/encodings/idna.py", line 167, in encode
    result.extend(ToASCII(label))
  File "/home/david/python-patches/python-3/Lib/encodings/idna.py", line 76, in ToASCII
    label = nameprep(label)
  File "/home/david/python-patches/python-3/Lib/encodings/idna.py", line 38, in nameprep
    raise UnicodeError("Invalid character %r" % c)
UnicodeError: Invalid character '\udce2'
The exception at the end demonstrates why surrogateescape strings don't get confused with IDNs.
I noticed that try-surrogateescape-first.diff missed out one of the string references that needed to be changed to point to the bytes object, and also used PyBytes_AS_STRING() in an unlocked section. This version fixes these things by taking the generally safer approach of setting the original char * variable to the hostname immediately after using hostname_converter().
Is this patch in response to an actual problem, or a theoretical problem? If "actual problem": what was the specific application, and what was the specific host name?
If theoretical, I recommend to close it as "won't fix". I find it perfectly reasonable if Python's socket module gives an error if the hostname can't be clearly decoded. Applications that run into it as a result of gethostbyaddr should treat that as "no reverse name available".
> Is this patch in response to an actual problem, or a theoretical problem? If "actual problem": what was the specific application, and what was the specific host name?
It's about environments, not applications - the local network may be configured with non-ASCII bytes in hostnames (either in the local DNS *or* a different lookup mechanism - I mentioned /etc/hosts as a simple example), or someone might deliberately connect from a garbage hostname as a denial of service attack against a server which tries to look it up with gethostbyaddr() or whatever (this may require a "non-strict" resolver library, as noted above).
> If theoretical, I recommend to close it as "won't fix". I find it perfectly reasonable if Python's socket module gives an error if the hostname can't be clearly decoded. Applications that run into it as a result of gethostbyaddr should treat that as "no reverse name available".
There are two points here. One is that the decoding can fail; I do think that programmers will find this surprising, and the fact that Python refuses to return what was actually received is a regression compared to 2.x.
The other is that the encoding and decoding are not symmetric - hostnames are being decoded with UTF-8 but encoded with IDNA. That means that when a decoded hostname contains a non-ASCII character which is not prohibited by IDNA/Nameprep, that string will, when used in a subsequent call, not refer to the hostname that was actually received, because it will be re-encoded using a different codec.
Attaching a refreshed version of try-surrogateescape-first.diff. I've separated out the change to getnameinfo() as it may be superfluous (issue bpo-1027206).
> > Is this patch in response to an actual problem, or a theoretical problem? If "actual problem": what was the specific application, and what was the specific host name?
> It's about environments, not applications
Still, my question remains. Is it a theoretical problem (i.e. one of your imagination), or a real one (i.e. one you observed in real life, without explicitly triggering it)? If real: what was the specific environment, and what was the specific host name?
> There are two points here. One is that the decoding can fail; I do think that programmers will find this surprising, and the fact that Python refuses to return what was actually received is a regression compared to 2.x.
True. However, I think this is an acceptable regression, assuming the problem is merely theoretical. It is ok if an operation fails that you will never run into in real life.
> That means that when a decoded hostname contains a non-ASCII character which is not prohibited by IDNA/Nameprep, that string will, when used in a subsequent call, not refer to the hostname that was actually received, because it will be re-encoded using a different codec.
Again, I fail to see the problem in this. It won't happen in real life. However, if you are worried that this could be abused, I think it should decode host names as ASCII, not as UTF-8. Then it will be symmetric again (IIUC).
> > It's about environments, not applications
> Still, my question remains. Is it a theoretical problem (i.e. one of your imagination), or a real one (i.e. one you observed in real life, without explicitly triggering it)? If real: what was the specific environment, and what was the specific host name?
Yes, I did reproduce the problem on my own system (Ubuntu 8.04). No, it is not from a real application, nor do I know anyone with their network configured like this (except possibly Dan "djbdns" Bernstein: http://cr.yp.to/djbdns/idn.html ).
I reported this bug to save anyone who *is* in such an environment from crashing applications and erroneous name resolution.
> > That means that when a decoded hostname contains a non-ASCII character which is not prohibited by IDNA/Nameprep, that string will, when used in a subsequent call, not refer to the hostname that was actually received, because it will be re-encoded using a different codec.
> Again, I fail to see the problem in this. It won't happen in real life. However, if you are worried that this could be abused, I think it should decode host names as ASCII, not as UTF-8. Then it will be symmetric again (IIUC).
That would be an improvement. The idea of the patches I posted is to combine this with the existing surrogateescape mechanism, which handles situations like this perfectly well. I don't see how getting a UnicodeError is better than getting a string with some lone surrogates in it. In fact, it was my understanding of PEP-383 that it is in fact better to get the lone surrogates.
> That would be an improvement. The idea of the patches I posted is to combine this with the existing surrogateescape mechanism, which handles situations like this perfectly well.
The surrogateescape mechanism is a very hackish approach, and violates the principle that errors should never pass silently. However, it solves a real problem - people do run into the problem with file names every day. With this problem, I'd say "if it hurts, don't do it, then".
> The surrogateescape mechanism is a very hackish approach, and violates the principle that errors should never pass silently.
I don't see how a name resolution API returning non-ASCII bytes would indicate an error. If the host table contains a non-ASCII byte sequence for a host, then that is the host's name - it works just as well as an ASCII name, both forwards and backwards.
What is hackish is representing char * data as a Unicode string when there is no native Unicode API to feed it to - there is no issue here such as file names being bytes on Unix and Unicode on Windows, so the clean thing to do would be to return a bytes object. I suggested the surrogateescape mechanism in order to retain backwards compatibility.
> However, it solves a real problem - people do run into the problem with file names every day. With this problem, I'd say "if it hurts, don't do it, then".
But to be more explicit, that's like saying "if it hurts, get your sysadmin to reconfigure the company network".
> I don't see how a name resolution API returning non-ASCII bytes would indicate an error.
It's in violation of RFC 952 (slightly relaxed by RFC 1123).
> But to be more explicit, that's like saying "if it hurts, get your sysadmin to reconfigure the company network".
Which I consider perfectly reasonable. The sysadmin should have known (and, in practice, *always* knows) not to do that in the first place (the larger the company, the more cautious the sysadmin).
> > I don't see how a name resolution API returning non-ASCII bytes would indicate an error.
> It's in violation of RFC 952 (slightly relaxed by RFC 1123).
That's bad if it's on the public Internet, but it's not an error. The OS is returning the name by which it knows the host.
If you look at POSIX, you'll see that what getaddrinfo() and getnameinfo() look up and return is referred to as a "node name", which can be an address string or a "descriptive name", and that if used with Internet address families, descriptive names "include" host names. It doesn't say that the string can only be an address string or a hostname (RFC 1123 compliant or otherwise).
> > But to be more explicit, that's like saying "if it hurts, get your sysadmin to reconfigure the company network".
> Which I consider perfectly reasonable. The sysadmin should have known (and, in practice, *always* knows) not to do that in the first place (the larger the company, the more cautious the sysadmin).
It's not reasonable when addressed to a customer who might go elsewhere. And I still don't see a technical reason for making such a demand. Python 2.x seems to work just fine using 8-bit strings.
> It's not reasonable when addressed to a customer who might go elsewhere.
I remain -1 on this change, until such a customer actually shows up at a Python developer.
OK, I still think this issue should be addressed, but here is a patch for the part we agree on: that decoding should not return any Unicode characters except ASCII.
The rest of the issue could also be straightforwardly addressed by adding bytes versions of the name lookup APIs. Attaching a patch which does that (applies on top of decode-strict-ascii.diff).
Oops, forgot to refresh the last change into that patch. This should fix it.
platform.system() fails with UnicodeEncodeError on systems that have their computer name set to a name containing non-ASCII characters. The implementation of platform.system() at some point uses socket.gethostname() (see http://www.pasteall.org/16215 for a stack trace of such usage).
Many of our Blender users are not native English speakers, and they set up their machines as they please, whether that goes against the RFCs or not.
This currently breaks some code that uses platform.system() to check the system it's run on. The paste from above is from a user who has named his computer Nötkötti.
It would be more than great if this error could be fixed. If another 3.1 release is planned, preferably for that.
> platform.system() fails with UnicodeEncodeError on systems that have their computer name set to a name containing non-ASCII characters. The implementation of platform.system() at some point uses socket.gethostname() (see http://www.pasteall.org/16215 for a stack trace of such usage).
This trace is from a Windows system, where the platform module uses gethostname() in its cross-platform uname() function, which platform.system() and various other functions in the module rely on. On a Unix system, platform.uname() depends on os.uname() working, meaning that these functions still fail when the hostname cannot be decoded, as it is part of os.uname()'s return value.
Given that os.uname() is a primary source of information about the platform on Unix systems, this sort of collateral damage from an undecodable hostname is likely to occur in more places.
> It would be more than great if this error could be fixed. If another 3.1 release is planned, preferably for that.
If you'd like to try the surrogateescape patches, they ought to fix this. The relevant patches are ascii-surrogateescape-2.diff, try-surrogateescape-first-4.diff and uname-surrogateescape.diff.
The failure of platform.uname is an independent bug. IMO, it shouldn't use socket.gethostname on Windows, but instead look at the COMPUTERNAME environment variable or call the GetComputerName API function. This is closer to what uname() does on Unix (i.e. retrieve the local machine name independent of DNS).
I have created bpo-10097 for this bug.
As a further note: I think socket.gethostname() is a special case, since this is just about a local setting (i.e. not related to DNS). We should then assume that it is encoded in the locale encoding (in particular, that it is encoded in mbcs on Windows).
Regarding fixing the issue at hand on Windows, I think Python should use the corresponding win32 API for getting the hostname: GetComputerNameEx().
It supports Unicode, so the encoding issue doesn't arise.
See http://msdn.microsoft.com/en-us/library/ms724301(v=VS.85).aspx for details.
This also solves the platform.uname() issue mentioned here, since the uname() emulation for Windows relies on socket.gethostname() to determine the node name.
FWIW: glibc does the reverse...
The GNU C library implements gethostname() as a library function that calls uname(2) and copies up to len bytes from the returned nodename field into name.
> As a further note: I think socket.gethostname() is a special case, since this is just about a local setting (i.e. not related to DNS).
But the hostname *is* commonly intended to be looked up in the DNS or whatever name resolution mechanisms are used locally - socket.getfqdn(), for instance, works by looking up the result using gethostbyaddr() (actually the C function getaddrinfo(), followed by gethostbyaddr()). So I don't see the rationale for treating it differently from the results of gethostbyaddr(), getnameinfo(), etc.
POSIX says of the name lookup functions that "in many cases" they are implemented by the Domain Name System, not that they always are, so a name intended for lookup need not be ASCII-only either.
> We should then assume that it is encoded in the locale encoding (in particular, that it is encoded in mbcs on Windows).
I can see the point of returning the characters that were intended, but code that looked up the returned name would then have to be changed to re-encode it to bytes to avoid the round-tripping issue when non-ASCII characters are returned.
On 15.10.2010 at 20:03, David Watson wrote:
David Watson <baikie@users.sourceforge.net> added the comment:
> > As a further note: I think socket.gethostname() is a special case, since this is just about a local setting (i.e. not related to DNS).
> But the hostname *is* commonly intended to be looked up in the DNS or whatever name resolution mechanisms are used locally - socket.getfqdn(), for instance, works by looking up the result using gethostbyaddr() (actually the C function getaddrinfo(), followed by gethostbyaddr()). So I don't see the rationale for treating it differently from the results of gethostbyaddr(), getnameinfo(), etc.
The result from gethostname likely comes out of machine-local configuration. It may have non-ASCII in it, which is then likely encoded in the local encoding. When looking it up in DNS, IDNA should be applied.
OTOH, output from gethostbyaddr likely comes out of the DNS itself. Guessing what encoding it may have is futile - other than guessing that it really ought to be ASCII.
> I can see the point of returning the characters that were intended, but code that looked up the returned name would then have to be changed to re-encode it to bytes to avoid the round-tripping issue when non-ASCII characters are returned.
Python's socket module is clearly focused on the internet, and intends to support that well. So if you pass a non-ASCII string, it will have to encode that using IDNA. If that's not what you want to get, tough luck.
> The result from gethostname likely comes out of machine-local configuration. It may have non-ASCII in it, which is then likely encoded in the local encoding. When looking it up in DNS, IDNA should be applied.
I would have thought that someone who intended a Unicode hostname to be looked up in its IDNA form would have encoded it using IDNA, rather than an 8-bit encoding - how many C programs would transcode the name that way, rather than just passing the char * from one interface to another?
In fact, I would think that non-ASCII bytes in a hostname most probably indicated that a name resolution mechanism other than the DNS was in use, and that the byte string should be passed unaltered just as a typical C program would.
> OTOH, output from gethostbyaddr likely comes out of the DNS itself. Guessing what encoding it may have is futile - other than guessing that it really ought to be ASCII.
Sure, but that doesn't mean the result can't be made to round-trip if it turns out not to be ASCII. The guess that it will be ASCII is, after all, still a guess (as is the guess that it comes from the DNS).
> Python's socket module is clearly focused on the internet, and intends to support that well. So if you pass a non-ASCII string, it will have to encode that using IDNA. If that's not what you want to get, tough luck.
I don't object to that, but it does force a choice between decoding an 8-bit name for display (e.g. by using the locale encoding), and decoding it to round-trip automatically (e.g. by using ASCII/surrogateescape, with support on the encoding side).
Using PyUnicode_DecodeFSDefault() for the hostname or other returned names (thus decoding them for display) would make this issue solvable with programmer intervention - for instance, "socket.gethostbyaddr(socket.gethostname())" could be replaced by "socket.gethostbyaddr(os.fsencode(socket.gethostname()))", but programmers might well neglect to do this, given that no encoding was needed in Python 2.
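A sketch of the manual round trip this would require, using os.fsdecode() and os.fsencode() as the Python-level counterparts of the file system default decoding (an assumption for illustration; no lookup is performed here, and the byte string stands in for a resolver result):

```python
import os

# Stand-in for the bytes a C lookup function returned (UTF-8 for "€").
raw = b"\xe2\x82\xac"

# Decoding for display with the file system encoding and error handler.
shown = os.fsdecode(raw)

# The programmer has to re-encode explicitly before the next lookup,
# otherwise a str argument would be encoded with IDNA instead.
recovered = os.fsencode(shown)

print(repr(shown), recovered)
```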
Also, even displaying a non-ASCII name decoded according to the locale creates potential for confusion, as when the user types the same characters into a Python program for lookup (again, barring programmer intervention), they will not represent the same byte sequence as the characters the user sees on the screen (as they will instead represent their IDNA ASCII-compatible equivalent).
So overall, I do think it is better to decode names for automatic round-tripping rather than for display, but my main concern is simply that it should be possible to recover the original bytes so that round-tripping is at least possible. PyUnicode_DecodeFSDefault() would accomplish that much at least.
> I would have thought that someone who intended a Unicode hostname to be looked up in its IDNA form would have encoded it using IDNA, rather than an 8-bit encoding - how many C programs would transcode the name that way, rather than just passing the char * from one interface to another?
Well, Python is not C. In Python, you would pass a str, and expect it to work, which means it will get automatically encoded with IDNA.
> In fact, I would think that non-ASCII bytes in a hostname most probably indicated that a name resolution mechanism other than the DNS was in use, and that the byte string should be passed unaltered just as a typical C program would.
I'm not talking about byte strings, but character strings.
> I don't object to that, but it does force a choice between decoding an 8-bit name for display (e.g. by using the locale encoding), and decoding it to round-trip automatically (e.g. by using ASCII/surrogateescape, with support on the encoding side).
In the face of ambiguity, refuse the temptation to guess.
> So overall, I do think it is better to decode names for automatic round-tripping rather than for display, but my main concern is simply that it should be possible to recover the original bytes so that round-tripping is at least possible.
Marc-Andre wants gethostname to use the Wide API on Windows, which, in theory, allows for cases where round-tripping to bytes is impossible.
> > In fact, I would think that non-ASCII bytes in a hostname most probably indicated that a name resolution mechanism other than the DNS was in use, and that the byte string should be passed unaltered just as a typical C program would.
> I'm not talking about byte strings, but character strings.
I mean that passing the str object from socket.gethostname() to the Python lookup function ought to result in the same byte string being passed to the C lookup function as was returned by the C gethostname() function (or else that the programmer must re-encode the str to ensure that that result is obtained).
> > I don't object to that, but it does force a choice between decoding an 8-bit name for display (e.g. by using the locale encoding), and decoding it to round-trip automatically (e.g. by using ASCII/surrogateescape, with support on the encoding side).
> In the face of ambiguity, refuse the temptation to guess.
Yes, I would interpret that to mean not using the locale encoding for data obtained from the network. That's another reason why the ASCII/surrogateescape scheme appeals to me more.
> Well, Python is not C. In Python, you would pass a str, and expect it to work, which means it will get automatically encoded with IDNA.
I think there might be a misunderstanding here - I've never proposed changing the interpretation of Unicode characters in hostname arguments. The ASCII/surrogateescape scheme I suggested only changes the interpretation of unpaired surrogate codes, as they do not occur in IDNs or any other genuine Unicode data; all IDNs, including those solely consisting of ASCII characters, would be encoded to the same byte sequence as before.
ASCII/surrogateescape decoding could also be used without support on the encoding side - that would satisfy the requirement to "refuse the temptation to guess", would allow the original bytes to be recovered, and would mean that attempting to look up a non-ASCII result in str form would raise an exception rather than looking up the wrong name.
> Marc-Andre wants gethostname to use the Wide API on Windows, which, in theory, allows for cases where round-tripping to bytes is impossible.
Well, the name resolution APIs wrapped by Python are all byte-oriented, so if the computer name were to have no bytes equivalent then it wouldn't be possible to resolve it anyway, and an exception rightly ought to be raised at some point in the process of trying to do so.
I was looking at the MSDN pages linked to above, and these two pages seemed to suggest that Unicode characters appearing in DNS names represented UTF-8 sequences, and that Windows allowed such non-ASCII byte sequences in the DNS by default:
http://msdn.microsoft.com/en-us/library/ms724220%28v=VS.85%29.aspx
http://msdn.microsoft.com/en-us/library/ms682032%28v=VS.85%29.aspx
(See the discussion of DNS_ERROR_NON_RFC_NAME in the latter.) Can anyone confirm if this is the case?
The BSD-style gethostname() function can't be returning UTF-8, though, or else the "Nötkötti" example above would have been decoded successfully, given that Python currently uses PyUnicode_FromString().
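The reasoning can be checked with a short sketch. It assumes, purely for illustration, that the name was returned in a Windows ANSI code page such as cp1252; the actual code page on that user's machine is not known from the report:

```python
# "Nötkötti" as it might be returned in an 8-bit ANSI code page (assumed).
name_bytes = "Nötkötti".encode("cp1252")

# PyUnicode_FromString() decodes as UTF-8; these bytes are not valid UTF-8.
try:
    name_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

# Had the function returned UTF-8, decoding would have succeeded.
utf8_roundtrip = "Nötkötti".encode("utf-8").decode("utf-8")

print(decodes_as_utf8, utf8_roundtrip)
```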
Also, if GetComputerNameEx() only offers a choice of DNS names or NetBIOS names, and both are byte-oriented underneath (that was my reading of the "Computer Names" page), then presumably there shouldn't be a problem with mapping the result to a bytes equivalent when necessary?
> Also, if GetComputerNameEx() only offers a choice of DNS names or NetBIOS names, and both are byte-oriented underneath (that was my reading of the "Computer Names" page), then presumably there shouldn't be a problem with mapping the result to a bytes equivalent when necessary?
They aren't byte-oriented underneath. It depends on whether you use GetComputerNameA or GetComputerNameW whether you get bytes or Unicode. If bytes, they are converted as if by WideCharToMultiByte using CP_ACP, which in turn will introduce question marks and the like for unconvertible characters.
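The lossy conversion can be imitated in pure Python. This is only an analogy, using the "replace" error handler with an assumed cp1252 code page rather than the real Win32 call, and the sample name is made up:

```python
# A hypothetical computer name with characters outside the ANSI code page.
wide_name = "ホスト"

# Analogue of the "A" interface: unconvertible characters become "?".
narrow = wide_name.encode("cp1252", errors="replace")

# Decoding the narrow form cannot recover the original name.
restored = narrow.decode("cp1252")

print(narrow, restored)
```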
> > Also, if GetComputerNameEx() only offers a choice of DNS names or NetBIOS names, and both are byte-oriented underneath (that was my reading of the "Computer Names" page), then presumably there shouldn't be a problem with mapping the result to a bytes equivalent when necessary?
> They aren't byte-oriented underneath. It depends on whether you use GetComputerNameA or GetComputerNameW whether you get bytes or Unicode. If bytes, they are converted as if by WideCharToMultiByte using CP_ACP, which in turn will introduce question marks and the like for unconvertible characters.
Sorry, I didn't mean how Windows constructs the result for the "A" interface - I was talking about Python code being able to map the result from the Unicode interface to the form used in the protocol (e.g. DNS). I believe the proposal is to use the DNS name, so since the DNS is byte oriented, I would have thought that the Unicode "DNS name" result would always have a bytes equivalent that the DNS resolver code would use - perhaps its UTF-8 encoding?
> Sorry, I didn't mean how Windows constructs the result for the "A" interface - I was talking about Python code being able to map the result from the Unicode interface to the form used in the protocol (e.g. DNS). I believe the proposal is to use the DNS name
I disagree with the proposal - it should return whatever name gethostname from winsock.dll returns (which I expect to be the netbios name).
> so since the DNS is byte oriented, I would have thought that the Unicode "DNS name" result would always have a bytes equivalent that the DNS resolver code would use - perhaps its UTF-8 encoding?
No no no. When Microsoft calls it the DNS name, they don't actually mean that it has to do anything with DNS. In particular, it's not byte-oriented.
Just to clarify: I was proposing to use the GetComputerNameExW() win32 API with ComputerNamePhysicalDnsHostname, which returns Unicode without needing any roundtrip via bytes and the issues associated with this.
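For readers who want to try the proposed call from Python, here is a minimal ctypes sketch. It is not the C code that went into socketmodule.c; the function name `physical_dns_hostname` is hypothetical, and the value 5 for ComputerNamePhysicalDnsHostname is taken from the Windows COMPUTER_NAME_FORMAT enumeration. On non-Windows platforms the sketch simply returns None.

```python
import sys

COMPUTER_NAME_PHYSICAL_DNS_HOSTNAME = 5  # COMPUTER_NAME_FORMAT enumeration value


def physical_dns_hostname():
    """Return the Unicode host name via GetComputerNameExW, or None off Windows."""
    if sys.platform != "win32":
        return None
    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    get_name = kernel32.GetComputerNameExW
    get_name.argtypes = [ctypes.c_int, wintypes.LPWSTR,
                         ctypes.POINTER(wintypes.DWORD)]
    get_name.restype = wintypes.BOOL

    # The first call with no buffer fails but reports the required size.
    size = wintypes.DWORD(0)
    get_name(COMPUTER_NAME_PHYSICAL_DNS_HOSTNAME, None, ctypes.byref(size))

    buf = ctypes.create_unicode_buffer(size.value)
    if not get_name(COMPUTER_NAME_PHYSICAL_DNS_HOSTNAME, buf, ctypes.byref(size)):
        raise ctypes.WinError(ctypes.get_last_error())
    return buf.value


print(physical_dns_hostname())
```

Because the wide-character interface is used throughout, no conversion through the ANSI code page takes place.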
I don't understand why Martin insists that the MS "DNS name" doesn't have anything to do with DNS. The fully qualified DNS name of a machine is determined as hostname.domainname, just like you would expect in DNS.
http://msdn.microsoft.com/en-us/library/ms724301(v=VS.85).aspx http://msdn.microsoft.com/en-us/library/ms724224(v=VS.85).aspx
As I said earlier: NetBIOS is being phased out in favor of DNS. MS is using a convention which mandates that NetBIOS names match DNS names. The only difference between the two is that NetBIOS names have a length limitation:
http://msdn.microsoft.com/en-us/library/ms724931(v=VS.85).aspx
Perhaps Martin could clarify why he insists on using the ANSI WinSock interface gethostname instead.
PS: WinSock provides many other Unicode APIs for socket module interfaces as well, so at least for that platform, we could use those to resolve uncertainties about the encoding used in name resolution.
On other platforms, I guess we'll just have to do some trial and error to see what works and what not. E.g. on Linux it is possible to set the hostname to a non-ASCII value, but then the resolver returns an error, so it's not very practical:
# hostname l\303\266wis
# hostname
löwis
# hostname -f
hostname: Resolver Error 0 (no error)
Using the IDNA version doesn't help either:
# hostname xn--lwis-5qa
# hostname
xn--lwis-5qa
# hostname -f
hostname: Resolver Error 0 (no error)
Python2 happily returns the host name, but fails to return a fully qualified domain name:
>>> socket.gethostname()
'l\xc3\xb6wis'
>>> socket.getfqdn()
'l\xc3\xb6wis'
and
>>> socket.gethostname()
'xn--lwis-5qa'
>>> socket.getfqdn()
'xn--lwis-5qa'
Just for comparison:
# hostname newton
# hostname
newton
# hostname -f
newton.egenix.internal
and
>>> socket.gethostname()
'newton'
>>> socket.getfqdn()
'newton.egenix.internal'
So at least on Linux, using non-ASCII hostnames doesn't really appear to be an option at this time.
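The two spellings tried in the experiment above are the UTF-8 bytes and the IDNA (ACE) form of the same name. A small sketch, using only the standard codecs, shows how they relate; `hostname_forms` is a hypothetical helper name.

```python
def hostname_forms(name):
    """Return the byte forms of a host name that the experiment compares."""
    return {
        "utf-8": name.encode("utf-8"),  # raw bytes, as passed to hostname(1)
        "idna": name.encode("idna"),    # ASCII-compatible encoding (ACE)
    }


forms = hostname_forms("löwis")
print(forms["utf-8"])
print(forms["idna"])
```

For a pure ASCII name such as "newton" both forms are identical, which is why the ambiguity only shows up with non-ASCII names.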
I think what's happening here is simply that you're setting the hostname to something which doesn't exist in the relevant name databases - the man page for Linux's hostname(1) says that "The FQDN is the name gethostbyname(2) returns for the host name returned by gethostname(2).". If the computer's usual name is "newton", that may be why it works and the others don't.
It works for me if I add "127.0.0.9 löwis.egenix.com löwis" to /etc/hosts and then set the hostname to "löwis" (all UTF-8): hostname -f prints "löwis.egenix.com", and Python 2's socket.getfqdn() returns the corresponding bytes; non-UTF-8 names work too. (Note that the FQDN must appear before the bare hostname in the /etc/hosts entry, and I used the address 127.0.0.9 simply to avoid a collision with existing entries - by default, Ubuntu assigns the FQDN to 127.0.1.1.)
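The ordering requirement follows from the /etc/hosts format, where the first name after the address is the canonical name and the rest are aliases. The sketch below parses one entry as bytes to make that explicit; `parse_hosts_line` is a hypothetical helper, not how glibc is implemented.

```python
def parse_hosts_line(line):
    """Split one /etc/hosts entry into (address, canonical_name, aliases).

    Names are kept as bytes because the file has no declared encoding.
    Returns None for blank or comment-only lines.
    """
    fields = line.split(b"#", 1)[0].split()
    if len(fields) < 2:
        return None
    return fields[0].decode("ascii"), fields[1], fields[2:]


entry = "127.0.0.9 löwis.egenix.com löwis".encode("utf-8")
address, canonical, aliases = parse_hosts_line(entry)
print(address, canonical, aliases)
```

With the FQDN listed first it becomes the canonical name, which is what a lookup of the bare hostname reports back.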
Looks like we have our first customer (bpo-10223).
I just did an experiment on Windows 7. I used SetComputerNameEx to set the NetBIOS name (4) to "e2718", and the DNS name (5) to "π3141"; then I rebooted. This is on a system with windows-1252 as its ANSI code page (i.e. u"π"==u"\N{GREEK SMALL LETTER PI}" is not in the ANSI charset). After the reboot, I found
- COMPUTERNAME is "P3141", and so is the result of GetComputerNameEx(4)
- GetComputerNameEx(5) is "π3141"
- socket.gethostname of Python 2.5 returns "p3141".
So my theory of how this all fits together is this:
- it's not really possible to completely decouple the DNS name and the NetBIOS name. Setting the DNS name also modifies the NetBIOS name; I suspect that the reverse is also true.
- gethostname returns the ANSI version of the DNS name (which happens to convert the GREEK SMALL LETTER PI to a LATIN SMALL LETTER P).
- the NetBIOS name is generally an uppercase version of the gethostname result. There may be rules in case the gethostname result contains characters illegal in NetBIOS.
In summary, I (now) think it's fine to return the Unicode version of the DNS name from gethostname on Windows.
Re msg119271: the name "π3141" really has nothing to do with the DNS on my system. It doesn't occur in any DNS zone, nor could it possibly. It's unclear to me why Microsoft calls it the "DNS name".
r85934 now uses GetComputerNameExW on Windows.
Martin v. Löwis wrote:
> I just did an experiment on Windows 7. I used SetComputerNameEx to set the NetBIOS name (4) to "e2718", and the DNS name (5) to "π3141"; then I rebooted. This is on a system with windows-1252 as its ANSI code page (i.e. u"π"==u"\N{GREEK SMALL LETTER PI}" is not in the ANSI charset). After the reboot, I found
> - COMPUTERNAME is "P3141", and so is the result of GetComputerNameEx(4)
> - GetComputerNameEx(5) is "π3141"
> - socket.gethostname of Python 2.5 returns "p3141".
>
> So my theory of how this all fits together is this:
> - it's not really possible to completely decouple the DNS name and the NetBIOS name. Setting the DNS name also modifies the NetBIOS name; I suspect that the reverse is also true.
The MS docs mention that setting the DNS name will adjust the NetBIOS name as well (with the NetBIOS name being converted to upper case and truncated, if the DNS name is too long).
They don't mention anything about the NetBIOS name encoding.
> Re msg119271: the name "π3141" really has nothing to do with the DNS on my system. It doesn't occur in any DNS zone, nor could it possibly. It's unclear to me why Microsoft calls it the "DNS name".
The DNS name of the Windows machine is the combination of the DNS host name and the DNS domain that you set up on the machine. I think the misunderstanding is that you assume this combination will somehow appear as a known DNS name of the machine via some DNS server on the network - that's not the case.
Of course, it's not particularly useful to set the DNS name to something that other machines cannot find out via a DNS query.
FWIW, you can do the same on a Linux box, i.e. set up the host name and domain to some completely bogus values. And as David pointed out, without also updating /etc/hosts on the Linux box, you always get the resolver error with hostname -f I mentioned earlier on (which does a DNS lookup), so there's no real connection to the DNS system on Linux either.
> r85934 now uses GetComputerNameExW on Windows.
Thanks, Martin.
Here's a similar discussion of the Windows approach (used in bzr):
https://bugs.launchpad.net/bzr/+bug/256550/comments/6
This is what Solaris uses:
http://developers.sun.com/dev/gadc/faq/locale.html#get-set
(they require conversion to ASCII and using IDNA for non-ASCII names)
I found this RFC draft on the topic: http://tools.ietf.org/html/draft-josefsson-getaddrinfo-idn-00 which suggests that there is no standard for the encoding used by the socket host name APIs yet.
ASCII, UTF-8 and IDNA are happily mixed and matched.
The Solaris case then is already supported, with no change required: if Solaris bans non-ASCII in the network configuration (or, rather, recommends using IDNA), then this will work fine with the current code.
The Josefsson AI_IDN flag is irrelevant to Python, IMO: it treats byte names as locale-encoded, and converts them with IDNA. Python 3 users really should use Unicode strings in the first place for non-ASCII data, in which case socket.getaddrinfo uses IDNA anyway. However, it can't hurt to expose this flag if the underlying C library supports it. AI_CANONIDN might be interesting to implement, but I'd rather wait to see whether this finds RFC approval. In any case, undoing IDNA is orthogonal to this issue (which is about non-ASCII data returned from the socket API).
If anything needs to be done on Unix, I think that the gethostname result should be decoded using the file system encoding; I then don't mind using surrogate escape there for good measure. This won't hurt systems that restrict host names to ASCII, and may do some good for systems that don't.
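A minimal sketch of the surrogateescape idea suggested above. To keep the result reproducible the encoding is passed explicitly (UTF-8 here, as an assumption) rather than read from the running system; on POSIX, `os.fsdecode()` and `os.fsencode()` apply the same error handler with the actual file system encoding. The helper names are hypothetical.

```python
def decode_hostname(raw, encoding="utf-8"):
    """Decode host name bytes, smuggling undecodable bytes through as
    lone surrogates instead of raising UnicodeDecodeError."""
    return raw.decode(encoding, errors="surrogateescape")


def encode_hostname(name, encoding="utf-8"):
    """Reverse decode_hostname(), restoring the original bytes exactly."""
    return name.encode(encoding, errors="surrogateescape")


raw = b"l\xf6wis"             # Latin-1 bytes, not valid UTF-8
name = decode_hostname(raw)
print(ascii(name))            # the bad byte appears as an escaped surrogate
print(encode_hostname(name) == raw)
```

Pure ASCII names pass through unchanged, so systems that restrict host names to ASCII see no difference.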
> If anything needs to be done on Unix, I think that the gethostname result should be decoded using the file system encoding; I then don't mind using surrogate escape there for good measure. This won't hurt systems that restrict host names to ASCII, and may do some good for systems that don't.
Wouldn't it be better to also attempt to decode the name using IDNA in case the name starts with the IDNA prefix?
This would then also cover the Solaris case.
> The DNS name of the Windows machine is the combination of the DNS host name and the DNS domain that you set up on the machine. I think the misunderstanding is that you assume this combination will somehow appear as a known DNS name of the machine via some DNS server on the network - that's not the case.
I don't assume that - I merely point out that it clearly has no relationship to the DNS (unless you explicitly make it that way). So, I wonder why they call it the DNS name - they could just as well have called it the "LDAP name", or the "NIS name". In either case, setting the name would have no impact on the respective naming infrastructure.
> FWIW, you can do the same on a Linux box, i.e. set up the host name and domain to some completely bogus values. And as David pointed out, without also updating /etc/hosts on the Linux box, you always get the resolver error with hostname -f I mentioned earlier on (which does a DNS lookup), so there's no real connection to the DNS system on Linux either.
Yes, but Linux (rightly) calls it the "hostname", not the "DNS name".
> Wouldn't it be better to also attempt to decode the name using IDNA in case the name starts with the IDNA prefix?
Perhaps better - but incompatible. I don't see a way to have the resolver functions automatically decode IDNA, without potentially breaking existing applications that specifically look for the IDNA prefix (say).
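To make the trade-off concrete, here is a hedged sketch of what opportunistic IDNA decoding could look like if done in application code rather than inside the resolver functions. `maybe_decode_idna` is a hypothetical helper, not part of the socket module.

```python
ACE_PREFIX = "xn--"


def maybe_decode_idna(name):
    """Decode ACE labels to Unicode if any label carries the IDNA prefix.

    Names without the prefix, and names that fail to decode, are returned
    unchanged so the caller never sees an exception.
    """
    labels = name.split(".")
    if not any(label.lower().startswith(ACE_PREFIX) for label in labels):
        return name
    try:
        return name.encode("ascii").decode("idna")
    except UnicodeError:
        return name


print(maybe_decode_idna("xn--lwis-5qa"))
print(maybe_decode_idna("newton.egenix.internal"))
```

An application that compares resolver results against the literal "xn--" form would see different strings after such a change, which is the compatibility concern raised above.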
The code in socketmodule.c currently compiles with suspicious warnings:
socketmodule.c(3108) : warning C4047: 'function' : 'LPSTR' differs in levels of indirection from 'int'
socketmodule.c(3108) : warning C4024: 'GetComputerNameA' : different types for formal and actual parameter 1
socketmodule.c(3109) : warning C4133: 'function' : incompatible types - from 'Py_UNICODE *' to 'LPDWORD'
socketmodule.c(3110) : warning C4020: 'GetComputerNameA' : too many actual parameters
Was GetComputerName() used instead of GetComputerNameExW()?
> FWIW, you can do the same on a Linux box, i.e. set up the host name and domain to some completely bogus values. And as David pointed out, without also updating /etc/hosts on the Linux box, you always get the resolver error with hostname -f I mentioned earlier on (which does a DNS lookup), so there's no real connection to the DNS system on Linux either.
Just to clarify here: there isn't anything special about /etc/hosts; it's handled by a pluggable module which performs hostname lookups in it alongside a similar module for the DNS. glibc's Name Service Switch combines the views provided by the various modules into a single byte-oriented namespace for hostnames according to the settings in /etc/nsswitch.conf (this namespace allows non-ASCII bytes, as the /etc/hosts examples demonstrate).
http://www.kernel.org/doc/man-pages/online/pages/man5/nsswitch.conf.5.html http://www.gnu.org/software/libc/manual/html_node/Name-Service-Switch.html
It's an extensible system, so people can write their own modules to handle whatever name services they have to deal with, and configure hostname lookup to query them before, after or instead of the DNS. A hostname that is not resolvable in the DNS may be resolvable in one of these.
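As an illustration of how the lookup order is configured, the sketch below extracts the service list from the "hosts:" line of an nsswitch.conf-style text. The sample configuration and the helper name `hosts_lookup_order` are made up for this example, and bracketed action items such as [NOTFOUND=return] are simply skipped.

```python
def hosts_lookup_order(conf_text):
    """Return the services listed on the "hosts:" line, in lookup order."""
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if line.startswith("hosts:"):
            tokens = line[len("hosts:"):].split()
            # Ignore action items like [NOTFOUND=return].
            return [tok for tok in tokens if not tok.startswith("[")]
    return []


sample = """
passwd: files
hosts:  files dns   # consult /etc/hosts first, then the DNS
"""
print(hosts_lookup_order(sample))
```

With "files" ahead of "dns", a name found in /etc/hosts resolves without the DNS ever being queried.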
I ran into this issue on my own PC. On a Russian version of Windows, the default PC name is ИВАН-ПК (C8 C2 C0 CD 2D CF CA in hex), and gethostbyaddr (CRT) returns it in exactly this form (encoded with the system locale's cp1251, not UTF-8). So when PyUnicode_FromString is called, it expects its argument to be a UTF-8 encoded string and raises an error. A lot of third-party modules use gethostbyaddr or getfqdn (which uses gethostbyaddr), so I can't just switch to a function that returns names as bytes. Surrogate names are also not acceptable because the name mentioned above becomes ????-??
Nick, which version of Python are you using? And which function are you running exactly? It seems that a4fd3dc74299 fixed the issue, this was included with 3.2.
Originally I tried 3.2.2 (32-bit), but I've just checked 3.2.3 and got the same result. The code to reproduce it is simple:
from socket import gethostbyaddr
a = gethostbyaddr('127.0.0.1')
leads to:
Traceback (most recent call last):
  File "C:\Users\user\test\test.py", line 13, in <module>
    a = gethostbyaddr('127.0.0.1')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcf in position 5: invalid continuation byte
Or a more complex sample:
def main():
    import http.server
    port = 80
    handlerClass = http.server.SimpleHTTPRequestHandler
    srv = http.server.HTTPServer(("", port), handlerClass)
    srv.serve_forever()

if __name__ == "__main__":
    main()
Attempting to connect to the server leads to:
----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 1156)
Traceback (most recent call last):
  File "C:\Python32\lib\socketserver.py", line 284, in _handle_request_noblock
    self.process_request(request, client_address)
  File "C:\Python32\lib\socketserver.py", line 310, in process_request
    self.finish_request(request, client_address)
  File "C:\Python32\lib\socketserver.py", line 323, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "C:\Python32\lib\socketserver.py", line 637, in __init__
    self.handle()
  File "C:\Python32\lib\http\server.py", line 396, in handle
    self.handle_one_request()
  File "C:\Python32\lib\http\server.py", line 384, in handle_one_request
    method()
  File "C:\Python32\lib\http\server.py", line 657, in do_GET
    f = self.send_head()
  File "C:\Python32\lib\http\server.py", line 701, in send_head
    self.send_response(200)
  File "C:\Python32\lib\http\server.py", line 438, in send_response
    self.log_request(code)
  File "C:\Python32\lib\http\server.py", line 483, in log_request
    self.requestline, str(code), str(size))
  File "C:\Python32\lib\http\server.py", line 517, in log_message
    (self.address_string(),
  File "C:\Python32\lib\http\server.py", line 559, in address_string
    return socket.getfqdn(host)
  File "C:\Python32\lib\socket.py", line 355, in getfqdn
    hostname, aliases, ipaddrs = gethostbyaddr(name)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcf in position 5: invalid continuation byte
P.S. My PC name is "USER-ПК"
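The failure can be reproduced without a Windows machine by decoding the cp1251 bytes of that name as UTF-8, which is effectively what PyUnicode_FromString attempts. This is a sketch of the decoding step only, not of the socket module itself.

```python
raw = "USER-ПК".encode("cp1251")  # the bytes the C API hands back here

try:
    raw.decode("utf-8")           # the decoding that fails in gethostbyaddr
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(raw)
print(error)
```

The reported position is that of the first Cyrillic byte, matching the position given in the tracebacks above.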
a4fd3dc74299 only fixed socket.gethostname(), not socket.gethostbyaddr().
For Windows versions that support it, we could use GetNameInfoW, available on XPSP2+, W2k3+ and Vista+.
The questions then are: what to do about gethostbyaddr, and what to do about the general case?
Since the problem appears to be specific to Windows, it might be appropriate to find a solution to just the Windows case, and ignore the general issue. For gethostbyaddr, decoding would then use CP_ACP.
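A rough Python-level sketch of the CP_ACP approach: on Windows the "mbcs" codec corresponds to the ANSI code page, so it is used as the default there, while other platforms fall back to the locale's preferred encoding. The helper `decode_ansi_hostname` is hypothetical, and the explicit `encoding` parameter exists only so the behaviour can be tried on any platform.

```python
import locale
import sys


def decode_ansi_hostname(raw, encoding=None):
    """Decode host name bytes using the ANSI code page (CP_ACP) on Windows.

    Assumption: "mbcs" is used on Windows; elsewhere the locale's preferred
    encoding stands in, purely for illustration.
    """
    if encoding is None:
        if sys.platform == "win32":
            encoding = "mbcs"
        else:
            encoding = locale.getpreferredencoding(False)
    return raw.decode(encoding, errors="replace")


# Explicit code pages, so the result does not depend on the current system.
print(decode_ansi_hostname(b"USER-\xcf\xca", "cp1251"))
```

Decoding with the code page that produced the bytes recovers the name that UTF-8 decoding rejects.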
I'd add that this bug is very practical and can render a lot of software unusable/noisy/confusing on Windows, including Django (I discovered this bug when mentoring at Django Girls).
The simple way to reproduce it is to take any Windows machine and set the regional settings to non-English (I've used Czech). You can verify the setting with "import locale; locale.getpreferredencoding()", which should then report a different encoding ("cp1250" in my case).
Then, set the "name" (= hostname, in Windows settings) of the computer to anything containing a non-ASCII character (like "Didejo-noťas").
As Windows apparently encodes the hostname using its default encoding, it fails with
  File "C:\Python34\lib\wsgiref\simple_server.py", line 50, in server_bind
    HTTPServer.server_bind(self)
  File "C:\Python34\lib\http\server.py", line 135, in server_bind
    self.server_name = socket.getfqdn(host)
  File "C:\Python34\lib\socket.py", line 463, in getfqdn
    hostname, aliases, ipaddrs = gethostbyaddr(name)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9d in position 9: invalid start byte
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['extension-modules', 'type-bug']
title = 'socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names'
updated_at =
user = 'https://bugs.python.org/baikie'
```
bugs.python.org fields:
```python
activity =
actor = 'vstinner'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Extension Modules']
creation =
creator = 'baikie'
dependencies = []
files = ['18195', '18196', '18259', '18272', '18273', '18609', '18616', '18617', '18674', '18676', '39812', '39813']
hgrepos = []
issue_num = 9377
keywords = ['patch']
message_count = 52.0
messages = ['111550', '111766', '111985', '112094', '114688', '114710', '114754', '114756', '114847', '114882', '115014', '115030', '115116', '115119', '115185', '115186', '115187', '118582', '118602', '118617', '118694', '118709', '118816', '118952', '119051', '119076', '119177', '119230', '119231', '119245', '119260', '119271', '119346', '119837', '119918', '119925', '119927', '119928', '119929', '119935', '119941', '119943', '119946', '120081', '158118', '158165', '158175', '158178', '159776', '243311', '245826', '259079']
nosy_count = 11.0
nosy_names = ['lemburg', 'loewis', 'amaury.forgeotdarc', 'vstinner', 'baikie', 'ezio.melotti', 'r.david.murray', 'jesterKing', 'spaun2002', 'steve.dower', 'Almad']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue9377'
versions = ['Python 3.2']
```