Closed michaelvl closed 7 years ago
The requests response documentation seems to concur with your suggestion:
For example, HTML and XML have the ability to specify their encoding in their body. In situations like this, you should use r.content to find the encoding, and then set r.encoding. This will let you use r.text with the correct encoding.
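A rough sketch of what "use r.content to find the encoding" could look like for an XML body. The helper name sniff_xml_encoding is hypothetical (it is not part of requests or osmapi), and it only handles ASCII-compatible encodings, so it would not detect UTF-16:

```python
import re

# Matches an XML declaration at the start of the body and captures its
# encoding attribute, e.g. b'<?xml version="1.0" encoding="UTF-8"?>'.
_XML_DECL = re.compile(br'^\s*<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']')


def sniff_xml_encoding(body, default="utf-8"):
    """Return the encoding named in the XML declaration of `body` (bytes).

    Hypothetical helper: falls back to `default` when no declaration
    with an encoding attribute is found.
    """
    match = _XML_DECL.match(body)
    return match.group(1).decode("ascii") if match else default
```

With a requests response r, this would presumably be used as r.encoding = sniff_xml_encoding(r.content) before reading r.text.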
I'm also curious as to whether the OSM API is using a fixed encoding for all responses. If that is the case, we could probably solve this by just setting r.encoding instead of having the xml module infer it from the binary data. But either solution would likely work.
In the current implementation, the requests module infers the encoding from the charset in the HTTP 'Content-Type' header and sets 'response.encoding' to 'utf-8'. When osmapi reads 'response.text' it gets unicode(r.content, encoding='utf-8'). This unicode object is passed into minidom, which reads the '<?xml version="1.0" encoding="UTF-8"?>' header and thus assumes its input is a UTF-8 encoded byte string, but that is not the case: the data has already been decoded into a Python unicode object. Googling unicode and minidom also seems to suggest that the recommended way to handle unicode XML is to pass minidom the raw bytes and let it do the encoding handling.
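To illustrate letting minidom do the encoding handling, here is a minimal sketch written for Python 3 (the traceback in this issue is from Python 2.7, where the failure shows up). The sample document and the wrapper name parse_xml_bytes are made up for the example; the document is deliberately declared and encoded as ISO-8859-1 to show that the parser follows the XML declaration when it is given raw bytes:

```python
import xml.dom.minidom


def parse_xml_bytes(raw):
    """Parse raw XML bytes; the parser picks the encoding from the XML declaration."""
    return xml.dom.minidom.parseString(raw)


# Made-up sample data, not a real OSM API response. It contains the
# character u'\xf8' that appears in the exception reported in this issue.
sample = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    '<tag k="name" v="S\u00f8ndergade"/>'
).encode("iso-8859-1")

doc = parse_xml_bytes(sample)
name = doc.documentElement.getAttribute("v")
```

In osmapi terms this would correspond to handing the response's content (bytes) to minidom rather than its text.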
@MichaelVL Thanks for doing the research on that. I had noticed that the minidom documentation was somewhat sparse with regard to how it would handle different input encodings.
Ultimately, @metaodi needs to accept the patch, but if you wanted to create a pull request with the change you suggested, I think it would be accepted. Let me know if you need any help making the change or submitting the pull request.
@MichaelVL good catch! Would you mind creating a PR for that? I'd be happy to merge it. I will try to create a test to make sure this doesn't happen again.
@austinhartzheim Thanks for being a good "open source citizen" ;)
I'm still unsure if we should keep the minidom code or replace it with lxml. I mean this library doesn't need a lot of XML features, but still it seems lxml is the better choice and is actively maintained. Let me know if you have any thoughts about that as well.
I'm closing this issue as the PR got merged.
This issue concerns the latest 'requests' based osmapi.
I think the use of requests 'response.text' in '_http_request', and thus passing unicode into minidom, is problematic. I think the proper way to do this is to pass the raw data 'response.content' into minidom. minidom will then read the encoding from the XML header (e.g. '<?xml version="1.0" encoding="UTF-8"?>') and handle the raw data according to this encoding information.
I seem to be able to trigger unicode issues with the following code:
import osmapi
a = osmapi.OsmApi()
c = a.ChangesetDownload(37393499)
which generates this unicode exception:
Traceback (most recent call last):
File "", line 1, in
File "osmapi/OsmApi.py", line 1324, in ChangesetDownload
return self.ParseOsc(data)
File "osmapi/OsmApi.py", line 1751, in ParseOsc
data = xml.dom.minidom.parseString(data)
File "/usr/lib/python2.7/xml/dom/minidom.py", line 1928, in parseString
return expatbuilder.parseString(string)
File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 940, in parseString
return builder.parseString(string)
File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 223, in parseString
parser.Parse(string, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in position 11709: ordinal not in range(128)
Changing '_http_request' to use 'response.content' corrects this issue.