Passing unicode into minidom

michaelvl commented 8 years ago

This issue concerns the latest 'requests' based osmapi.

I think the use of requests 'response.text' in '_http_request' and thus passing unicode into minidom is problematic. I think the proper way to do this is to use the raw data 'request.content' for passing into minidom. Subsequently minidom will read the encoding from the xml header (e.g. '<?xml version="1.0" encoding="UTF-8"?>') and handle the raw data occording to this encoding information.

I seem to be able to trigger unicode issues with the following code:

import osmapi a = osmapi.OsmApi() c = a.ChangesetDownload(37393499)

which generates this unicode exception:

Traceback (most recent call last): File "", line 1, in File "osmapi/OsmApi.py", line 1324, in ChangesetDownload return self.ParseOsc(data) File "osmapi/OsmApi.py", line 1751, in ParseOsc data = xml.dom.minidom.parseString(data) File "/usr/lib/python2.7/xml/dom/minidom.py", line 1928, in parseString return expatbuilder.parseString(string) File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 940, in parseString return builder.parseString(string) File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 223, in parseString parser.Parse(string, True) UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in position 11709: ordinal not in range(128)

Changing '_http_request' to use 'request.content' corrects this issue.

austinhartzheim commented 8 years ago

The requests response documentation seems to concur with your suggestion:

For example, HTTP and XML have the ability to specify their encoding in their body. In situations like this, you should use r.content to find the encoding, and then set r.encoding. This will let you use r.text with the correct encoding.

I'm also curious as to whether the OSM API is using a fixed encoding for all responses. If that is the case, we could probably solve this by just setting r.encoding instead of having the xml module infer it from the binary data. But either solution would likely work.

michaelvl commented 8 years ago

In the current implementation, the requests module properly infers the encoding from the http header 'content-encoding' and sets 'request.encoding' to 'utf-8'. When osmapi read 'request.text' it gets unicode(r.content, encoding='utf-8'). This unicode object is passed into minidom, which reads the '<?xml version="1.0" encoding="UTF-8"?>' header and thus assumes the unicode 'byte array' is encoded as utf-8, but that is not the case because of the python unicode abstraction. Googling unicode and minidom also seems to suggest that the recommended way to handle unicode XML is to let minidom do the encoding handling.

austinhartzheim commented 8 years ago

@MichaelVL Thanks for doing the research on that. I had noticed that the minidom documentation was somewhat sparse with regard to how it would handle different input encodings.

Ultimately, @metaodi needs to accept the patch, but if you wanted to create a pull request with the change you suggested, I think it would be accepted. Let me know if you need any help making the change or submitting the pull request.

metaodi commented 8 years ago

@MichaelVL good catch! Would you mind creating a PR for that? I'd be happy to merge it. I will try to create a test to make sure this doesn't happen again.

@austinhartzheim Thanks for being a good "open source citizen" ;)

I'm still unsure if we should keep the minidom code or replace it with lxml. I mean this library doesn't need a lot of XML features, but still it seems lxml is the better choice and is actively maintained. Let me know if you have any thoughts about that as well.

metaodi commented 7 years ago

I'm closing this issue as the PR got merged.

metaodi / osmapi

Passing unicode into minidom #57