mloesch / sickle

Sickle: OAI-PMH for Humans

force requests response to utf-8 #9

Closed atomotic closed 8 years ago

atomotic commented 8 years ago

Record.raw appears as unicode, but its content always seems to be ISO-8859-1. Is it correct to force the requests response encoding?

Aubreymcfato commented 8 years ago

Much needed. Using Sickle + Unicodecsv gives me a lot of strange characters, and consequently a lot of work to do :-(

cormier commented 8 years ago

I have had this issue with a server that specifies a text/html Content-Type but no explicit charset.

When that happens, requests parses it as ISO-8859-1 [1]:

The only time Requests will not do this is if no explicit charset is present in the HTTP headers and the Content-Type header contains text. In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1. Requests follows the specification in this case. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content.
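To make the effect of that fallback concrete, here is a minimal sketch using only the standard library. The sample record text is made up for illustration; it is not taken from any real server response.

```python
# UTF-8 bytes, as an OAI-PMH server is expected to send them.
# The record text below is a hypothetical sample.
body = "<dc:title>Università degli Studi</dc:title>".encode("utf-8")

# What a client ends up with when it falls back to ISO-8859-1
# for a text/* response that carries no explicit charset.
misdecoded = body.decode("iso-8859-1")

# What the bytes actually contain when decoded as UTF-8.
correct = body.decode("utf-8")

# ISO-8859-1 maps every byte to exactly one code point, so text that was
# misdecoded this way can be turned back into the original bytes and
# decoded again as UTF-8.
repaired = misdecoded.encode("iso-8859-1").decode("utf-8")

print(repr(misdecoded))
print(repr(correct))
print(repaired == correct)
```

The two-byte UTF-8 sequence for "à" shows up as two separate characters in the misdecoded string, which is the kind of "strange characters" reported above.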

According to the OAI specs, a server should be sending a UTF-8 encoded document [2]:

All responses to OAI-PMH requests must be well-formed XML instance documents. Encoding of the XML must use the UTF-8 representation of Unicode. Character references, rather than entity references, must be used. Character references allow XML responses to be treated as stand-alone documents that can be manipulated without dependency on entity declarations external to the document.

This is the case in my situation, but the XML is parsed as ISO-8859-1, so I'm having encoding issues. The best solution would be for the server to follow RFC 2616 and specify a charset if it expects its responses to be parsed as UTF-8. However, since we don't have any control over this, I think that forcing the output to be parsed as UTF-8 is acceptable. A better solution would be to make this user-configurable (in case the server follows neither RFC 2616 nor the OAI spec).
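A rough sketch of what "user-configurable" could look like, independent of Sickle's actual implementation: an explicit override wins, then a charset declared in the Content-Type header, then UTF-8 as the default the OAI spec requires. The names `pick_encoding` and `decode_body` are hypothetical helpers for illustration only.

```python
from email.message import Message
from typing import Optional


def pick_encoding(content_type: str,
                  override: Optional[str] = None,
                  default: str = "utf-8") -> str:
    """Choose an encoding: user override, then header charset, then default."""
    if override:
        return override
    # Reuse the stdlib header parser to extract the charset parameter, if any.
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def decode_body(body: bytes,
                content_type: str,
                override: Optional[str] = None) -> str:
    """Decode raw response bytes with the encoding chosen above."""
    return body.decode(pick_encoding(content_type, override))


raw = "Università".encode("utf-8")
print(pick_encoding("text/html"))                        # no charset in header
print(pick_encoding("text/xml; charset=ISO-8859-1"))     # charset from header
print(pick_encoding("text/html", override="cp1252"))     # explicit override
print(decode_body(raw, "text/html"))
```

With requests, the same idea amounts to setting `Response.encoding` before reading `Response.text`, or decoding `Response.content` yourself, as the quoted documentation suggests.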

[1] http://docs.python-requests.org/en/master/user/advanced/#request-and-response-objects
[2] https://www.openarchives.org/OAI/openarchivesprotocol.html

mloesch commented 8 years ago

With release 0.6 it is now possible to explicitly specify the encoding when instantiating the Sickle object: http://sickle.readthedocs.io/en/v0.6/api.html#sickle.app.Sickle

mloesch commented 8 years ago

Make that release 0.6.1; 0.6 is lost (I had some issues with PyPI): http://sickle.readthedocs.io/en/v0.6.1/api.html#sickle.app.Sickle