Use response.content instead of response.text.encode("utf-8")?

sourcefilter commented 6 years ago

I believe this supersedes issue 20. The solution is the same but I've dug around a bit to explain why.

Reproducing the bug

from sickle import Sickle

record = (Sickle('https://archive-it.org/oai')
          .GetRecord(identifier='http://archive-it.org/collections/2323',
                     metadataPrefix='oai_dc'))

print(record.metadata, '\n\n', record.metadata['description'][0][124:137])

{'title': ['Jasmine Revolution - Tunisia 2011'], 'subject': ['spontaneousEvents', 'blogsAndSocialMedia', 'government-National'], 'description': ['This collection consists of websites documenting the revolution in Tunisia in 2011. Our partners at Library of Congress and BibliothÃ¨que Nationale de France have contributed websites for this collection, and the sites are primarily in French and Arabic with some in English.'], 'identifier': ['http://archive-it.org/collections/2323']} 

 BibliothÃ¨que

The problem

requests infers the encoding from the HTTP headers; for a text/* response whose Content-Type has no charset, it falls back to ISO-8859-1
it does not look at the explicit encoding declared in the XML prolog (cf. the requests docs)
thus, response.text is an incorrectly decoded version of response.content

import requests

response = requests.get('https://archive-it.org/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=http://archive-it.org/collections/2323')
print(response.content[:38], response.encoding, sep='\n')

b'<?xml version="1.0" encoding="UTF-8"?>'
ISO-8859-1

More info: https://github.com/requests/requests/issues/1604
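
The mismatch can be reproduced without a network call. This is a minimal sketch, assuming a made-up XML snippet: the `raw` bytes and the `<description>` element are illustrative stand-ins, not the actual archive-it.org response. Decoding UTF-8 bytes as ISO-8859-1 turns each multi-byte character into several wrong ones, which is where the garbled text comes from.

```python
# Illustrative bytes standing in for response.content (not the real response).
raw = ('<?xml version="1.0" encoding="UTF-8"?>'
       '<description>Bibliothèque</description>').encode('utf-8')

# Roughly what response.text holds when response.encoding is ISO-8859-1.
wrong_text = raw.decode('iso-8859-1')

# What the XML declaration actually asks for.
right_text = raw.decode('utf-8')

print('Bibliothèque' in right_text)   # the intended word survives
print('Bibliothèque' in wrong_text)   # the intended word is lost

# The damage is reversible only if the wrong codec is undone first.
repaired = wrong_text.encode('iso-8859-1').decode('utf-8')
print(repaired == right_text)
```

Note that the last step only works because ISO-8859-1 maps every byte to a character; it is simpler to avoid the wrong decode in the first place.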

The solution

pass response.content (the raw response bytes) to lxml instead of re-encoding response.text
when given bytes, lxml reads the encoding declaration in the XML prolog and decodes the document accordingly

from lxml import etree

tree = etree.XML(response.content)
# Index into the elements directly; getchildren() is deprecated in lxml.
tree[2][0][1][0][4].text[124:136]

'Bibliothèque'
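
To see why re-encoding response.text cannot repair the damage, here is a small offline sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption, made so the example runs without third-party packages; it also honours the encoding declaration when given bytes), and the `raw` document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative bytes standing in for response.content (not the real response).
raw = ('<?xml version="1.0" encoding="UTF-8"?>'
       '<description>Bibliothèque</description>').encode('utf-8')

# Old approach: decode with the encoding requests guessed, then re-encode as UTF-8.
# The bytes are now valid UTF-8, but they spell the wrong characters.
reencoded = raw.decode('iso-8859-1').encode('utf-8')
broken = ET.fromstring(reencoded).text

# Proposed approach: hand the parser the untouched bytes.
fixed = ET.fromstring(raw).text

print(broken)
print(fixed)
```

The parser cannot detect the problem in the first case, because the re-encoded bytes are well-formed UTF-8 and match the declaration; only the characters themselves are wrong.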

sourcefilter commented 6 years ago

I'll write a test that captures this case, then submit a pull request along with @gugek's fix!

mloesch commented 6 years ago

Fixed in the latest release 0.6.3

mloesch / sickle

Use response.content instead of response.text.encode("utf-8")? #22
