I believe this supersedes issue 20. The solution is the same but I've dug around a bit to explain why.
Reproducing the bug
from sickle import Sickle

record = (Sickle('https://archive-it.org/oai')
          .GetRecord(identifier='http://archive-it.org/collections/2323',
                     metadataPrefix='oai_dc'))
print(record.metadata, '\n\n', record.metadata['description'][0][124:137])
{'title': ['Jasmine Revolution - Tunisia 2011'], 'subject': ['spontaneousEvents', 'blogsAndSocialMedia', 'government-National'], 'description': ['This collection consists of websites documenting the revolution in Tunisia in 2011. Our partners at Library of Congress and Bibliothèque Nationale de France have contributed websites for this collection, and the sites are primarily in French and Arabic with some in English.'], 'identifier': ['http://archive-it.org/collections/2323']}
Bibliothèque
The problem
requests tries to be clever and detect the encoding from the HTTP headers
but it doesn't look at the explicit "encoding" attribute in the XML declaration! (cf. requests docs)
thus, response.text is an incorrectly decoded version of response.content
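The mis-decoding can be sketched without any network access. The payload below is a hypothetical stand-in for the OAI response body, and the ISO-8859-1 fallback is an assumption about what requests does when the server sends a text/* Content-Type without a charset parameter:

```python
# Hypothetical payload standing in for the OAI response body (not real server output).
raw = '<?xml version="1.0" encoding="UTF-8"?><d>Bibliothèque</d>'.encode('utf-8')

# Assumption: requests falls back to ISO-8859-1 for text/* responses
# that carry no charset, so response.text is built roughly like this.
wrong = raw.decode('iso-8859-1')

# What the XML declaration actually asks for.
right = raw.decode('utf-8')

print(wrong)
print(right)
```

The two-byte UTF-8 sequence for "è" becomes two separate characters under ISO-8859-1, which would explain why the slice in the reproduction needs 13 characters rather than 12.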
The solution
pass response.content (the raw response bytestring) to lxml instead of re-encoding response.text
when lxml is given bytes, it reads the encoding declaration in the XML prolog and decodes the document accordingly
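import requests
from lxml import etree

# Assumed: the same GetRecord request that Sickle issues in the reproduction above
response = requests.get('https://archive-it.org/oai?verb=GetRecord&metadataPrefix=oai_dc'
                        '&identifier=http://archive-it.org/collections/2323')
tree = etree.XML(response.content)  # raw bytes, so lxml honours the XML declaration
# child indexing is equivalent to the deprecated getchildren() chain
print(tree[2][0][1][0][4].text[124:136])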
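To compare the two code paths side by side, here is a minimal offline sketch. It swaps in the standard library's xml.etree.ElementTree for lxml so it runs without extra dependencies, and the payload is again a hypothetical stand-in for the server response; the ISO-8859-1 step is an assumption about how response.text was produced:

```python
import xml.etree.ElementTree as ET

# Hypothetical payload standing in for response.content.
raw = '<?xml version="1.0" encoding="UTF-8"?><d>Bibliothèque</d>'.encode('utf-8')

# Proposed path: hand the raw bytes to the parser, which reads the
# encoding declaration itself.
from_bytes = ET.fromstring(raw).text

# Old path (assumed): decode with the wrong codec, then re-encode as UTF-8
# before parsing. The mojibake is now valid UTF-8, so the parser keeps it.
mis_decoded = raw.decode('iso-8859-1')
from_text = ET.fromstring(mis_decoded.encode('utf-8')).text

print(from_bytes)
print(from_text)
```

Once the bytes are passed through untouched, the parser is the only component that decides the encoding, and it has the XML declaration to go on.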
More info: https://github.com/requests/requests/issues/1604