mloesch / sickle

Sickle: OAI-PMH for Humans

Use response.content instead of response.text.encode("utf-8")? #22

Closed by sourcefilter 6 years ago

sourcefilter commented 6 years ago

I believe this supersedes issue 20. The solution is the same but I've dug around a bit to explain why.

Reproducing the bug

from sickle import Sickle

record = (Sickle('https://archive-it.org/oai')
          .GetRecord(identifier='http://archive-it.org/collections/2323',
                     metadataPrefix='oai_dc'))

print(record.metadata, '\n\n', record.metadata['description'][0][124:137])
{'title': ['Jasmine Revolution - Tunisia 2011'], 'subject': ['spontaneousEvents', 'blogsAndSocialMedia', 'government-National'], 'description': ['This collection consists of websites documenting the revolution in Tunisia in 2011. Our partners at Library of Congress and BibliothÃ¨que Nationale de France have contributed websites for this collection, and the sites are primarily in French and Arabic with some in English.'], 'identifier': ['http://archive-it.org/collections/2323']} 

 BibliothÃ¨que

The problem

import requests

response = requests.get('https://archive-it.org/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=http://archive-it.org/collections/2323')
print(response.content[:38], response.encoding, sep='\n')
b'<?xml version="1.0" encoding="UTF-8"?>'
ISO-8859-1

More info: https://github.com/requests/requests/issues/1604
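
A minimal offline sketch of why that mismatch matters, assuming response.text is decoded with the ISO-8859-1 that requests reports above. The byte string is a stand-in for the server's body, not the real response, and the variable names are only illustrative:

```python
# Stand-in for the UTF-8 body the server sends (assumption: not the real response).
raw = 'Bibliothèque'.encode('utf-8')

# response.text decodes the body with response.encoding, reported as ISO-8859-1 above.
text = raw.decode('iso-8859-1')

# response.text.encode('utf-8') then encodes the already-wrong characters.
double_encoded = text.encode('utf-8')

print(text)                            # BibliothÃ¨que
print(len(raw), len(double_encoded))   # 13 15
```

The mis-decoded word is 13 characters long instead of 12, which is why the slice in the reproduction runs to 137 while the one in the solution stops at 136.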

The solution

from lxml import etree

tree = etree.XML(response.content)
# Look up dc:description by name rather than by child position
ns = {'dc': 'http://purl.org/dc/elements/1.1/'}
tree.find('.//dc:description', namespaces=ns).text[124:136]
'Bibliothèque'

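
The two parsing paths can also be compared side by side without a network call. This is only a sketch: it uses the standard library's xml.etree.ElementTree in place of lxml, and the one-element document and the names xml_bytes, mis_decoded, broken and fixed are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element document, encoded the way its declaration says.
xml_bytes = ('<?xml version="1.0" encoding="UTF-8"?>'
             '<description>Bibliothèque Nationale de France</description>'
             ).encode('utf-8')

# Stand-in for response.text when response.encoding is ISO-8859-1.
mis_decoded = xml_bytes.decode('iso-8859-1')

# Current approach: re-encode the mis-decoded text, then parse.
broken = ET.fromstring(mis_decoded.encode('utf-8')).text

# Proposed approach: hand the parser the raw bytes (response.content)
# and let it honor the encoding in the XML declaration.
fixed = ET.fromstring(xml_bytes).text

print(broken[:13])  # BibliothÃ¨que
print(fixed[:12])   # Bibliothèque
```

Passing the raw bytes leaves encoding detection to the XML parser, so the guess requests makes from the HTTP headers never comes into play.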
sourcefilter commented 6 years ago

I'll write a test that captures this case, then submit a pull request along with @gugek's fix!

mloesch commented 6 years ago

Fixed in the latest release, 0.6.3.