mloesch / sickle

Sickle: OAI-PMH for Humans

AttributeError: 'NoneType' object has no attribute 'find' #35

Closed jacobthill closed 4 years ago

jacobthill commented 4 years ago

I am unsure what the problem is, but I keep getting the following error when trying to harvest a collection from Qatar Digital Library. I have to harvest through a whitelisted server, so unfortunately no one will be able to test, but I'm hoping someone has a better instinct about why I'm getting this error and, more importantly, how to avoid it. The last time I harvested these records there were more than 32k, but I keep getting this error on record number 18,108. I would like to just pass over this record (and any other record with a similar problem) and harvest the rest of them, but the script always stops on this record. Here is the complete error message:

Traceback (most recent call last):
  File "qnl-harvest.py", line 26, in <module>
    for count, record in enumerate(records, start=1):
  File "/opt/app/harvester/.local/lib/python3.4/site-packages/sickle/iterator.py", line 52, in __next__
    return self.next()
  File "/opt/app/harvester/.local/lib/python3.4/site-packages/sickle/iterator.py", line 151, in next
    self._next_response()
  File "/opt/app/harvester/.local/lib/python3.4/site-packages/sickle/iterator.py", line 138, in _next_response
    super(OAIItemIterator, self)._next_response()
  File "/opt/app/harvester/.local/lib/python3.4/site-packages/sickle/iterator.py", line 85, in _next_response
    error = self.oai_response.xml.find(
AttributeError: 'NoneType' object has no attribute 'find'

Here is my script:

import errno, os
from sickle import Sickle
from sickle.iterator import OAIResponseIterator

# where to write data to (relative to the dlme-harvest repo folder)
base_output_folder = 'output'

sickle = Sickle('https://api.qdl.qa/oaipmh')
print("Sickle instance created.") # status update

records = sickle.ListRecords(metadataPrefix='mods', ignore_deleted=True)
print("Records created.") # status update

directory = "output/qnl/data/"
os.makedirs(os.path.dirname(directory), exist_ok=True)

for count, record in enumerate(records, start=1):
    try:
        print("Record number " + str(count))
        out_file = 'output/qnl/data/qnl-{}.xml'.format(count)
        directory_name = os.path.dirname(out_file)
        with open(out_file, 'w') as f:
            f.write(record.raw)
    except Exception as err:
        print(err)
mloesch commented 4 years ago

Cannot reproduce this because the OAI interface is restricted. I suspect that the interface returns an empty response, something like:

</>

Sickle uses an XML parser that forgives some flaws in the XML structure. This response will cause the parsed result to be None:

>>> from lxml import etree
>>> XMLParser = etree.XMLParser(remove_blank_text=True, recover=True, resolve_entities=False)
>>> type(etree.XML('</>', parser=XMLParser))
<class 'NoneType'>
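
One thing worth noting about the script above: the traceback points at the `for` line, so the exception is raised while the iterator fetches the next batch, not inside the loop body. The `try`/`except` in the script therefore never sees it. Below is a minimal sketch of a wrapper that catches errors raised by `next()` itself and retries a few times. `iterate_with_retry` is a hypothetical helper, not part of Sickle, and it assumes that calling `next()` again on the same iterator re-issues the failed request, which I have not verified against this endpoint. It can only help if the empty response is transient; it cannot skip a batch, because the resumption token for the following batch would have been in the response that failed.

```python
import time


def iterate_with_retry(iterator, max_retries=3, delay=0.0,
                       retry_on=(AttributeError,)):
    """Yield items from `iterator`, retrying when next() itself raises.

    Hypothetical helper (not part of Sickle). Assumes the iterator can be
    asked for the next item again after a failed call.
    """
    failures = 0
    while True:
        try:
            item = next(iterator)
        except StopIteration:
            # The underlying iterator is exhausted: stop cleanly.
            return
        except retry_on as err:
            failures += 1
            if failures > max_retries:
                # Give up and surface the original error.
                raise
            print("next() failed ({}), retry {}/{}".format(
                err, failures, max_retries))
            time.sleep(delay)
            continue
        # A successful fetch resets the consecutive-failure counter.
        failures = 0
        yield item


# Assumed usage with the script above (untested against the QDL endpoint):
# records = sickle.ListRecords(metadataPrefix='mods', ignore_deleted=True)
# for count, record in enumerate(iterate_with_retry(records, delay=5), start=1):
#     ...
```

If the server returns the empty response every time for the same resumption token, the wrapper re-raises the `AttributeError` after `max_retries` attempts, so the harvest still stops there, just with a clearer log of what happened.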