mloesch / sickle

Sickle: OAI-PMH for Humans

ListRecords not picking up on resumption token #29

Closed jacobthill closed 5 years ago

jacobthill commented 5 years ago

This is most likely a user error but I've been through the docs, issues, etc. and can't figure this out. I am expecting this to use the resumption token to continue to retrieve the next set of records but it doesn't. Any pointers would be appreciated. Here is my code:


```python
import errno
import os
from lxml import etree
from sickle import Sickle
from sickle.iterator import OAIResponseIterator

def to_str(bytes_or_str):
    '''Takes bytes or string and returns string'''
    if isinstance(bytes_or_str, bytes):
        value = bytes_or_str.decode('utf-8')
    else:
        value = bytes_or_str
    return value  # Instance of str

sickle = Sickle('http://cdm21044.contentdm.oclc.org/oai/oai.php', iterator=OAIResponseIterator)

sets = ['Kitapvehat', 'ResimKlksyn', 'emirgan', 'abidindino']

for item in sets:
    records = sickle.ListRecords(metadataPrefix='oai_dc', set=item)
    file_name = '{}/data/{}.xml'.format(item, item)
    if not os.path.exists(os.path.dirname(file_name)):
        try:
            os.makedirs(os.path.dirname(file_name))
        except OSError as exc: # Guard against race condition
            if exc.errno != errno.EEXIST:
                raise

    with open(file_name, 'w') as f:
        f.write(to_str(records.next().raw.encode('utf8')))

    f.close()
```

mloesch commented 5 years ago

You are only loading the first page by calling records.next() once per set.

Try iterating over the pages instead:

```python
for page in records:
    # save the page
```

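A minimal sketch of what "save the page" could look like, assuming the `OAIResponseIterator` setup from the original post. `save_pages` and `harvest` are hypothetical helper names, not part of Sickle, and the one-file-per-page naming scheme is an assumption as well.

```python
import os


def save_pages(pages, directory, stem):
    """Write each page's raw XML to <directory>/<stem>-<n>.xml and return the paths.

    `pages` can be any iterable of objects with a `.raw` string attribute,
    such as the responses yielded when Sickle uses OAIResponseIterator.
    """
    os.makedirs(directory, exist_ok=True)
    paths = []
    for number, page in enumerate(pages, start=1):
        path = os.path.join(directory, "{}-{}.xml".format(stem, number))
        with open(path, "w", encoding="utf-8") as f:
            f.write(page.raw)
        paths.append(path)
    return paths


def harvest(endpoint, set_specs):
    """Hypothetical driver: one directory per set, one file per page."""
    # Imported here so save_pages can be tried without Sickle installed.
    from sickle import Sickle
    from sickle.iterator import OAIResponseIterator

    sickle = Sickle(endpoint, iterator=OAIResponseIterator)
    for set_spec in set_specs:
        pages = sickle.ListRecords(metadataPrefix="oai_dc", set=set_spec)
        save_pages(pages, os.path.join(set_spec, "data"), set_spec)
```

Writing each page to its own file keeps every file a single well-formed XML document.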
jacobthill commented 5 years ago

Thank you, that worked, except that a 'StopIteration' exception is now raised when the script reaches the last page, where there is no resumptionToken. As a result I can't iterate over a list of sets, harvest each set, and write each one to a separate file.

Traceback (most recent call last):
  File "sakip-sabanci-harvest.py", line 31, in <module>
    f.write(to_str(records.next().raw.encode('utf8')))
  File "/anaconda3/envs/work/lib/python3.7/site-packages/sickle/iterator.py", line 115, in next
    raise StopIteration
StopIteration

I can work around this if necessary by writing a separate script for each set, but is there another way to achieve this that allows me to loop over the list of sets?
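The traceback shows `records.next()` still being called by hand. A `for` loop already advances the iterator and ends quietly once the last page has been consumed, so an outer loop over sets can carry on. The sketch below illustrates this with plain Python iterators standing in for Sickle's; the set names are taken from the post above, while the page strings and the `drain` helper are placeholders.

```python
def drain(pages):
    """Collect every item from an iterator using a for loop."""
    collected = []
    for page in pages:  # ends cleanly at exhaustion; no StopIteration leaks out
        collected.append(page)
    return collected


# Placeholder data: iterators standing in for per-set ListRecords results.
fake_harvest = {
    "emirgan": iter(["page 1", "page 2"]),
    "abidindino": iter(["page 1"]),
}

results = {}
for set_spec, pages in fake_harvest.items():
    results[set_spec] = drain(pages)

# Calling next() on an exhausted iterator is what raises StopIteration;
# passing a default value avoids the exception.
exhausted = iter([])
last = next(exhausted, None)
```

If a manual `next()` call is really needed, the two-argument form shown at the end is one way to detect the final page without an exception.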

Also, the XML declaration is written to the file once per page (that is, with each resumptionToken), which causes lxml to complain when I try to parse the file.

 </OAI-PMH><?xml version="1.0" encoding="UTF-8"?><OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responseDate>2019-02-06T16:59:29Z</responseDate><request verb="ListRecords" resumptionToken="Kitapvehat:600:Kitapvehat:0000-00-00:9999-99-99:oai_dc" metadataPrefix="oai_dc" set="Kitapvehat">http://cdm21044.contentdm.oclc.org/oai/oai.php</request>

How can I write the records without the XML declaration being written multiple times in the output file?
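One possible approach, sketched with the standard library rather than lxml: parse each page, pull out its `record` elements, and append them to a single root, so the declaration is written only once. `merge_pages` is a hypothetical helper, and the bare `ListRecords` wrapper it produces is an assumption about what the output should look like. The result is well-formed XML but not a complete OAI-PMH response.

```python
import io
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"


def merge_pages(pages):
    """Merge the <record> elements of several OAI-PMH responses.

    `pages` is an iterable of response strings (for example `page.raw`
    for each page). Returns one XML document as UTF-8 bytes.
    """
    root = ET.Element("{%s}ListRecords" % OAI_NS)
    for page in pages:
        response = ET.fromstring(page)
        # Only records are copied; resumptionToken elements are left behind.
        for record in response.iter("{%s}record" % OAI_NS):
            root.append(record)
    buffer = io.BytesIO()
    # The declaration is emitted once, for the merged document.
    ET.ElementTree(root).write(buffer, encoding="utf-8", xml_declaration=True)
    return buffer.getvalue()
```

With Sickle configured to use `OAIResponseIterator`, this could be called as `merge_pages(page.raw for page in records)` and the returned bytes written to a file opened in `'wb'` mode.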

jacobthill commented 5 years ago

Never mind, I think I have it figured out. Here is my code in case someone else ever has the same need:

```python
import errno
import os

from sickle import Sickle

def to_str(bytes_or_str):
    '''Takes bytes or string and returns string'''
    if isinstance(bytes_or_str, bytes):
        value = bytes_or_str.decode('utf-8')
    else:
        value = bytes_or_str
    return value  # Instance of str

sickle = Sickle('http://cdm21044.contentdm.oclc.org/oai/oai.php')
sets = sickle.ListSets()
for s in sets:
    # The default iterator yields individual records and follows resumption tokens.
    records = sickle.ListRecords(metadataPrefix="oai_dc", set=s.setSpec, ignore_deleted=True)
    file_count = 1
    for record in records:
        # One file per record: <setSpec>/data/<setSpec>-<n>.xml
        file_name = '{}/data/{}-{}.xml'.format(s.setSpec, s.setSpec, file_count)
        if not os.path.exists(os.path.dirname(file_name)):
            try:
                os.makedirs(os.path.dirname(file_name))
            except OSError as exc:  # Guard against race condition
                if exc.errno != errno.EEXIST:
                    raise
        with open(file_name, 'w') as f:  # the with block closes the file
            f.write(to_str(record.raw.encode('utf8')))
        file_count += 1
```