Closed jacobthill closed 5 years ago
You are only loading the first page by calling records.next()
once per set.
Try iterating over the pages instead:
for page in records:
    # save the page
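For anyone comparing the two approaches, here is a minimal sketch of the difference. It does not use Sickle at all: `fake_records` is a hypothetical stand-in for the iterator that `ListRecords` returns, and the identifiers are made up for illustration.

```python
def fake_records():
    """Hypothetical stand-in for a Sickle record iterator (not the real API)."""
    for identifier in ["rec-1", "rec-2", "rec-3"]:
        yield identifier

# Calling next() once returns only the first item and then stops.
first_only = next(fake_records())

# A for loop keeps pulling items until the iterator is exhausted.
all_items = [record for record in fake_records()]

print(first_only)
print(all_items)
```

The same pattern should apply to the real iterator: one `next()` call per set yields one item per set, while a loop drains the whole set.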
Thank you, that worked. However, a 'StopIteration' exception is now raised when the script hits the last page and there is no resumptionToken, so I can't iterate over a list of sets, harvest each set, and write each one to a separate file.
Traceback (most recent call last):
  File "sakip-sabanci-harvest.py", line 31, in <module>
    f.write(to_str(records.next().raw.encode('utf8')))
  File "/anaconda3/envs/work/lib/python3.7/site-packages/sickle/iterator.py", line 115, in next
    raise StopIteration
StopIteration
I can work around this if necessary by writing a separate script for each set, but is there another way to achieve this that allows me to loop over the list of sets?
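One way to loop over sets without hitting that exception is to let a `for` loop consume each set's iterator, since `for` treats StopIteration as the normal end of iteration rather than as an error. The sketch below is only an illustration under stated assumptions: `FAKE_SETS` and `harvest_set` are hypothetical stand-ins for `sickle.ListSets()` and `sickle.ListRecords(...)`, and the output goes to a temporary directory.

```python
import os
import tempfile

# Hypothetical data standing in for sets and their records (not real Sickle output).
FAKE_SETS = {
    "setA": ["<record>a1</record>", "<record>a2</record>"],
    "setB": ["<record>b1</record>"],
}

def harvest_set(set_spec):
    """Hypothetical stand-in for sickle.ListRecords(set=set_spec, ...)."""
    return iter(FAKE_SETS[set_spec])

def harvest_all(out_dir):
    """Write each set's records to its own file and return record counts per set."""
    counts = {}
    for set_spec in FAKE_SETS:
        records = harvest_set(set_spec)
        path = os.path.join(out_dir, set_spec + ".txt")
        with open(path, "w", encoding="utf-8") as f:
            count = 0
            # The for loop ends quietly when the iterator raises StopIteration.
            for raw in records:
                f.write(raw + "\n")
                count += 1
        counts[set_spec] = count
    return counts

out_dir = tempfile.mkdtemp()
counts = harvest_all(out_dir)
print(counts)
```

If an explicit `next()` call is really needed, the built-in form `next(records, None)` returns a default instead of raising when the iterator is exhausted.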
Also, the XML declaration is written to the file with each resumptionToken, which causes lxml to complain when I try to parse the file.
</OAI-PMH><?xml version="1.0" encoding="UTF-8"?><OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responseDate>2019-02-06T16:59:29Z</responseDate><request verb="ListRecords" resumptionToken="Kitapvehat:600:Kitapvehat:0000-00-00:9999-99-99:oai_dc" metadataPrefix="oai_dc" set="Kitapvehat">http://cdm21044.contentdm.oclc.org/oai/oai.php</request>
How can I write the records without the XML declaration being written multiple times to the output file?
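One possible approach, sketched under assumptions: rather than appending each full response (declaration included) to the file, collect the individual record elements under a single root and serialize once. This sketch uses the standard library's `xml.etree.ElementTree` rather than lxml, and the `harvest` root name and the sample record strings are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical raw record strings, standing in for record.raw values.
raw_records = [
    "<record><header><identifier>oai:example:1</identifier></header></record>",
    "<record><header><identifier>oai:example:2</identifier></header></record>",
]

def combine_records(raw_records, root_tag="harvest"):
    """Wrap record fragments in one root element so the result is a single document."""
    root = ET.Element(root_tag)
    for raw in raw_records:
        root.append(ET.fromstring(raw))
    # xml_declaration=True (Python 3.8+) emits one declaration for the whole document.
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)

combined = combine_records(raw_records)
reparsed = ET.fromstring(combined)
print(len(reparsed.findall("record")))
```

Because the declaration is written once at serialization time, the resulting bytes parse as a single well-formed document.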
Never mind, I think I have it figured out. Here is my code in case someone else ever has the same need:
import os
from sickle import Sickle

def to_str(bytes_or_str):
    '''Takes bytes or string and returns string'''
    if isinstance(bytes_or_str, bytes):
        return bytes_or_str.decode('utf-8')
    return bytes_or_str  # Already an instance of str

sickle = Sickle('http://cdm21044.contentdm.oclc.org/oai/oai.php')
sets = sickle.ListSets()
for s in sets:
    records = sickle.ListRecords(metadataPrefix="oai_dc", set=s.setSpec, ignore_deleted=True)
    # Each set gets its own directory: <setSpec>/data/
    out_dir = os.path.join(s.setSpec, 'data')
    os.makedirs(out_dir, exist_ok=True)  # no error if the directory already exists
    # Iterating over records follows resumption tokens until the set is exhausted
    for file_count, record in enumerate(records, start=1):
        path = os.path.join(out_dir, '{}-{}.xml'.format(s.setSpec, file_count))
        # One record per file; the with block closes the file automatically
        with open(path, 'w', encoding='utf-8') as f:
            f.write(to_str(record.raw))
This is most likely a user error but I've been through the docs, issues, etc. and can't figure this out. I am expecting this to use the resumption token to continue to retrieve the next set of records but it doesn't. Any pointers would be appreciated. Here is my code: