mloesch / sickle

Sickle: OAI-PMH for Humans

Sickle throws an exception if resumption token was repeated #25

Closed weirdf0x closed 6 years ago

weirdf0x commented 6 years ago

I have a problem harvesting an OAI repository. After the last valid record set is downloaded, the repository (DSpace-based) sends the same resumption token again. This makes Sickle send another request without a metadataPrefix, which causes an error in the repository. Any idea how to fix this?

from sickle import Sickle
from sickle.iterator import OAIResponseIterator

sickle = Sickle('https://www.ssoar.info/OAIHandler/request', iterator=OAIResponseIterator)

for record_set in sickle.ListRecords(metadataPrefix='oai_genios', ignore_deleted=True):
    print(record_set)

mloesch commented 6 years ago

This repository returns a resumption token element without a body in the last response.

I don't think this implementation is correct: to my knowledge, the last response should not contain a resumption token element at all.

You could hack your way around this by monitoring the resumption token for an empty body:

from sickle import Sickle
from sickle.iterator import OAIResponseIterator

sickle = Sickle('https://www.ssoar.info/OAIHandler/request', iterator=OAIResponseIterator)

iterator = sickle.ListRecords(metadataPrefix='oai_genios')

for record_set in iterator:
    print(record_set)
    if iterator.resumption_token and not iterator.resumption_token.token:
        # A resumption token with an empty body means this is the last response.
        break
    print(iterator.resumption_token)

BTW the ignore_deleted parameter is not supported for OAIResponseIterator, so it does not have any effect.
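
Since ignore_deleted has no effect here, deleted records would have to be filtered by hand from each response. The following is only a sketch under assumptions: it works on the raw XML text of a response (with Sickle that would presumably come from the response object, which is not verified here), and the names `live_records`, `identifiers`, and `SAMPLE` are made up for illustration. It relies on the OAI-PMH convention that a deleted record carries `status="deleted"` on its header.

```python
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 response documents.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def live_records(response_xml):
    """Yield <record> elements whose header is not marked status="deleted".

    Hypothetical helper, not part of Sickle.
    """
    root = ET.fromstring(response_xml)
    for record in root.iter(OAI_NS + "record"):
        header = record.find(OAI_NS + "header")
        if header is not None and header.get("status") == "deleted":
            continue  # skip records the repository flags as deleted
        yield record


def identifiers(records):
    """Return the identifier text of each record header."""
    return [r.findtext(f"{OAI_NS}header/{OAI_NS}identifier") for r in records]


# Made-up sample response with one live and one deleted record.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example:1</identifier></header></record>
    <record><header status="deleted"><identifier>oai:example:2</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""

print(identifiers(live_records(SAMPLE)))
```

In the harvesting loop, each `record_set` would be passed through a helper like this before further processing.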

mloesch commented 6 years ago

Actually this is a bug in Sickle that has been fixed in #4 for OAIItemIterator but not for OAIResponseIterator. It will be fixed in the next release.

The workaround documented in my previous comment can be used for now.
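
For reference, the termination rule a response-level iterator needs can be sketched independently of Sickle. This is not Sickle's actual code: `next_token`, `harvest`, `fetch_page`, and the fake pages are hypothetical names used only for illustration. The idea is to treat a missing resumptionToken element and one with an empty body the same way, as the end of the list, so that no further request is sent.

```python
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 response documents.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def next_token(response_xml):
    """Return the resumption token text, or None if it is absent or empty."""
    root = ET.fromstring(response_xml)
    element = root.find(f".//{OAI_NS}resumptionToken")
    if element is None:
        return None
    token = (element.text or "").strip()
    return token or None


def harvest(fetch_page):
    """Yield response pages until one carries no usable resumption token.

    fetch_page is a hypothetical callable: fetch_page(token) -> XML string,
    where token is None for the initial request.
    """
    token = None
    while True:
        page = fetch_page(token)
        yield page
        token = next_token(page)
        if token is None:
            break  # absent or empty token: this was the last page


def _page(token_xml):
    """Build a minimal fake ListRecords response around a token element."""
    return (
        '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
        f"<ListRecords>{token_xml}</ListRecords></OAI-PMH>"
    )


# Two fake pages: the second ends the list with an empty token element.
FAKE_PAGES = {
    None: _page("<resumptionToken>page2</resumptionToken>"),
    "page2": _page('<resumptionToken completeListSize="2"/>'),
}

pages = list(harvest(FAKE_PAGES.get))
print(len(pages))
```

The loop stops after the second fake page instead of requesting a third one with an empty token, which is the behaviour the workaround above emulates by hand.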

weirdf0x commented 6 years ago

Thanks for the response, the workaround and the fix!

mloesch commented 6 years ago

Fixed in release 0.6.4.