mloesch / sickle

Sickle: OAI-PMH for Humans

Resumption Token with until #56

Closed edsu closed 1 year ago

edsu commented 2 years ago

I'm not sure if this is a problem with a particular OAI endpoint I am working with, or with Sickle (although I'm leaning towards the former). I'm trying to selectively harvest an endpoint using an until timestamp:

import logging
from sickle import Sickle

logging.basicConfig(level=logging.DEBUG)

sickle = Sickle("https://api.qdl.qa/api/oaipmh")
records = sickle.ListRecords(
    metadataPrefix='mods_no_ocr',
    until="2019-10-15T19:00:00Z"
)

for rec in records:
    print(rec.header.xml.find('{http://www.openarchives.org/OAI/2.0/}datestamp').text)

When I run this I see:

DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.qdl.qa:443
DEBUG:urllib3.connectionpool:https://api.qdl.qa:443 "GET /api/oaipmh?metadataPrefix=mods_no_ocr&until=2019-10-15T19%3A00%3A00Z&verb=ListRecords HTTP/1.1" 200 None
2019-10-15T16:43:48.818Z
2019-10-15T16:43:48.818Z
2019-10-15T16:45:27.094Z
2019-10-15T16:45:27.094Z
2019-10-15T16:46:40.424Z
2019-10-15T16:46:40.424Z
2019-10-15T16:52:13.539Z
2019-10-15T16:52:13.539Z
2019-10-15T17:08:29.977Z
2019-10-15T17:08:29.977Z
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.qdl.qa:443
DEBUG:urllib3.connectionpool:https://api.qdl.qa:443 "GET /api/oaipmh?resumptionToken=10mods_no_ocr&verb=ListRecords HTTP/1.1" 200 None
2019-10-15T16:52:13.539Z
2019-10-15T16:52:13.539Z
2019-10-15T17:08:29.977Z
2019-10-15T17:08:29.977Z
2019-10-15T18:46:15.172Z
2019-10-15T18:46:15.172Z
2020-05-24T04:16:31.944Z
2020-05-24T04:16:31.944Z
2019-10-15T18:52:19.668Z
2019-10-15T18:52:19.668Z
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.qdl.qa:443
DEBUG:urllib3.connectionpool:https://api.qdl.qa:443 "GET /api/oaipmh?resumptionToken=20mods_no_ocr&verb=ListRecords HTTP/1.1" 200 None
2019-10-15T18:52:34.072Z
2019-10-15T18:52:34.072Z
2019-10-15T18:52:46.162Z
2019-10-15T18:52:46.162Z
2019-10-15T18:53:08.176Z
2019-10-15T18:53:08.176Z
2019-10-15T18:53:28.807Z
2019-10-15T18:53:28.807Z
2019-11-07T12:37:23.928Z
2019-11-07T12:37:23.928Z
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.qdl.qa:443
DEBUG:urllib3.connectionpool:https://api.qdl.qa:443 "GET /api/oaipmh?resumptionToken=30mods_no_ocr&verb=ListRecords HTTP/1.1" 200 None
2019-10-15T18:55:50.294Z
2019-10-15T18:55:50.294Z
2020-05-27T08:48:04.998Z
2020-05-27T08:48:04.998Z
2019-11-13T10:41:20.582Z
2019-11-13T10:41:20.582Z
2019-11-14T10:39:25.215Z
2019-11-14T10:39:25.215Z
2020-08-28T13:04:28.351Z
2020-08-28T13:04:28.351Z
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.qdl.qa:443
DEBUG:urllib3.connectionpool:https://api.qdl.qa:443 "GET /api/oaipmh?resumptionToken=40mods_no_ocr&verb=ListRecords HTTP/1.1" 200 None
2019-11-13T16:59:41.238Z
2019-11-13T16:59:41.238Z
2019-11-13T14:19:06.199Z
2019-11-13T14:19:06.199Z
2020-08-28T13:21:22.953Z
2020-08-28T13:21:22.953Z
2020-09-28T15:30:41.700Z
2020-09-28T15:30:41.700Z
2020-09-28T15:48:37.471Z
2020-09-28T15:48:37.471Z
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.qdl.qa:443
DEBUG:urllib3.connectionpool:https://api.qdl.qa:443 "GET /api/oaipmh?resumptionToken=50mods_no_ocr&verb=ListRecords HTTP/1.1" 200 None
2022-02-14T20:44:17.077Z
2022-02-14T20:44:17.077Z
2022-02-14T20:44:24.159Z
2022-02-14T20:44:24.159Z
2022-02-14T20:44:17.142Z
2022-02-14T20:44:17.142Z
2022-02-14T20:44:52.422Z
2022-02-14T20:44:52.422Z
2022-02-14T20:44:59.224Z
2022-02-14T20:44:59.224Z
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.qdl.qa:443
DEBUG:urllib3.connectionpool:https://api.qdl.qa:443 "GET /api/oaipmh?resumptionToken=60mods_no_ocr&verb=ListRecords HTTP/1.1" 200 None
2022-03-28T11:08:22.770Z
2022-03-28T11:08:22.770Z
2022-02-14T20:45:42.845Z
2022-02-14T20:45:42.845Z
2022-02-14T20:46:43.260Z
2022-02-14T20:46:43.260Z
2022-02-14T20:46:54.444Z
2022-02-14T20:46:54.444Z
2022-05-25T05:29:34.465Z
2022-05-25T05:29:34.465Z
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.qdl.qa:443

The timestamps for the records clearly show that the server stops respecting the until value once it starts paging with the resumptionToken. But I noticed that if I manually craft a URL that includes until alongside the resumptionToken, it seems to work properly: it returns the next 10 records in the set of 52.

https://api.qdl.qa/api/oaipmh?resumptionToken=10mods_no_ocr&verb=ListRecords&until=2019-10-15T19%3A00%3A00Z
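For reference, here is a minimal sketch of how a URL like the one above could be assembled with the standard library. build_resume_url is a hypothetical helper written for this illustration, not part of Sickle, and sending until together with resumptionToken is a server-specific workaround rather than standard OAI-PMH.

```python
from urllib.parse import urlencode

# Endpoint taken from the report above.
BASE_URL = "https://api.qdl.qa/api/oaipmh"


def build_resume_url(token, until):
    """Build a ListRecords URL that repeats `until` next to the token.

    Hypothetical helper: this argument combination is non-standard and
    only mirrors the hand-crafted URL from this issue.
    """
    params = {
        "resumptionToken": token,
        "verb": "ListRecords",
        "until": until,
    }
    # urlencode percent-encodes the colons in the timestamp (":" -> "%3A").
    return BASE_URL + "?" + urlencode(params)


url = build_resume_url("10mods_no_ocr", "2019-10-15T19:00:00Z")
print(url)
```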

My understanding from the specification is that calls to ListRecords with a resumptionToken shouldn't include until, because resumptionToken is an exclusive argument. So it appears that Sickle is behaving properly and the server is broken?
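To make that reading of the specification concrete, here is a rough sketch of the argument rule for ListRecords as I understand it: resumptionToken is exclusive, and otherwise metadataPrefix is required with from, until and set optional. check_list_records_args is a hypothetical function for illustration only; it is not how Sickle or this server validates requests.

```python
# Arguments ListRecords accepts when no resumptionToken is given.
ALLOWED_INITIAL_ARGS = {"metadataPrefix", "from", "until", "set"}


def check_list_records_args(params):
    """Return None if the arguments are acceptable, else "badArgument".

    Hypothetical validator sketching the exclusivity rule discussed above.
    """
    # The verb itself is not counted as an argument.
    args = {k: v for k, v in params.items() if k != "verb"}

    if "resumptionToken" in args:
        # Exclusive: nothing else may accompany the token.
        return None if len(args) == 1 else "badArgument"

    if "metadataPrefix" not in args:
        return "badArgument"

    # Reject any argument the verb does not define.
    if set(args) - ALLOWED_INITIAL_ARGS:
        return "badArgument"
    return None
```

Under this rule the first request Sickle sends (metadataPrefix plus until) is valid, the follow-up requests with only a resumptionToken are valid, and the hand-crafted URL with both resumptionToken and until would be rejected by a strict server.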

Any help confirming this conclusion would be greatly appreciated.

PS. Thank you for a rock solid and extensible OAI-PMH library!

arifshaon commented 1 year ago

Hi @edsu, apologies for the long delay in getting back to you on this. Looking at the server implementation, it does seem to support a combination of resumptionToken and until, as it re-uses the logic for ListIdentifiers before fetching the requested records. Please see the code snippets below:

def listRecords(self, metadataPrefix, resumptionToken=None,
                    from_=None, until=None):
        """Get a list of header, metadata and about information on records.

        Args:
            metadataPrefix (string): identifies metadata set to retrieve
            resumptionToken (string): the resumptionToken

        Should raise error.CannotDisseminateFormatError if metadataPrefix
        is not supported by the repository.

        Should raise error.NoSetHierarchyError if the repository does not
        support sets.

        Returns:
            string: the response
        """
        root = self.getRootLxmlNamespace()

        request = ET.Element(
            'request',
            verb='ListRecords')
        request.text = BASE_URL
        request.attrib['metadataPrefix'] = metadataPrefix

        root.append(request)

        listRecords = ET.Element('ListRecords')

        start = int(resumptionToken)

        identifiers_data = self.solr.get_list_identifiers(start, from_, until)
        identifiers = identifiers_data['docs']
        numFound = identifiers_data['numFound']

        if ((start + SOLR_ROWS) > numFound):
            raise ErrorHandler(ErrorCode.BAD_RESUMPTIONTOKEN, None)

        for identifier in identifiers:
            language = self.get_language_from_identifier(identifier['id'])
            if not language:
                raise ErrorHandler(ErrorCode.ID_DOES_NOT_EXIST, None)

            ead_data, image_path, source_content_type, userestrict = \
                self.solr.get_metadata(identifier['id'][:-3], language)
            if language == 'en':
                source_content_type_en = source_content_type
            else:
                source_content_type_en = \
                    self.solr.get_metadata(identifier['id'][:-3], 'en')[2]
            if not ead_data:
                raise ErrorHandler(ErrorCode.ID_DOES_NOT_EXIST,
                                   None,
                                   {'verb': 'ListRecords',
                                    'identifier': identifier})

            mods = self.GetRecordData(identifier['id'][:-3], language,
                                      ead_data, metadataPrefix, image_path,
                                      source_content_type,
                                      source_content_type_en, userestrict)
            listRecords.append(mods)

        resumptionToken_element = ET.Element('resumptionToken')
        resumptionToken_element.attrib['completeListSize'] = str(numFound)
        resumptionToken_element.attrib['cursor'] = str(start)
        if (start + SOLR_ROWS) != numFound:
            resumptionToken_element.text = str(start + SOLR_ROWS) +  \
                                           metadataPrefix
        listRecords.append(resumptionToken_element)

        root.append(listRecords)
        return ET.tostring(root, pretty_print=True, xml_declaration=True,
                           encoding='utf-8').decode()

if querystring['verb'].lower() == 'listrecords':
    logging.info("Listing records...")
    if 'resumptiontoken' in querystring:
        if 'metadataprefix' in querystring:
            raise ErrorHandler(ErrorCode.BAD_ARGUMENT, event)
        else:
            return OaiPmh.listRecords(
                    metadataPrefix,
                    resumptionToken,
                    querystring.get('from', None),
                    querystring.get('until', None))
    elif 'metadataprefix' in querystring:
        return OaiPmh.listRecords(
            querystring['metadataprefix'],
            querystring.get('resumptiontoken', 0),
            querystring.get('from', None),
            querystring.get('until', None))
    else:
        raise ErrorHandler(ErrorCode.BAD_ARGUMENT, event)
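Note that the token built above is just the next offset followed by the metadataPrefix, so the from/until values only survive if the client sends them again. One possible way for a server to avoid that, sketched here purely as an assumption and not as what this server does, is to pack the filter into the token itself. encode_token and decode_token are hypothetical names.

```python
import base64
import json


def encode_token(cursor, metadata_prefix, from_=None, until=None):
    """Pack paging state and the original date filter into an opaque token."""
    state = {
        "cursor": cursor,
        "metadataPrefix": metadata_prefix,
        "from": from_,
        "until": until,
    }
    raw = json.dumps(state, separators=(",", ":")).encode("utf-8")
    # URL-safe base64 keeps the token usable as a query-string value.
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_token(token):
    """Recover the state dict stored by encode_token."""
    raw = base64.urlsafe_b64decode(token.encode("ascii"))
    return json.loads(raw)


token = encode_token(10, "mods_no_ocr", until="2019-10-15T19:00:00Z")
state = decode_token(token)
```

With something like this, a follow-up request carrying only the resumptionToken would still give the server the original until value to pass to its query.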

edsu commented 1 year ago

It seems that this is a problem with the OAI server in question and not a problem with Sickle, so I'm closing this.
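For anyone harvesting a server with the same behaviour, a client-side workaround is to drop records whose datestamp falls after the requested until value. This is only a sketch under the assumption that datestamps look like the ones in the log above (UTC, with or without fractional seconds); parse_datestamp and within_until are hypothetical helpers, not part of Sickle.

```python
from datetime import datetime, timezone


def parse_datestamp(text):
    """Parse '2019-10-15T16:43:48.818Z' or '2019-10-15T19:00:00Z' as UTC."""
    fmt = "%Y-%m-%dT%H:%M:%S.%fZ" if "." in text else "%Y-%m-%dT%H:%M:%SZ"
    return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)


def within_until(datestamps, until):
    """Keep only the datestamps that are not later than `until`."""
    limit = parse_datestamp(until)
    return [d for d in datestamps if parse_datestamp(d) <= limit]


# Three datestamps taken from the second page of the log above.
page = [
    "2019-10-15T18:46:15.172Z",
    "2020-05-24T04:16:31.944Z",
    "2019-10-15T18:52:19.668Z",
]
kept = within_until(page, "2019-10-15T19:00:00Z")
```

Because the server does not return records in datestamp order, the filter cannot be used to stop the harvest early; every page still has to be fetched and checked.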