mloesch / sickle

Sickle: OAI-PMH for Humans

Sickle not retrieving all records from repository #45

Closed SigurdG closed 3 years ago

SigurdG commented 4 years ago

I have been working on retrieving all records from the OAI-PMH repositories of various research institutions using Sickle in Python. I have written code that harvests the repositories one after another, iterating over the records of each and saving them both as XML files and into a SQL database. Below is an excerpt of the code that harvests the OAI repository of one smaller research institution.

However, for some reason I am unable to retrieve all the records in the repositories. In the example below, for one institution, I am only able to retrieve around 2,900 records even though the completeListSize was 4,041 the last time I checked. If I use the from parameter and perform a series of selective harvests by date in a loop, I am able to retrieve some additional records, but not all of them.

The OAI interface appears to send back an empty resumptionToken, which indicates that all records have been retrieved, and therefore no errors are raised. I suspect that some of the records in the OAI repository are somehow empty or incomplete, and that Sickle therefore believes that all records in the repository have been retrieved. A similar but not identical issue with resumptionTokens was raised in #25, but in that case Sickle raised an error.
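To see what the server itself reports, the resumptionToken element of a raw ListRecords response can be inspected directly. The following is a minimal sketch using only the Python standard library. The function name `read_resumption_token` and the sample response are illustrative assumptions (the numbers are made up for the example), and nothing here is part of Sickle's API.

```python
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def read_resumption_token(xml_text):
    """Return (token, complete_list_size, cursor) from an OAI-PMH response.

    Returns None if the response has no resumptionToken element.
    An empty token string means the server considers the list complete.
    """
    root = ET.fromstring(xml_text)
    elem = root.find(f".//{OAI_NS}resumptionToken")
    if elem is None:
        return None
    token = (elem.text or "").strip()
    size = elem.get("completeListSize")
    cursor = elem.get("cursor")
    return (
        token,
        int(size) if size is not None else None,
        int(cursor) if cursor is not None else None,
    )


# Made-up sample response: an empty token although the cursor is below
# completeListSize, which is the situation described above.
sample = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken completeListSize="4041" cursor="2900"></resumptionToken>
  </ListRecords>
</OAI-PMH>"""

token, size, cursor = read_resumption_token(sample)
if token == "" and size is not None and cursor is not None and cursor < size:
    print("Server ended the list early:", cursor, "of", size)
```

If the final page really carries an empty token while the cursor is well below completeListSize, the truncation happens on the server side rather than in the harvesting client.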

Would it be possible to solve this by adding a parameter that skips empty records, repeats the request, or something along those lines?
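Until the cause is known, one way to narrow the problem down is to compare the identifiers returned by ListIdentifiers with those actually delivered by ListRecords. This is a rough sketch, not a fix: the helper `missing_identifiers` and the example identifiers are hypothetical.

```python
def missing_identifiers(header_ids, record_ids):
    """Return identifiers that were listed but never harvested.

    The result keeps the order in which the identifiers were listed.
    """
    seen = set(record_ids)
    return [identifier for identifier in header_ids if identifier not in seen]


# Hypothetical identifiers, for illustration only.
listed = ["oai:example:1", "oai:example:2", "oai:example:3"]
harvested = ["oai:example:1", "oai:example:3"]

print(missing_identifiers(listed, harvested))
```

With Sickle, the two lists could be built from `header.identifier` for each ListIdentifiers result and `record.header.identifier` for each ListRecords result. Each missing identifier could then be requested individually with `sickle.GetRecord(identifier=..., metadataPrefix='ddf-mxd')` to check whether the record itself is empty or malformed.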


from sickle import Sickle
import re
import uuid
import pyodbc
import xml.dom.minidom
import xml.sax

api_list = [
    "https://pure.itu.dk/ws/oai",
]

# OAI-PMH datestamps use the ISO 8601 form YYYY-MM-DD.
date = "2020-08-01"
last_retrieval = "1950-01-01"

for api in api_list:
    institution = inst_institution(api)  # helper assumed to be defined elsewhere in the full script
    record_total = 0
    record_fail_total = 0
    sickle = Sickle(api)

    harvest_id = uuid.uuid4()  # random ID for this harvest

    recs = sickle.ListRecords(**{'metadataPrefix': 'ddf-mxd', 'from': last_retrieval, 'until': date})
    headers = sickle.ListIdentifiers(**{'metadataPrefix': 'ddf-mxd', 'from': last_retrieval, 'until': date})
    for header in headers:
        record_total = record_total + 1
        try:
            r = next(recs)
        except StopIteration:  # Sickle iterators signal exhaustion with StopIteration, not IndexError
            record_fail_total = record_fail_total + 1
            # Failed records are saved to the SQL table "records_failed".
            # failed_record_function and day_of_harvest are assumed to be defined elsewhere in the full script.
            failed_record_function(harvest_id, last_retrieval, date, api, institution, record_fail_total, day_of_harvest)
            continue

        rec_id = re.search('rec_id="(.+?)" rec_created=', str(r)).group(1)
        print(str(record_total) + " - " + str(rec_id) + " - " + str(institution))
        # Save an XML file for each record.
        Fil_placering = r"C:\Users\sigur\OneDrive\Skrivebord\Data\\itu\\" + str(rec_id) + ".xml"
        with open(Fil_placering, "w", encoding="UTF-8") as text_file:
            print(str(r), file=text_file)
mloesch commented 3 years ago

I cannot reproduce this. For me, the resumption token reports 4404 records: https://pure.itu.dk/ws/oai?verb=ListRecords&metadataPrefix=ddf-mxd

In [1]: from sickle import Sickle

In [2]: s = Sickle("https://pure.itu.dk/ws/oai")

In [3]: len(list(s.ListRecords(metadataPrefix='ddf-mxd')))
Out[3]: 4404