sul-dlss / dlme-airflow

This is a new repository to capture the work related to the DLME ETL Pipeline and establish airflow
Apache License 2.0

QNL OAI server sometimes throws exceptions #200

Closed edsu closed 2 years ago

edsu commented 2 years ago

Here's a screen shot of an error from QNL's OAI-PMH server:

[Image: Screen Shot 2022-09-02 at 5 50 12 PM]

I've noticed that you can generate these programmatically using Sickle to just walk through the records that come back from a ListRecords call. It doesn't seem to be deterministic when the errors get thrown. Perhaps we can put a sleep in between requests to be gentler? Or maybe there's a fix they can do on their side?

import sickle

# Walk every record returned by ListRecords; the server errors show up intermittently.
oai = sickle.Sickle("https://api.qdl.qa/api/oaipmh")
for rec in oai.ListRecords(metadataPrefix="mods_no_ocr"):
    print(rec.header)

jacobthill commented 2 years ago

@edsu, I have contacted one of their developers (Arif Shaon) and invited him to take a look at this ticket and help us come up with a strategy for solving this issue.

aaron-collier commented 2 years ago

@jacobthill I'm assigning this to you for now pending the results of your comment above and any information from Arif. Once we have more information, we can resolve or write more tickets.

edsu commented 2 years ago

@jacobthill if it would be helpful for Arif to chat about this problem and how to reproduce it, let me know. I'd be happy to do a call.

edsu commented 2 years ago

Actually, it looks like the script above is working again (no server exceptions), but it runs for a long time! Maybe Arif was able to fix something on their side?

bin/get qnl qnl --limit 100 also generates some CSV output.

jacobthill commented 2 years ago

I don't think QNL has made any changes; at least they haven't told me of any. This is pretty typical behavior in my experience. Sometimes collections will complete and sometimes the server will throw an error. I reminded Arif of this issue, so hopefully we'll hear back soon, but I'm hoping that if we improve our HTTP request retries and data partners do what they can, we will see this issue a lot less.
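As an illustration of the retry idea mentioned above, here is a minimal sketch of retrying a request with exponential backoff. It is not the project's implementation: `fetch_with_retries`, its parameters, and the injected `fetch` callable are hypothetical names chosen for this example, and the exception types worth retrying would depend on the HTTP client actually in use.

```python
import time


def fetch_with_retries(fetch, max_attempts=5, base_delay=1.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call fetch() until it succeeds or max_attempts is exhausted.

    Hypothetical helper: waits base_delay, 2*base_delay, 4*base_delay, ...
    seconds between attempts, then re-raises the last error.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            # Exponential backoff: the wait doubles after each failure.
            sleep(base_delay * 2 ** (attempt - 1))
```

In practice `retry_on` would probably be narrowed to transient errors (for example, timeouts or 5xx responses) so that genuine bugs are not retried.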

edsu commented 2 years ago

I ran the script above using Sickle to iterate through all the responses. It ran for a little over two hours, found 44,570 records, but died with this exception about the resumption token being invalid:

Traceback (most recent call last):
  File "/Users/edsummers/Projects/sul-dlss/dlme-airflow/qnl.py", line 5, in <module>
    for rec in oai.ListRecords(metadataPrefix="mods_no_ocr"):
  File "/Users/edsummers/Library/Caches/pypoetry/virtualenvs/dlme-airflow-4uPLCoq4-py3.10/lib/python3.10/site-packages/sickle/iterator.py", line 52, in __next__
    return self.next()
  File "/Users/edsummers/Library/Caches/pypoetry/virtualenvs/dlme-airflow-4uPLCoq4-py3.10/lib/python3.10/site-packages/sickle/iterator.py", line 151, in next
    self._next_response()
  File "/Users/edsummers/Library/Caches/pypoetry/virtualenvs/dlme-airflow-4uPLCoq4-py3.10/lib/python3.10/site-packages/sickle/iterator.py", line 138, in _next_response
    super(OAIItemIterator, self)._next_response()
  File "/Users/edsummers/Library/Caches/pypoetry/virtualenvs/dlme-airflow-4uPLCoq4-py3.10/lib/python3.10/site-packages/sickle/iterator.py", line 91, in _next_response
    raise getattr(
sickle.oaiexceptions.BadResumptionToken: The value of the resumptionToken argument is invalid or expired.

arifshaon commented 2 years ago

Apologies for the delay in replying. I had our IT team look into this issue, and they discovered that it is caused by the low threshold (1000 requests per 30 minutes) set for the number of HTTP requests to the OAI-PMH Azure function. Unfortunately, they cannot increase the threshold because it has cost implications. They have therefore advised adding a delay of at least 2 seconds between HTTP GET requests from the same IP address. Please let me know if that solves the problem.
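The advice above amounts to pausing between successive page requests. A rough sketch of that pattern follows, under stated assumptions: `harvest_with_delay` and `fetch_page` are hypothetical names, and `fetch_page(token)` is assumed to perform one HTTP GET and return a `(records, next_token)` pair, with `next_token` set to `None` on the last page. The delay is applied per request, not per record.

```python
import time


def harvest_with_delay(fetch_page, delay=2.0, sleep=time.sleep):
    """Yield records page by page, pausing `delay` seconds between requests.

    fetch_page is a hypothetical callable: given a resumption token
    (None for the first page) it returns (records, next_token).
    """
    token = None
    while True:
        records, token = fetch_page(token)  # one HTTP request per page
        yield from records
        if token is None:
            return  # last page reached, no further request needed
        sleep(delay)  # be gentle before asking for the next page
```

Passing `sleep` as a parameter keeps the loop easy to test without actually waiting.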

jacobthill commented 2 years ago

30 minutes x 60 seconds = 1800 seconds, and 1800 / 2 = 900 records per 30 minutes. We have definitely been able to harvest more than 1000 records per 30 minutes in the last couple of weeks. @arifshaon did you maybe mean 10,000? A 2-second delay would get us under 1000 per 30 minutes but take ~10 hours to complete, which I guess is OK.
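The arithmetic can be checked in a few lines. This sketch takes the stated threshold (1000 requests per 30 minutes) and the suggested 2-second delay at face value and counts HTTP requests; the variable names are chosen for this example. It does not estimate total harvest time, since that depends on how many records each response carries, which the thread does not state.

```python
WINDOW_SECONDS = 30 * 60   # 30-minute throttling window, in seconds
REQUEST_LIMIT = 1000       # threshold quoted in the thread
DELAY_SECONDS = 2          # suggested pause between requests

# With a fixed pause between requests, at most this many fit in one window.
requests_per_window = WINDOW_SECONDS // DELAY_SECONDS

# Smallest fixed pause that keeps the request count at or under the limit.
min_delay_seconds = WINDOW_SECONDS / REQUEST_LIMIT

print(requests_per_window)  # 900
print(min_delay_seconds)    # 1.8
```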

edsu commented 2 years ago

Just FYI, adding a sleep timer to our OAI intake driver and allowing it to be configured in the Collection YAML should be relatively easy.
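One possible shape for that configuration lookup is sketched below. The thread shows neither the Collection YAML schema nor the driver's interface, so the `args` mapping, the `request_delay` key, and the `get_request_delay` helper are all hypothetical; the collection is represented as an already-parsed dictionary.

```python
def get_request_delay(collection_config, default=0.0):
    """Return the per-request delay (seconds) configured for a collection.

    "args" and "request_delay" are hypothetical names used for illustration,
    not the project's actual schema.
    """
    args = collection_config.get("args") or {}
    delay = float(args.get("request_delay", default))
    if delay < 0:
        raise ValueError("request_delay must be non-negative")
    return delay


# Example: a parsed collection entry that asks for a 2-second pause.
qnl_config = {"args": {"metadata_prefix": "mods_no_ocr", "request_delay": 2}}
print(get_request_delay(qnl_config))  # 2.0
```

Defaulting to no delay would leave existing collections unchanged while letting the timing be tuned per provider.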

jacobthill commented 2 years ago

Thanks @edsu. I think we should go ahead and do it. We will need it for other providers, and I can work out the timing with each provider. I opened a ticket: https://github.com/sul-dlss/dlme-airflow/issues/235

arifshaon commented 2 years ago

Hi @jacobthill - as far as I understand from our IT infrastructure guys, HTTP GET requests to the OAI-PMH app are throttled at 1000 per 30 minutes. This is not 1000 metadata records in an OAI-PMH ListRecords response but 1000 HTTP GET requests to the app. For example, if you submit 999 ListRecords requests followed by 2 Identify requests within a 30-minute period, the second Identify request may be throttled. Hope this helps.
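To make the request-counting example concrete, here is a small simulation. It assumes a sliding-window counter with the quoted numbers; how the Azure function actually measures its window is not stated in the thread, and `SlidingWindowLimiter` is a hypothetical name for this sketch. Every request counts, whichever OAI-PMH verb it carries.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests in any `window`-second span (assumed model)."""

    def __init__(self, limit=1000, window=1800.0):
        self.limit = limit
        self.window = window
        self._times = deque()  # timestamps of accepted requests

    def allow(self, now):
        # Forget requests that have aged out of the window.
        while self._times and now - self._times[0] >= self.window:
            self._times.popleft()
        if len(self._times) >= self.limit:
            return False  # over the threshold: this request is throttled
        self._times.append(now)
        return True


limiter = SlidingWindowLimiter()
# 999 ListRecords requests followed by 2 Identify requests, one per second.
verbs = ["ListRecords"] * 999 + ["Identify"] * 2
results = [(verb, limiter.allow(now=float(i))) for i, verb in enumerate(verbs)]
print(results[-2])  # ('Identify', True)
print(results[-1])  # ('Identify', False)
```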