@edsu, I have contacted one of their developers (Arif Shaon) and invited him to take a look at this ticket and help us come up with a strategy for solving this issue.
@jacobthill I'm assigning this to you for now pending the results of your comment above and any information from Arif. Once we have more information, we can resolve or write more tickets.
@jacobthill if it would be helpful for Arif to chat about this problem and how to reproduce it, let me know; I'd be happy to do a call.
Actually, it looks like the script above is working again (no server exceptions), but it runs for a long time! Maybe Arif was able to fix something on their side?
bin/get qnl qnl --limit 100
also generates some CSV output.
I don't think QNL has made any changes; at least they haven't told me of any. This is pretty typical behavior in my experience: sometimes collections will complete and sometimes the server will throw an error. I reminded Arif of this issue, so hopefully we'll hear back soon, but I'm hoping that if we improve our HTTP request retries and data partners do what they can on their side, we will see this issue a lot less.
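On the retry side, Sickle can retry failed requests itself via its constructor options; a minimal sketch, assuming we keep using Sickle as the client (the endpoint URL and the specific values below are placeholders, not settings we've agreed on):

```python
from sickle import Sickle

# Sketch only: have Sickle retry failed requests on its own.
# The endpoint URL and the values below are placeholders.
oai = Sickle(
    "https://example.org/oai",  # placeholder for the QNL OAI-PMH endpoint
    max_retries=5,              # retry a failed request up to 5 times
    default_retry_after=30,     # seconds to wait when the server sends no Retry-After header
)
```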
I ran the script above using Sickle to iterate through all the responses. It ran for a little over two hours, found 44,570 records, but died with this exception about the resumption token being invalid:
Traceback (most recent call last):
  File "/Users/edsummers/Projects/sul-dlss/dlme-airflow/qnl.py", line 5, in <module>
    for rec in oai.ListRecords(metadataPrefix="mods_no_ocr"):
  File "/Users/edsummers/Library/Caches/pypoetry/virtualenvs/dlme-airflow-4uPLCoq4-py3.10/lib/python3.10/site-packages/sickle/iterator.py", line 52, in __next__
    return self.next()
  File "/Users/edsummers/Library/Caches/pypoetry/virtualenvs/dlme-airflow-4uPLCoq4-py3.10/lib/python3.10/site-packages/sickle/iterator.py", line 151, in next
    self._next_response()
  File "/Users/edsummers/Library/Caches/pypoetry/virtualenvs/dlme-airflow-4uPLCoq4-py3.10/lib/python3.10/site-packages/sickle/iterator.py", line 138, in _next_response
    super(OAIItemIterator, self)._next_response()
  File "/Users/edsummers/Library/Caches/pypoetry/virtualenvs/dlme-airflow-4uPLCoq4-py3.10/lib/python3.10/site-packages/sickle/iterator.py", line 91, in _next_response
    raise getattr(
sickle.oaiexceptions.BadResumptionToken: The value of the resumptionToken argument is invalid or expired.
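For reference, the script was essentially just Sickle walking through ListRecords; a minimal reconstruction of it (the endpoint URL here is a placeholder, not QNL's actual one):

```python
# qnl.py -- minimal reconstruction of the harvest script in the traceback above.
# The endpoint URL is a placeholder; the metadataPrefix matches the traceback.
from sickle import Sickle

oai = Sickle("https://example.org/oai")  # placeholder QNL OAI-PMH endpoint

count = 0
for rec in oai.ListRecords(metadataPrefix="mods_no_ocr"):
    count += 1

print(count)
```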
Apologies for the delay in replying. I have had our IT team look into this issue, and they discovered that it is caused by the low threshold (1000 requests per 30 mins) set on the number of HTTP requests to the OAI-PMH Azure function. Unfortunately, they cannot increase the threshold as it has cost implications. So they have advised adding a minimum 2-second delay between every HTTP GET request from the same IP address. Please let me know if that solves the problem.
30 minutes × 60 seconds = 1800 seconds, and at one request every 2 seconds that's 900 records per 30 minutes. We have definitely been able to harvest more than 1000 records per 30 minutes in the last couple of weeks. @arifshaon did you maybe mean 10,000? A 2-second delay would get us under 1000 per 30 minutes but take ~10 hours to complete, which I guess is OK.
Just FYI, adding a sleep timer to our OAI intake driver and allowing it to be configured in the Collection YAML should be relatively easy.
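A rough sketch of one way to wire that in, assuming we stay with Sickle in the driver; DelayedOAIItemIterator and wait_seconds are illustrative names, not existing dlme-airflow code, and the delay would ultimately come from the collection config:

```python
import time

from sickle import Sickle
from sickle.iterator import OAIItemIterator


class DelayedOAIItemIterator(OAIItemIterator):
    """Sleep before every page request so we stay under the provider's throttle."""

    wait_seconds = 2  # illustrative default; the driver would set this from the collection YAML

    def _next_response(self):
        time.sleep(self.wait_seconds)
        super()._next_response()


# The endpoint URL is a placeholder for the QNL OAI-PMH endpoint.
oai = Sickle("https://example.org/oai", iterator=DelayedOAIItemIterator)
for record in oai.ListRecords(metadataPrefix="mods_no_ocr"):
    pass  # hand each record off to the rest of the intake pipeline
```

Setting the delay as a class attribute is a little clunky, but since Sickle constructs the iterator internally, the driver could just set DelayedOAIItemIterator.wait_seconds from the collection config right before it starts harvesting.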
Thanks @edsu. I think we should go ahead and do it. We will need it for other providers and I can work out the timing with each provider. I opened a ticket https://github.com/sul-dlss/dlme-airflow/issues/235
Hi @jacobthill - as far as I understand from our IT infrastructure guys, HTTP GET requests to the OAI-PMH app are throttled at 1000 per 30 mins. This is not 1000 metadata records in an OAI-PMH ListRecords response but 1000 HTTP GET requests to the app. For example, if you submit 999 ListRecords verbs followed by 2 Identify verbs within a 30-minute period, the 2nd Identify request may be throttled. Hope this helps.
Here's a screen shot of an error from QNL's OAI-PMH server:
I've noticed that you can trigger these errors programmatically by using Sickle to walk through the records that come back from a ListRecords call. It doesn't seem to be deterministic when the errors get thrown. Perhaps we could put a sleep in between requests to be gentler? Or maybe there's a fix they can do on their side?