infrae / pyoai

The oaipmh module is a Python implementation of an "Open Archives Initiative Protocol for Metadata Harvesting"
http://pypi.python.org/pypi/pyoai

Error in makeRequestErrorHandling from a listRecords call with from_ parameter #33

Closed fxcoudert closed 6 years ago

fxcoudert commented 6 years ago

This very simple code requests records from a figshare set:

#!/usr/bin/env python3

from oaipmh.client import Client
from oaipmh.metadata import MetadataRegistry, oai_dc_reader

import datetime

registry = MetadataRegistry()
registry.registerReader('oai_dc', oai_dc_reader)
client = Client('https://api.figshare.com/v2/oai', registry)

month_ago = datetime.datetime.now() - datetime.timedelta(days=30)
for record in client.listRecords(metadataPrefix='oai_dc', set='portal_259', from_=month_ago):
  print(record[0].datestamp(), end=' ')
  print(record[1]['title'][0])

After finding several records, the code throws an exception with the following error:

Traceback (most recent call last):
  File "./toto.py", line 13, in <module>
    for record in client.listRecords(metadataPrefix='oai_dc', set='portal_259', from_=month_ago):
  File "/Users/fx/anaconda3/lib/python3.6/site-packages/oaipmh/client.py", line 365, in ResumptionListGenerator
    result, token = nextBatch(token)
  File "/Users/fx/anaconda3/lib/python3.6/site-packages/oaipmh/client.py", line 194, in nextBatch
    resumptionToken=token)
  File "/Users/fx/anaconda3/lib/python3.6/site-packages/oaipmh/client.py", line 308, in makeRequestErrorHandling
    raise getattr(error, code[0].upper() + code[1:] + 'Error')(msg)
oaipmh.error.NoRecordsMatchError: The result in an empty list.

If I remove the from_ parameter from the listRecords call, it all works fine.

fxcoudert commented 6 years ago

I should also state that, even before breaking, the server seems to return 3 records whose datestamps do not match the requested from_ parameter:

2018-05-17 13:10:54 Quantitative Characterization of Molecular-Stream Separation
2018-01-10 15:47:36 Melting of zeolitic imidazolate frameworks with different topologies: insight from first-principles molecular dynamics
2017-09-07 20:44:45 Facile Fabrication of Ultralow-Density Transparent Boehmite Nanofiber Cryogel Monoliths and Their Application in Volumetric Three-Dimensional Displays

Probably not related, and not as annoying as a crash, but still…

jascoul commented 6 years ago

When I run your code, I get 67 results, so I can't reproduce it. The NoRecordsMatch error gets raised when the server returns no results; this is part of the OAI-PMH protocol.

I also got the 3 records with the wrong timestamp. The server should not have returned those. These seem to be problems with the figshare API and not with this library.

fxcoudert commented 6 years ago

I understand that NoRecordsMatch should be returned when the server returns no results. The bug here is that, sometimes, the pyoai library raises this error even though the server did return results.

In fact, from my testing it appears the NoRecordsMatch occurs when (and only when) the number of records returned is an exact multiple of ten. I thus suspect this is a pagination bug.
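
To illustrate the suspicion, here is a minimal offline sketch of how a resumption-token loop could fail only on exact multiples of the page size. Everything in it is hypothetical: `fake_server`, `harvest`, `PAGE_SIZE` and the stand-in exception are not pyoai or figshare code, and the sketch does not establish whether the real cause is in the client or in the server. It only models one scenario consistent with the symptom: a token is handed out whenever a page is full, even when nothing follows it.

```python
class NoRecordsMatchError(Exception):
    """Stand-in for oaipmh.error.NoRecordsMatchError (keeps the sketch offline)."""


PAGE_SIZE = 10  # assumed page size, based on the "multiple of ten" observation


def fake_server(records, token=None):
    """Hypothetical server: return (page, next_token), raise on an empty page."""
    start = int(token) if token else 0
    page = records[start:start + PAGE_SIZE]
    if not page:
        raise NoRecordsMatchError("The result in an empty list.")
    # Suspected flaw: a full page always yields a token, even if it was the last page.
    next_token = str(start + PAGE_SIZE) if len(page) == PAGE_SIZE else None
    return page, next_token


def harvest(records):
    """Hypothetical client loop: follow tokens until none is returned."""
    token = None
    while True:
        page, token = fake_server(records, token)
        yield from page
        if token is None:
            break
```

Under these assumptions, 67 records harvest cleanly because the last page is short and carries no token, whereas 10 or 20 records are all delivered and then followed by one extra request that comes back empty.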

fxcoudert commented 6 years ago

Using from_ and until to craft a time range for which there are exactly 10 results shows the bug:

bli /tmp $ cat a.py 
#!/usr/bin/env python3

from oaipmh.client import Client
from oaipmh.metadata import MetadataRegistry, oai_dc_reader

import datetime

registry = MetadataRegistry()
registry.registerReader('oai_dc', oai_dc_reader)
client = Client('https://api.figshare.com/v2/oai', registry)

f = datetime.datetime.strptime('2018-07-24 14:56:00', '%Y-%m-%d %H:%M:%S')
u = datetime.datetime.strptime('2018-07-27 15:00:00', '%Y-%m-%d %H:%M:%S')
for record in client.listRecords(metadataPrefix='oai_dc', set='portal_259', from_=f, until=u):
  print(record[0].datestamp(), end=' ')
  print(record[1]['title'][0])

which gives:

bli /tmp $ ./a.py  
2018-07-27 14:00:34 Boehmite Nanofiber-Reinforced Resorcinol-Formaldehyde Macroporous Monoliths for Heat/Flame Protection
2018-07-26 21:31:00 Theory of the reactant-stationary kinetics for zymogen activation coupled to  an enzyme catalyzed reaction
2018-07-26 16:57:19 Facile Synthesis of a Diverse Library of Mono-3-substituted β-Cyclodextrin Analogues
2018-07-26 14:00:22 Computationally-Inspired Discovery of an Unsymmetrical Porous Organic Cage
2018-07-26 13:57:17 Unzipping Natural Products: Improved Natural Product Structure Predictions by Ensemble Modeling and Fingerprint Matching
2018-07-25 18:45:57 Air Quality in Puerto Rico in the Aftermath of Hurricane Maria: A Case Study on the Use of Lower-Cost Air Quality Monitors
2018-07-25 15:08:33 Magnetic Structure of UO2 and NpO2 by First-Principle Methods
2018-07-25 15:06:07 Tailing miniSOG: Structural Bases of the Complex Photophysics of a Flavin-Binding Singlet Oxygen Photosensitizing Protein
2018-07-25 14:31:52 On-Surface Radical Oligomerisation: A New Approach to STM Tip-Induced Reactions
2018-07-24 14:56:00 Hue Parameter Fluorescence Identification of Edible Oils with a Smartphone
Traceback (most recent call last):
  File "./a.py", line 14, in <module>
    for record in client.listRecords(metadataPrefix='oai_dc', set='portal_259', from_=f, until=u):
  File "/Users/fx/anaconda3/lib/python3.6/site-packages/oaipmh/client.py", line 365, in ResumptionListGenerator
    result, token = nextBatch(token)
  File "/Users/fx/anaconda3/lib/python3.6/site-packages/oaipmh/client.py", line 194, in nextBatch
    resumptionToken=token)
  File "/Users/fx/anaconda3/lib/python3.6/site-packages/oaipmh/client.py", line 308, in makeRequestErrorHandling
    raise getattr(error, code[0].upper() + code[1:] + 'Error')(msg)
oaipmh.error.NoRecordsMatchError: The result in an empty list.

With this from_/until specification, it should be reproducible for you. I hope you can reopen the bug.
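
Until the cause is found, a possible stopgap on the caller's side is to treat the error as "end of list" when it arrives after at least one record has already been received. The sketch below is an assumption-laden illustration, not a fix in pyoai: `tolerate_empty_tail`, `EmptyResultError` and `ten_then_error` are hypothetical names, and the stand-in exception replaces `oaipmh.error.NoRecordsMatchError` so that the example runs without network access.

```python
class EmptyResultError(Exception):
    """Stand-in for oaipmh.error.NoRecordsMatchError, so the sketch runs offline."""


def tolerate_empty_tail(records, empty_error=EmptyResultError):
    """Yield from `records`; swallow `empty_error` only after some records were seen."""
    seen = 0
    try:
        for record in records:
            seen += 1
            yield record
    except empty_error:
        if seen == 0:
            raise  # genuinely empty result: let the caller see the error


def ten_then_error():
    """Mimic the failing run: ten records, then the 'empty list' error."""
    yield from range(10)
    raise EmptyResultError("The result in an empty list.")


harvested = list(tolerate_empty_tail(ten_then_error()))
print(len(harvested))
```

With pyoai, the same wrapper would presumably be applied as `tolerate_empty_tail(client.listRecords(...), NoRecordsMatchError)`, with `NoRecordsMatchError` imported from `oaipmh.error` (the name shown in the traceback). This hides the symptom rather than fixing the pagination, so a real error in the middle of a harvest would also be swallowed.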