infrae / pyoai

The oaipmh module is a Python implementation of an "Open Archives Initiative Protocol for Metadata Harvesting"
http://pypi.python.org/pypi/pyoai

Loop over all records? #57

Open PBrockmann opened 1 year ago

PBrockmann commented 1 year ago

How can I loop over all records? My script stops after 50 records. How can I also get the total number of records?

from oaipmh.client import Client
from oaipmh.metadata import MetadataRegistry, oai_dc_reader

URL = 'http://ws.pangaea.de/oai/provider?set=project4173'

registry = MetadataRegistry()
registry.registerReader('oai_dc', oai_dc_reader)
client = Client(URL, registry)

record = client.listRecords(metadataPrefix='oai_dc')

for record in client.listRecords(metadataPrefix='oai_dc'):
    print(record)

There are 1501 records in the project.

$ oai-harvest --limit 10000 -p dif --set project4173 http://ws.pangaea.de/oai/provider

This command correctly harvests all metadata from the 1501 records. I would like to do the same from oaipmh so that I can reformat the records and save them to a JSON file.

Any help is welcome.
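
For the JSON step, a minimal sketch follows. It assumes each item yielded by listRecords is a (header, metadata, about) tuple whose header has an identifier() method and whose metadata has a getMap() method, which is how pyoai records behave as far as I know. The helper names records_to_dicts and save_records are hypothetical, not part of pyoai.

```python
import json


def records_to_dicts(records):
    """Convert (header, metadata, about) tuples into plain dictionaries."""
    rows = []
    for header, metadata, _about in records:
        rows.append({
            "identifier": header.identifier(),
            # Assumption: a record without metadata (e.g. deleted) arrives as None.
            "metadata": metadata.getMap() if metadata is not None else None,
        })
    return rows


def save_records(records, path):
    """Write the records to a JSON file and return how many were written."""
    rows = records_to_dicts(records)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(rows, handle, ensure_ascii=False, indent=2)
    return len(rows)
```

Calling save_records(client.listRecords(metadataPrefix='oai_dc'), 'records.json') would then also give the record count as its return value, provided the harvest itself runs to completion.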

davidroncero commented 1 year ago

Hi there,

I am not an expert on pyoai; I only started using it today, but here are some ideas to explore.

First, you are calling the listRecords function twice:

from oaipmh.client import Client
from oaipmh.metadata import MetadataRegistry, oai_dc_reader

URL = 'http://ws.pangaea.de/oai/provider?set=project4173'

registry = MetadataRegistry()
registry.registerReader('oai_dc', oai_dc_reader)
client = Client(URL, registry)

record = client.listRecords(metadataPrefix='oai_dc') # <----- first call

for record in client.listRecords(metadataPrefix='oai_dc'): # <----- second call
    print(record)

Even if you remove the first call, you still receive only 50 records. Running your code then raises this error:

oaipmh.error.BadArgumentError: You cannot use other request parameters when a resumptionToken is given.

As detailed in the documentation, the resumptionToken argument of the ListRecords verb is:

resumptionToken an exclusive argument with a value that is the flow control token returned by a previous ListRecords request that issued an incomplete list.

Maybe there is something wrong on the server.

The OAI-PMH documentation discusses flow control:

https://www.openarchives.org/OAI/openarchivesprotocol.html#FlowControl
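
The flow-control rule quoted above can be sketched with the standard library alone. This is an illustration under stated assumptions, not pyoai's implementation: iter_records, http_fetch and the fetch parameter are hypothetical names. The first request carries metadataPrefix and set; every follow-up request carries only the verb and the resumptionToken. The script above puts ?set=project4173 into the base URL, which might cause set to be resent alongside the token and trigger the error, but that is a guess and not something confirmed in this thread.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def http_fetch(base_url, params):
    """Send one OAI-PMH request and return the raw XML response."""
    with urlopen(base_url + "?" + urlencode(params)) as response:
        return response.read()


def iter_records(base_url, metadata_prefix, set_spec=None, fetch=http_fetch):
    """Yield every <record> element, following resumption tokens."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec
    while True:
        root = ET.fromstring(fetch(base_url, params))
        yield from root.iter(OAI_NS + "record")
        token = root.find(f"{OAI_NS}ListRecords/{OAI_NS}resumptionToken")
        if token is None or not (token.text or "").strip():
            break  # no token, or an empty one: the list is complete
        # resumptionToken is exclusive: send it with the verb and nothing else.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

With a bare base URL such as 'http://ws.pangaea.de/oai/provider' and set_spec='project4173', sum(1 for _ in iter_records(...)) would give the record count. The protocol also allows an optional completeListSize attribute on the resumptionToken element, but servers are not required to send it.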

Hope this helps.

wetneb commented 1 year ago

Side comment: this library has a number of deficiencies that could only be fixed by introducing breaking changes. I would recommend using other libraries (actively developed and maintained) if you are writing new code. Maybe it would be useful to have a note about that in the README.

PBrockmann commented 1 year ago

Thanks for answering. Indeed, the first call to listRecords can be removed. As you wrote, I get the same error anyway.

I will consider either using the oai-harvest command, which works nicely, as a first step in my processing, or switching to another API.

davidroncero commented 1 year ago

@wetneb, do you know of any other Python libraries you could recommend?

Thanks in advance.

wetneb commented 1 year ago

I am not sure! I thought there was an actively maintained one but all I can find is sickle, which seems a bit dormant but might still be a much better bet than pyoai. I haven't tried it myself though.

davidroncero commented 1 year ago

I'll take a look at that one.

Thank you very much.