openstate / open-cultuur-data

The back- and front-end code that powers the Open Cultuur Data API
http://opencultuurdata.nl/
28 stars 18 forks source link

OAI-PMH implementation question #18

Closed coret closed 10 years ago

coret commented 10 years ago

Any specific reason why (in extractor/oai.py) the OAI harvester is implemented like

verb=ListIdentifiers
   foreach identifier verb=GetRecord(identifier)  [= 1+x requests]

instead of (in terms of requests more efficient)

verb=ListRecords  [= 1 bigger request]
breyten commented 10 years ago

It's not necessarily a given that ListIdentifiers will return the same metadata as the GetRecord verb :)

coret commented 10 years ago

I'm not sure we're on the same frequency...

If you do a ListIdentifier and the subsequent GetRecords request you get the same metadata as you would with ListRecords (but then in one request you get all of the records), the latter will however need less HTTP requests.

justinvw commented 10 years ago

Hi Bob,

Yes, there is a reason for performing an individual request for every object that is encountered in the ListIdentifiers response. One of our aims is to store the object's original representation along the modifications that are made within transformation process. Please see the documentation of the source REST endpoint for an example.

If we would only use the ListRecords verb and yield every <record/>, we would mis important information to parse the data when requesting it via the REST API. For example, the root element that defines the namespaces would be missing.

I will add a comment to the extractor that explains this :wink:.

coret commented 10 years ago

Hi Justin,

By comparing a GetRecord (http://api.openarch.nl/oai-pmh/?verb=GetRecord&metadataPrefix=oai_a2a&identifier=elo:000002e0-e965-e554-0ffd-fae16753075a) with a ListRecords (http://api.openarch.nl/oai-pmh/?verb=ListRecords&metadataPrefix=oai_a2a) the difference in root element can indeed be noticed. But, as OAI is like an enveloppe with data in it (the record/) you could argue that storing the record/ is enough. We also don't store the HTTP headers (also like an enveloppe) of requests.

But my main worry is performance. Open Archives currently has 6M items, harvesting these at a rate of 20 items per second will make the total harvesting time about 83 hours...

My feeling is that the ListRecords method is faster (less HTTP overhead). I'll adjust to oai extractor on my machine to see if the ListRecords method is faster indeed or not ...

coret commented 10 years ago

See my blogposting What's the best way to harvest an OAI-PMH data provider?

breyten commented 10 years ago

Nice work!

But nobody was contesting that ListRecords wouldn't be faster ;)

The problem isn't that the XML isn't different, but that the extractor spits out text. In the case of ListRecords it would spit out partial XML (And thus leaving out the namespace information).

A patch would be very welcome, but it would need to handle this in a clean way.

coret commented 10 years ago

It's a little bit off-topic, but I checked the HTTP headers of requests being made (in oai.py that is) and it shows: {'Accept-Encoding': 'gzip, deflate, compress', 'Accept': '*/*', 'User-Agent': 'python-requests/2.2.1 CPython/2.7.3 Linux/3.2.0-4-amd64'} So, compression is already used by default (of course, only if the data provider supports compression). Great!

We could add the {"Connection": "keep-alive"} to the set of headers. But I'm not sure if this would require other changes to the code.

We should however add a proper user-agent to identify our OAI extractor, like "User-Agent": "OpenCultuurData-OAI-extractor/0.1"

coret commented 10 years ago

I think the following modification of oai.py would suffice to switch from ListIdentifiers/GetRecord to a ListRecords implementation:

       - req_params = {'verb': 'ListIdentifiers'}  
       + req_params = {'verb': 'ListRecords'}

       - record_ids = tree.xpath('//oai:header/oai:identifier/text()',
       -                         namespaces=self.namespaces)
       - for record_id in record_ids:
       -     record = self.oai_call({
       -         'verb': 'GetRecord',
       -         'identifier': record_id,
       -         'metadataPrefix': self.metadata_prefix
       -     })
       -     yield 'application/xml', record
       + records = tree.findall('.//oai:ListRecords/oai:record', 
       +              namespaces=self.namespaces)
       + if records is not None:
       +     for record in records:
       +         yield 'application/xml', etree.tostring(record)

I've tested this change: the openbeelden and openarchieven items do not require any change! Data is extracted, transformed and loaded as usual. But please take a thorough look at the code, as I am no Python programmer ;-)

There is noticable one change in the ocd_openarchieven (and ocd_openbeeld) index, in the 'original' data field.

Before (record contents not shown, as they are the same before and after):

<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"> <responseDate>2014-05-26T15:35:45Z</responseDate> <request verb="GetRecord" metadataPrefix="oai_a2a" identifier="ec474a4c-5c4c-08d6-47b1-ee5ff08e7bfb"> http://api.openarch.nl/oai-pmh/</request> <GetRecord> <record> ... </record> </GetRecord> </OAI-PMH>

After:

<record xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> ... </record>

So the OAI-PMH envelop is not present. But I do not think this is a problem. The important point is that the data (the <record/>) can still be parsed as an item.

justinvw commented 10 years ago

Thanks for your suggestions, Bob!

Using HTTP Keep-Alive is a nice improvement that we can use across all extractors. I created a simple mixin (see https://github.com/openstate/open-cultuur-data/commit/a0a1156927ea4eeecabdc45115ece00a713d1e1b) that instantiates a 'session' through which all HTTP calls are now made. The Requests module/urllib3 takes care of the heavy lifting (connection pooling, keeping the connections alive, etc.). We now also properly identify ourselves via the User-Agent.

Regarding the use of ListRecords versus a combination of ListIdentifiers and GetRecord: I wasn't aware of the fact that lxml would automatically include the relevant namespaces in the new root element when calling etree.tostring on a single element in the tree. Nice! As you mentioned, it's important that you're able to parse the original item, which is the case with the approach you suggested. With https://github.com/openstate/open-cultuur-data/commit/681e1efd54b9fe64dfa9c7a1c58c5bb021d071cc we switched to using ListRecords.

I'm not sure if it is that much faster (at least for OpenBeelden), but theoretically it should be...