te-papa / collections-api

Museum of New Zealand Te Papa Tongarewa - Collections API

Search strategies: Bulk data retrieval #8

Open fkleon opened 5 years ago

fkleon commented 5 years ago

Bulk data retrieval

It's possible to efficiently retrieve a large number of records, or even all records, through the scroll API. The implementation uses the Elasticsearch scroll API under the hood, which the documentation refers to.

The same concepts apply, most importantly:

The Te Papa Collections API imposes some additional restrictions to avoid too much strain on the cluster:

A scroll is requested through any of the _scroll APIs. For example, by using /objects/_scroll the results are pre-filtered to only contain collection objects. To retrieve all object types, use the /search/_scroll API.

A scroll is opened with a POST request to a _scroll API, for example:

curl -XPOST -G \
  -H 'x-api-key: KEY' \
  'https://data.tepapa.govt.nz/collection/search/_scroll' \
  --data-urlencode 'duration=1' \
  --data-urlencode 'size=1'

This requests a scroll that is kept open for 1 minute (duration) and contains 1 result per page (size). The result looks like an ordinary search result with one addition: the _metadata.query.scrollId field, which contains the unique scroll ID.

The next page of the scroll can then be retrieved through the GET scroll API. An API-root relative link to the next page is included in the Location header of the initial scroll response, or can be built from the scroll ID in the response body. Example:

curl -G \
  -H 'x-api-key: KEY' \
  'https://data.tepapa.govt.nz/collection/scroll/<SCROLL-ID>' \
  --data-urlencode 'duration=1'
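
For anyone scripting this, here is a minimal Python sketch of the two ways to get the next-page URL described above. The helper name next_page_url is made up for illustration, and it assumes that the API root is https://data.tepapa.govt.nz/collection (taken from the curl examples) and that the Location value looks like /scroll/<SCROLL-ID>. I haven't verified the exact shape of the header, so treat that part as a guess.

```python
from urllib.parse import quote, urlencode

# Assumption: API root as used in the curl examples in this thread.
API_ROOT = "https://data.tepapa.govt.nz/collection"


def next_page_url(scroll_id=None, location=None, duration=1):
    """Return the GET URL for the next scroll page (hypothetical helper).

    If a Location header value is given, it is treated as an API-root
    relative link and used unchanged. Otherwise the URL is built from
    the scroll ID plus a duration parameter.
    """
    if location is not None:
        return API_ROOT + "/" + location.lstrip("/")
    if scroll_id is None:
        raise ValueError("need either a scroll ID or a Location value")
    # Percent-encode the ID so characters such as '=' or '/' stay intact.
    path = "/scroll/" + quote(scroll_id, safe="")
    return API_ROOT + path + "?" + urlencode({"duration": duration})
```

Building from the scroll ID is the safer choice if you are unsure what the Location header contains, since the ID is always in the response body.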

Once the scroll is exhausted the GET scroll API returns an HTTP 204 No Content response.
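
Putting the steps together, a rough sketch of a full scroll loop in Python (standard library only) could look like the following. The function names http_fetch and scroll_all are hypothetical, and the code assumes that every page carries _metadata.query.scrollId as described above and that an exhausted scroll answers with 204 and an empty body. It picks up the scroll ID from each page in case it changes between pages, which is an assumption carried over from how Elasticsearch scrolls behave, not something confirmed for this API.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

# Assumption: API root as used in the curl examples in this thread.
API_ROOT = "https://data.tepapa.govt.nz/collection"


def http_fetch(method, url, api_key):
    """Send one request and return (status, parsed JSON body or None)."""
    request = Request(url, method=method, headers={"x-api-key": api_key})
    with urlopen(request) as response:
        raw = response.read()
        return response.status, (json.loads(raw) if raw else None)


def scroll_all(api_key, size, duration=1, endpoint="/search/_scroll",
               fetch=http_fetch):
    """Yield scroll pages until the API reports 204 No Content."""
    params = urlencode({"size": size, "duration": duration})
    # Open the scroll with a POST to a _scroll endpoint.
    status, page = fetch("POST", f"{API_ROOT}{endpoint}?{params}", api_key)
    scroll_id = None
    while status != 204 and page:
        yield page
        # Use the most recent scroll ID seen (assumed to be on every page).
        scroll_id = page.get("_metadata", {}).get("query", {}).get(
            "scrollId", scroll_id)
        if scroll_id is None:
            break
        next_url = (f"{API_ROOT}/scroll/{quote(scroll_id, safe='')}"
                    f"?{urlencode({'duration': duration})}")
        # Fetch the next page through the GET scroll API.
        status, page = fetch("GET", next_url, api_key)
```

Usage would be something like `for page in scroll_all("KEY", size=100): ...`, where the size value is just a placeholder. The fetch parameter exists so the loop can be exercised with a stub instead of hitting the live API.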

An arbitrary search request can be used to control which records are included in a scroll result. For example, to only retrieve objects that have been modified recently, a date range query can be added to the initial request:

curl -XPOST -G \
  -H 'x-api-key: KEY' \
  'https://data.tepapa.govt.nz/collection/search/_scroll' \
  --data-urlencode 'q=_meta.modified:[2018-09-06 TO *]' \
  --data-urlencode 'size=1'
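
If the cut-off date is computed in a script, the same initial request URL can be assembled in Python. This is only a sketch: modified_since_url is a hypothetical helper, and the q syntax is copied from the curl example above rather than from the API reference.

```python
from datetime import date
from urllib.parse import urlencode

# Assumption: API root as used in the curl examples in this thread.
API_ROOT = "https://data.tepapa.govt.nz/collection"


def modified_since_url(since, size):
    """Build the initial scroll URL for records modified on or after `since`."""
    # Open-ended date range on _meta.modified, as in the curl example.
    query = f"_meta.modified:[{since.isoformat()} TO *]"
    # urlencode takes care of escaping spaces, brackets and the asterisk.
    return f"{API_ROOT}/search/_scroll?" + urlencode({"q": query, "size": size})


url = modified_since_url(date(2018, 9, 6), size=1)
```

The resulting URL is then used for the POST that opens the scroll, with the x-api-key header set as in the first example.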

The relevant API documentation is here:

staplegun commented 5 years ago

There is also a section in Getting Started - https://github.com/te-papa/collections-api/wiki/Getting-started#scrolling

fkleon commented 5 years ago

Great, I've somehow missed that. Feel free to merge useful bits into that one.