pvretano opened this issue 4 years ago
We will want to consider a resourcetype parameter, or a link relation type, that provides additional information about what is being harvested, to aid in identifying a given resource.
For example, here's the approach we have used in pycsw: https://github.com/geopython/pycsw/blob/master/pycsw/core/metadata.py#L48.
2020-09-15 OGC Member Meeting.
There are some use cases that are missing. Happy to provide information about the use cases. @uvoges
OAI-PMH is another approach that has been successfully used. @chris-little
OAI-PMH is outdated but was easy to implement. We should align these use cases. @uvoges
Worth having an extension (further Part) on harvesting. Jari
Harvesting and transactions should be two separate extensions of OGC API - Records. @uvoges
Related: Issue #50
A couple of thoughts on this:
'The harvest operation will result in N catalogue records being created' - the harvest operation only returns existing records, right? It doesn't 'create' anything.
In general, we approach harvesting as a combination of a list of items and a way to iterate through that list.
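For example, here is a minimal sketch of that pattern in Python, assuming an OGC API - Records style items endpoint with limit/startindex paging (the endpoint and parameter names are assumptions, not from the spec):

```python
import requests

def iterate_records(items_url, page_size=100):
    """Yield every record from a paged items endpoint (illustrative sketch)."""
    startindex = 0
    while True:
        resp = requests.get(items_url, params={"limit": page_size, "startindex": startindex})
        resp.raise_for_status()
        features = resp.json().get("features", [])
        if not features:
            break
        yield from features
        startindex += len(features)

# e.g. for rec in iterate_records("https://example.org/collections/mycat/items"): ...
```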
I agree on the MIME type issue. We also parse links we find to understand them. Many times the metadata cannot be trusted: something is claimed to be a WMS while the link is a GetMap request, and the resource is thus no more than a fancy link to a PNG image.
I'm not sure I understand the discussion on harvest identifiers or periodic reharvest. Our approach has always been that the server does not maintain knowledge of who harvested it or when; it is up to the client to maintain that information, and we do this in Geoportal Server. There we do know that records were harvested during some harvest job with an ID. Perhaps the server mentioned in this section is not the same as the server that was harvested?
In Geoportal Server Harvester we support both intervals and specific times for reharvesting. I would not call this 'triggering' a harvest, but 'scheduling' a harvest; the individual harvest jobs get triggered based on the schedule.
The DELETE of documents related to a harvest ID seems OK. However, what if the same catalog is reharvested periodically? Do the individual harvests get their own IDs? If so, how do you delete all items harvested from a specific catalog across multiple jobs? Perhaps there should be a way to delete everything harvested from a specified source?
An important aspect of harvesting/imports is deduplication and referencing the canonical (point-of-truth) URL of the harvested item. The same resource is potentially harvested via various routes (local, regional, national, global). It would be interesting to have, on the /collections/{cat}/items and /collections/{cat}/items/{id} endpoints, some harvest-facilitating properties in case a record has been imported: the canonical URL of the item, the (last) harvested date, the encoding/schema in which it was harvested, and maybe a hash of the original document.
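For instance, an imported record's representation could carry something like this (the property names are made up for illustration; nothing like them is in the spec today):

```json
{
  "id": "rec123",
  "properties": {
    "title": "Some harvested dataset",
    "canonical": "https://national.example.org/collections/md/items/rec123",
    "harvested": "2021-03-01T09:30:00Z",
    "harvestSchema": "http://www.isotc211.org/2005/gmd",
    "sourceHash": "sha256:9f2c..."
  }
}
```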
GeoNetwork's approach to harvesting seems similar to that of Geoportal: we schedule harvests to run periodically, and only remotely updated resources are imported again.
To retrieve the list of record IDs to harvest, we usually allow people to filter the records; the usual item filter parameters could apply. After retrieving the initial list, it would be helpful to retrieve records in batches of 50/100 with a /records/{cat}/items?id=[12,17,23,45] operation (or CQL).
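As a sketch of that batching step, a client could chunk the id list and fetch each batch, assuming an id parameter that accepts a comma-separated list (illustrative only):

```python
import requests

def fetch_in_batches(items_url, ids, batch_size=100):
    """Fetch records in batches via a (hypothetical) multi-valued id parameter."""
    for i in range(0, len(ids), batch_size):
        batch = ids[i:i + batch_size]
        resp = requests.get(items_url, params={"id": ",".join(map(str, batch))})
        resp.raise_for_status()
        yield from resp.json().get("features", [])
```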
For oapir-to-oapir harvesting, consider recommending the sitemap specification to facilitate the 'initial list of record IDs' case. Many catalogue implementations already support it. The sitemap specification holds, for each record, the URL and last update date, and supports pagination for larger catalogues.
From the OGC sprint today, a typical use case on harvesting is this one: request all records which have changed since the last harvest. This is easy by filtering on last modification date, but it is impossible for resources which have been removed, unless the API provides a mechanism to retrieve removed items. One way to 'solve' the above is to provide a sitemap.xml with a listing of all the record URLs, without providing the full records, just to evaluate which ones are removed.
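As a sketch, a harvester could diff such a sitemap against its local state to find deletions, assuming a standard sitemap.xml with one <loc> per record:

```python
import requests
import xml.etree.ElementTree as ET

SM_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def removed_records(sitemap_url, locally_known_urls):
    """Return locally known record URLs that no longer appear in the remote sitemap."""
    root = ET.fromstring(requests.get(sitemap_url).content)
    remote = {loc.text for loc in root.iter(SM_NS + "loc")}
    return set(locally_known_urls) - remote
```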
The question we had yesterday is of interest: which of /collections/xxx or /collections/cat/items/xxx would be the canonical URL within the server (typically harvesters only harvest canonical things)? In this scenario a dataset, xxx, disseminated as a collection via OGC API - Features, could also be available as a record item in an OGC API - Records collection.
This is the role the DCAT file plays in the open data space.
Thanks @mhogeweg. Do you have an example of such a DCAT file? I know DCAT only as a metadata model.
Here is one from the Africa Geoportal (ArcGIS Hub): https://www.africageoportal.com/data.json and one from the US EPA: https://edg.epa.gov/data.json
In our Geoportal Server metadata catalog, we generate such a file automatically (as a cached version, since the catalogs can become quite large), and it is also one of the available output formats of the search API.
For example: https://gpt.geocloud.com/geoportal2/opensearch?f=dcat&from=1&size=10&sort=title.sort%3Aasc&esdsl=%7B%7D
It seems to be a full dump of the database in RDF (JSON-LD); also an interesting feature considering harvesting. Indeed, it is important to cache it at intervals; some products will take minutes to generate such a file.
For the use case of identifying removed records since the last harvest, I only need a list of record identifiers. Sitemap.xml would be an interesting candidate, also because of its wide adoption, but a JSON index file at the root of a collection would also be fine.
Search engine crawlers expect only a single sitemap on a domain (so not in subfolders), but you can link to multiple decentralized sitemaps from a central sitemap index; see https://www.sitemaps.org/protocol.html#index
We have had a sitemap in Geoportal Server for many years. It does what you describe: https://gpt.geocloud.com/geoportal/sitemap?f=sitemap
@tomkralidis asked me about harvesting and this is what I sent him. I don't think harvesting would be part of the core, but I am creating an issue anyway to stimulate discussion.
HARVEST
This is an example of harvesting one or more resources. You basically do a POST on a harvest endpoint. The harvest operation will result in N catalogue records being created.
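A minimal sketch of such a request, assuming a /harvest endpoint that takes the resource URL in a JSON body (the endpoint and payload shape are my assumptions, not from the spec):

```python
import requests

resp = requests.post(
    "https://example.org/harvest",
    json={"resource": "https://example.org/ows?service=WMS&request=GetCapabilities"},
)
resp.raise_for_status()
# the server might point at the new harvest resource, e.g. via a Location header
print(resp.headers.get("Location"))
```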
Generally, the kinds of resource that I harvest in my catalogue (OGC landing pages, OGC capabilities documents, ISO metadata documents, etc.) do not have specific MIME types other than the general MIME type of the representation (e.g. text/xml, application/json, etc.). So, my server sniffs the resource to see if it can recognize it. It would be a bit easier if OGC had specific MIME types defined for some of these resources instead of just generic MIME types (old discussion!).
To trigger periodic re-harvesting of the resources, the client can append the "harvestInterval" parameter to the harvest URL. Its value is an ISO 8601 period (e.g. ...&harvestInterval=P2W&...).
To trigger asynchronous processing, the client can append a "responseHandler" parameter to the harvest endpoint URL. See: https://docs.opengeospatial.org/per/18-045.html#async_extension.
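Putting the two parameters together, a harvest request might look like this (same assumed endpoint as above; the responseHandler value is purely illustrative):

```python
import requests

resp = requests.post(
    "https://example.org/harvest",
    params={
        "harvestInterval": "P2W",                         # re-harvest every two weeks
        "responseHandler": "mailto:metadata@example.org", # notify asynchronously
    },
    json={"resource": "https://example.org/md/dataset.xml"},
)
```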
The harvest identifier (hId) is an identifier assigned by the server so that a subsequent DELETE can be used to do an unharvest.
GET THE LIST OF HARVEST IDENTIFIERS
To get the list of harvest identifiers, you can do a GET on the harvest endpoint. A list of links to each harvest resource is returned.
I am not sure what the rel should be in this case. Perhaps something like "ogc:harvest". Dunno. Maybe we don't need one since this is just a list of harvested resources.
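The response could be as simple as a links array, something like this (using the "ogc:harvest" rel floated above; all values illustrative):

```json
{
  "links": [
    {"rel": "ogc:harvest", "href": "https://example.org/harvest/hid57", "title": "Harvest hid57"},
    {"rel": "ogc:harvest", "href": "https://example.org/harvest/hid58", "title": "Harvest hid58"}
  ]
}
```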
Doing a GET on a specific harvest resource URL will return details about that resource including links to the catalogue records created as a result of the harvest.
Again, I am not sure about the appropriate rel. I show "related" but maybe something like "ogc:record" would be more appropriate or maybe a rel is not required at all.
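So a harvest resource might look something like this (the rel and property names are placeholders, not settled):

```json
{
  "id": "hid57",
  "resource": "https://example.org/md/dataset.xml",
  "harvestInterval": "P2W",
  "links": [
    {"rel": "related", "href": "https://example.org/collections/mycat/items/rec123"},
    {"rel": "related", "href": "https://example.org/collections/mycat/items/rec124"}
  ]
}
```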
UNHARVEST
To unharvest a previous harvest, you simply use DELETE on the harvest resource URL. The server would delete all the catalogue records created by the harvest, along with the harvest resource itself, so that a subsequent GET on the harvest endpoint would no longer list hid57 (in my example).
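In other words, sticking with the hid57 example and the assumed endpoint:

```python
import requests

# Deleting the harvest resource also removes the catalogue records it created.
requests.delete("https://example.org/harvest/hid57").raise_for_status()
```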
So, this is my thinking about harvesting so far. I have started implementing some of this to see how it flies!
Comments welcome!