pvretano opened this issue 4 years ago
We will want to consider a resourcetype parameter, or a link relation type, that provides additional information about what is being harvested, to aid in identifying a given resource.
For example, here's the approach we have used in pycsw: https://github.com/geopython/pycsw/blob/master/pycsw/core/metadata.py#L48.
2020-09-15 OGC Member Meeting.
There are some use cases that are missing. Happy to provide information about the use cases. @uvoges
OAI-PMH is another approach that has been successfully used. @chris-little
OAI-PMH is outdated but was easy to implement. We should align these use cases. @uvoges
Worth having an extension (further Part) on harvesting. Jari
Harvesting and transactions should be two separate extensions of OGC API - Records. @uvoges
Related: Issue #50
A couple of thoughts on this:
'The harvest operation will result in N catalogue records being created' - the harvest operation only returns existing records, right? It doesn't 'create' anything.
In general, we approach harvesting as a combination of a list of items and a way to iterate through that list.
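For example, here is a minimal sketch of that pattern in Python, assuming an OGC API - Records style items endpoint with limit/startindex paging (the endpoint and parameter names are assumptions, not from the spec):

```python
import requests

def iterate_records(items_url, page_size=100):
    """Yield every record from a paged items endpoint (illustrative sketch)."""
    startindex = 0
    while True:
        resp = requests.get(items_url, params={"limit": page_size, "startindex": startindex})
        resp.raise_for_status()
        features = resp.json().get("features", [])
        if not features:
            break
        yield from features
        startindex += len(features)

# e.g. for rec in iterate_records("https://example.org/collections/mycat/items"): ...
```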
I agree on the MIME type issue. We also parse links we find to understand them. Many times the metadata cannot be trusted: something is claimed to be a WMS while the link is a GetMap request, and the resource is thus no more than a fancy link to a PNG image.
I'm not sure I understand the discussion on harvest identifiers or periodic reharvest. Our approach has always been that the server does not maintain knowledge of who harvested it or when; it is up to the client to maintain that information, and we do this in Geoportal Server. There we do know that records were harvested during some harvest job with an ID. Perhaps the server mentioned in this section is not the same as the server that was harvested?
In Geoportal Server Harvester we support both intervals and specific times for reharvesting. I would not call this 'triggering' a harvest, but 'scheduling' a harvest; the individual harvest jobs get triggered based on the schedule.
The DELETE of documents related to a harvest ID seems OK. However, what if the same catalog is reharvested periodically? Do the individual harvests get their own IDs? If so, how do you delete all items harvested from a specific catalog across multiple jobs? Perhaps there should be a way to delete everything harvested from a specified source?
An important aspect of harvesting/imports is deduplication and referencing the canonical (point-of-truth) URL of the harvested item. The same resource is potentially harvested via various routes (local, regional, national, global). It would be interesting to have, on the /collections/{cat}/items and /collections/{cat}/items/{id} endpoints, some harvest-facilitating properties in case a record has been imported: the canonical URL of the item, the (last) harvested date, the encoding/schema in which it was harvested, and maybe a hash of the original document.
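For instance, an imported record's representation could carry something like this (the property names are made up for illustration; nothing like them is in the spec today):

```json
{
  "id": "rec123",
  "properties": {
    "title": "Some harvested dataset",
    "canonical": "https://national.example.org/collections/md/items/rec123",
    "harvested": "2021-03-01T09:30:00Z",
    "harvestSchema": "http://www.isotc211.org/2005/gmd",
    "sourceHash": "sha256:9f2c..."
  }
}
```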
GeoNetwork's approach to harvesting seems similar to that of Geoportal: we schedule harvests to run periodically, and only remotely updated resources are imported again.
To retrieve the list of record IDs to harvest, we usually allow people to filter the records; the usual item filter parameters could apply. After retrieving the initial list, it would be helpful to retrieve records in batches of 50/100 with a /records/{cat}/items?id=[12,17,23,45] operation (or CQL).
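As a sketch of that batching step, a client could chunk the id list and fetch each batch, assuming an id parameter that accepts a comma-separated list (illustrative only):

```python
import requests

def fetch_in_batches(items_url, ids, batch_size=100):
    """Fetch records in batches via a (hypothetical) multi-valued id parameter."""
    for i in range(0, len(ids), batch_size):
        batch = ids[i:i + batch_size]
        resp = requests.get(items_url, params={"id": ",".join(map(str, batch))})
        resp.raise_for_status()
        yield from resp.json().get("features", [])
```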
For oapir-to-oapir harvesting, consider recommending the sitemap specification to facilitate the 'initial list of record IDs' case. Many catalogue implementations already support it. The sitemap specification holds, for each record, the URL and last update date, and supports pagination for larger catalogues.
From the OGC sprint today, a typical use case on harvesting is this one: request all records which have changed since the last harvest. This is easy by filtering on last modification date, but it is impossible for resources which have been removed, unless the API provides a mechanism to retrieve removed items. One way to 'solve' the above is to provide a sitemap.xml with a listing of all the record URLs, without providing the full records, just to evaluate which ones are removed.
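As a sketch, a harvester could diff such a sitemap against its local state to find deletions, assuming a standard sitemap.xml with one <loc> per record:

```python
import requests
import xml.etree.ElementTree as ET

SM_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def removed_records(sitemap_url, locally_known_urls):
    """Return locally known record URLs that no longer appear in the remote sitemap."""
    root = ET.fromstring(requests.get(sitemap_url).content)
    remote = {loc.text for loc in root.iter(SM_NS + "loc")}
    return set(locally_known_urls) - remote
```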
The question we had yesterday is of interest: which of /collections/xxx or /collections/cat/items/xxx would be the canonical URL within the server (typically harvesters only harvest canonical things)? In this scenario a dataset, xxx, disseminated as a collection via OGC API - Features, could also be available as a record item in an OGC API - Records collection.
This is the role the DCAT file plays in the open data space.
Thanks @mhogeweg. Do you have an example of such a DCAT file? I know DCAT only as a metadata model.
Here is one from the Africa Geoportal (ArcGIS Hub): https://www.africageoportal.com/data.json and one from the US EPA: https://edg.epa.gov/data.json
In our Geoportal Server metadata catalog, we generate such a file automatically (as a cached version, since the catalogs can become quite large), and it is also one of the available output formats of the search API.
For example: https://gpt.geocloud.com/geoportal2/opensearch?f=dcat&from=1&size=10&sort=title.sort%3Aasc&esdsl=%7B%7D
It seems to be a full dump of the database in RDF (JSON-LD); also an interesting feature considering harvesting. Indeed, it is important to cache it at intervals; some products will take minutes to generate such a file.
For the use case of identifying removed records since the last harvest, I only need a list of record identifiers. Sitemap.xml would be an interesting candidate, also because of its wide adoption, but a JSON index file at the root of a collection would also be fine.
Search engine crawlers expect only a single sitemap on a domain (so not in subfolders), but you can link to multiple decentralized sitemaps from a central sitemap index; see https://www.sitemaps.org/protocol.html#index
We have had a sitemap in Geoportal Server for many years. It does what you describe: https://gpt.geocloud.com/geoportal/sitemap?f=sitemap
@tomkralidis asked me about harvesting and this is what I sent him. I don't think harvesting would be part of the core, but I am creating an issue anyway to stimulate discussion.
HARVEST
This is an example of harvesting one or more resources. You basically do a POST on a harvest endpoint. The harvest operation will result in N catalogue records being created.
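A minimal sketch of such a request, assuming a /harvest endpoint that takes the resource URL in a JSON body (the endpoint and payload shape are my assumptions, not from the spec):

```python
import requests

resp = requests.post(
    "https://example.org/harvest",
    json={"resource": "https://example.org/ows?service=WMS&request=GetCapabilities"},
)
resp.raise_for_status()
# the server might point at the new harvest resource, e.g. via a Location header
print(resp.headers.get("Location"))
```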
Generally, the kinds of resource that I harvest in my catalogue (OGC landing pages, OGC capabilities documents, ISO metadata documents, etc.) do not have specific MIME types other than the general MIME type of the representation (e.g. text/xml, application/json, etc.). So, my server sniffs the resource to see if it can recognize it. It would be a bit easier if OGC had specific MIME types defined for some of these resources instead of just generic MIME types (old discussion!).
To trigger periodic re-harvesting of the resources, the client can append the "harvestInterval" parameter to the harvest URL. Its value is an ISO 8601 period (e.g. ...&harvestInterval=P2W&...).
To trigger asynchronous processing, the client can append a "responseHandler" parameter to the harvest endpoint URL. See: https://docs.opengeospatial.org/per/18-045.html#async_extension.
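Putting the two parameters together, a harvest request might look like this (same assumed endpoint as above; the responseHandler value is purely illustrative):

```python
import requests

resp = requests.post(
    "https://example.org/harvest",
    params={
        "harvestInterval": "P2W",                         # re-harvest every two weeks
        "responseHandler": "mailto:metadata@example.org", # notify asynchronously
    },
    json={"resource": "https://example.org/md/dataset.xml"},
)
```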
The harvest identifier (hId) is an identifier assigned by the server so that a subsequent DELETE can be used to do an unharvest.
GET THE LIST OF HARVEST IDENTIFIERS
To get the list of harvest identifiers, you can do a GET on the harvest endpoint. A list of links to each harvest resource is returned.
I am not sure what the rel should be in this case. Perhaps something like "ogc:harvest". Dunno. Maybe we don't need one since this is just a list of harvested resources.
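The response could be as simple as a links array, something like this (using the "ogc:harvest" rel floated above; all values illustrative):

```json
{
  "links": [
    {"rel": "ogc:harvest", "href": "https://example.org/harvest/hid57", "title": "Harvest hid57"},
    {"rel": "ogc:harvest", "href": "https://example.org/harvest/hid58", "title": "Harvest hid58"}
  ]
}
```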
Doing a GET on a specific harvest resource URL will return details about that resource including links to the catalogue records created as a result of the harvest.
Again, I am not sure about the appropriate rel. I show "related" but maybe something like "ogc:record" would be more appropriate or maybe a rel is not required at all.
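So a harvest resource might look something like this (the rel and property names are placeholders, not settled):

```json
{
  "id": "hid57",
  "resource": "https://example.org/md/dataset.xml",
  "harvestInterval": "P2W",
  "links": [
    {"rel": "related", "href": "https://example.org/collections/mycat/items/rec123"},
    {"rel": "related", "href": "https://example.org/collections/mycat/items/rec124"}
  ]
}
```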
UNHARVEST
To unharvest a previous harvest, you simply use DELETE on the harvest resource URL. The server would delete all the catalogue records created by the harvest, along with the harvest resource itself, so that a subsequent GET on the harvest endpoint would no longer list hid57 (in my example).
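In other words, sticking with the hid57 example and the assumed endpoint:

```python
import requests

# Deleting the harvest resource also removes the catalogue records it created.
requests.delete("https://example.org/harvest/hid57").raise_for_status()
```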
So, this is my thinking about harvesting so far. I have started implementing some of this to see how it flies!
Comments welcome!