netwerk-digitaal-erfgoed / dataset-register

Components (API and crawler) for the NDE Dataset Register
https://datasetregister.netwerkdigitaalerfgoed.nl/api/
European Union Public License 1.2

Data change frequency #772

Open ddeboer opened 1 year ago

ddeboer commented 1 year ago

Requested by @wouterbeek for the purpose of data caching: a way for publishers to indicate how often their data changes. We’re talking here about the distribution’s data, not the dataset description itself.

A proposal: to the list of dataset attributes, add a recommended property event.eventSchedule.repeatFrequency that holds the update frequency in ISO 8601 duration format.

{
  "@context": "https://schema.org/",
  "@type": "DataDownload",
  "encodingFormat": "application/sparql-results+xml",
  "contentUrl": "http://vocab.getty.edu/sparql",
  "dateModified": "2023-08-15",
  "event": {
    "@type": "PublicationEvent",
    "eventSchedule": {
      "@type": "Schedule",
      "repeatFrequency": "P1W"
    }
  }
}
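To illustrate how a caching consumer might use this, here is a minimal Python sketch that reads `dateModified` and `repeatFrequency` from a distribution like the one above and estimates when the data is next expected to change. The helper names `parse_duration` and `next_expected_change` are hypothetical, and the parser deliberately supports only week, day and time designators (calendar months and years have no fixed length). It also assumes the property ends up being called `event` as proposed here.

```python
import re
from datetime import date, timedelta

# Hypothetical helper: accepts only ISO 8601 duration parts with a fixed length
# (weeks, days, hours, minutes, seconds). Months and years are rejected.
_DURATION = re.compile(
    r"^P(?:(?P<weeks>\d+)W)?(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_duration(value: str) -> timedelta:
    """Convert a duration such as 'P1W' or 'P2DT12H' into a timedelta."""
    match = _DURATION.match(value)
    parts = {k: int(v) for k, v in match.groupdict().items() if v} if match else {}
    if not parts:
        raise ValueError(f"Unsupported ISO 8601 duration: {value!r}")
    return timedelta(**parts)


def next_expected_change(distribution: dict) -> date:
    """Estimate the next change as dateModified + repeatFrequency."""
    modified = date.fromisoformat(distribution["dateModified"])
    # Assumes the nesting from the proposal: event -> eventSchedule -> repeatFrequency.
    frequency = distribution["event"]["eventSchedule"]["repeatFrequency"]
    return modified + parse_duration(frequency)


example = {
    "@type": "DataDownload",
    "dateModified": "2023-08-15",
    "event": {
        "@type": "PublicationEvent",
        "eventSchedule": {"@type": "Schedule", "repeatFrequency": "P1W"},
    },
}
print(next_expected_change(example))
```

A cache could treat its copy as fresh until that date and only then re-fetch the distribution.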

The NDE Knowledge Graph, which regularly crawls datasets, could help by:

  1. if not supplied by the publisher, heuristically (but not strictly) detecting dateModified by comparing the current number of triples with the number found during the last crawl
  2. if not supplied by the publisher, storing the last n dateModified values and (again, heuristically) deriving a repeatFrequency from them.
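The two heuristics above could be sketched roughly as follows. This is only an illustration, not how the NDE Knowledge Graph crawler works: the function names, the default of n = 5, the minimum of three observations and the use of the median gap are all assumptions made for the example.

```python
from datetime import date
from statistics import median
from typing import List, Optional


def detect_date_modified(
    previous_triples: Optional[int],
    current_triples: int,
    last_modified: Optional[date],
    crawl_date: date,
) -> Optional[date]:
    """Heuristic 1: treat a changed triple count as a modification.

    An unchanged count does not prove the data is unchanged, so this is
    a hint rather than a strict check.
    """
    if previous_triples is not None and current_triples != previous_triples:
        return crawl_date
    return last_modified


def derive_repeat_frequency(modified_dates: List[date], n: int = 5) -> Optional[str]:
    """Heuristic 2: derive an ISO 8601 duration from the last n dateModified values."""
    recent = sorted(set(modified_dates))[-n:]
    if len(recent) < 3:  # assumption: too few observations to guess a frequency
        return None
    gaps = [(later - earlier).days for earlier, later in zip(recent, recent[1:])]
    days = round(median(gaps))  # the median dampens the effect of one missed update
    return f"P{days // 7}W" if days % 7 == 0 else f"P{days}D"


history = [date(2023, 7, 25), date(2023, 8, 1), date(2023, 8, 8), date(2023, 8, 15)]
print(derive_repeat_frequency(history))
```

Because the derived value is a guess, it would probably make sense to use it only when the publisher has not supplied a repeatFrequency.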

/cc @coret @rcdeboer

coret commented 1 year ago

Good idea!

Looking at schema:DataDownload, shouldn't this be publication (instead of event)?

"A way for publishers to indicate how often their data changes" feels like a promise (of a future/recurring event), not an event that occurred. But schema:eventSchedule covers this:

[...] There are circumstances where it is preferable to share a schedule for a series of repeating events rather than data on the individual events themselves. For example, a website or application might prefer to publish a schedule for a weekly gym class rather than provide data on every event. A schedule could be processed by applications to add forthcoming events to a calendar. [...]

Additionally, we have to determine how to store this schema.org-based piece of information in our triplestore/KG, which is DCAT-based. Strangely, in DCAT I only see a frequency property for the Dataset class.