iodepo / odis-arch

Development of the Ocean Data and Information System (ODIS) architecture
https://book.oceaninfohub.org/

comments on JSON serialization, review of krillMetadata.json #368

Open smrgeoinfo opened 7 months ago

smrgeoinfo commented 7 months ago

review of krillMetadata.json

when this schema.org JSON-LD is mapped to triples, there are problems:

an http URI has a URL.... <https://doi.org/10.5066/F7VX0DMQ> <https://schema.org/url> "https://doi.org/10.5066/F7VX0DMQ" . Isn't this a URL for the landing page? Recommendation would be that the dataset/url is the landing page for the dataset.

JSON doc has geometry? <https://registry.org/permanentUrlToThisJsonDoc> <http://www.opengis.net/ont/geosparql#hasGeometry> _:b0 . NO, the dataset has some geometry, not the metadata record.

JSON doc is a Dataset <https://registry.org/permanentUrlToThisJsonDoc> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://schema.org/Dataset> . Well... technically I suppose it's a Dataset, but it makes more sense to me to type it as 'DigitalDocument', recognizing the metadata record as a digital object.

JSON doc has a vessel named "Saga Sea" <https://registry.org/permanentUrlToThisJsonDoc> <https://schema.org/additionalProperty> _:b1 . Wrong. This is not a property of ThisJsonDoc. It should be a dataset/variableMeasured/PropertyValue; see https://github.com/ESIPFed/science-on-schema.org/blob/master/guides/Dataset.md#variables

JSON doc has SonarModel ES80 <https://registry.org/permanentUrlToThisJsonDoc> <https://schema.org/additionalProperty> _:b2 . Wrong... same for additional properties. see above

JSON doc has an identifier... (wait a minute--doesn't that identify the dataset?) <https://registry.org/permanentUrlToThisJsonDoc> <https://schema.org/identifier> <https://doi.org/10.5066/F7VX0DMQ> . Wrong

JSON doc has a URL, that is the landing page for the dataset (or maybe even the dataset)... <https://registry.org/permanentUrlToThisJsonDoc> <https://schema.org/url> "https://urlToTheDatasetOrLandingPage.org/" . https://registry.org/permanentUrlToThisJsonDoc should resolve to this JSON doc; the provided URL does not.

The JSON doc is the subject of an event.... <https://registry.org/permanentUrlToThisJsonDoc> <https://schema.org/subjectOf> _:b10 (event, name "Concise and descriptive name of the Event" .) NO, the dataset is the RESULT of an observation event. Actually a collection of observation events, generalized to a single activity. Unfortunately SDO doesn't offer an Activity that has results/products like data... You have to shoehorn what you want to represent into SDO. In the CDIF discovery recommendations draft, sdo:subjectOf inside the documentation for a creativeWork (e.g. dataset) points at the metadata record about that work (which is likely to be 'Self'); this is incompatible with dataset-subjectOf-someEvent. It seems less contorted to use measurementMethod or measurementTechnique (listed as properties of dataset) with Event as the value of those properties; this is not consistent with rangeIncludes for those properties, but it makes a lot more sense to me...

Basic metadata record outline in draft CDIF recommendations. See also https://github.com/ESIPFed/science-on-schema.org/issues/245. This recommended approach avoids the above problems.

 {
    "@context": {  "@vocab": "https://schema.org/"   },
    "@type": "DigitalDocument",
    "@id": "https://registry.org/permanentUrlToThisJsonDoc--id for the metadata record",
    "name": "D20220226-T144737.metadata",
    ... other properties of the metadata record here
    "about":
    {
        "@type": "Dataset",
        "@id": "DOI or similar ID for the dataset",
        "name": "name of dataset",
        ... properties of the dataset
        "subjectOf": { "@id": "https://registry.org/permanentUrlToThisJsonDoc" }
    }
 }

jmckenna commented 7 months ago

@smrgeoinfo note: be sure to always use the master branch for the templates that you are testing. Here is your krillMetadata.json file: https://github.com/iodepo/odis-arch/blob/master/book/thematics/dataset/graphs/krillMetadata.json

pbuttigieg commented 7 months ago

CC @fils

@smrgeoinfo very interesting - as we pull all these documents into a KG that's understood as a metadata graph, it's generally been clear that the JSON-LD doc is identifying a node about something else, and not itself. This seems normative (in practice, it's "non-normative" in the docs).

Therefore, the identifiers, name, type, etc all pertain to the thing the node is about, rather than the JSON-LD document. The various triples you have above can also be read as "this node represents a vehicle", "this node has an identifier X", etc. I don't think that's wrong. The JSON-LD document itself has been transmuted to a node in a graph.

 {
    "@context": {  "@vocab": "https://schema.org/"   },
    "@type": "DigitalDocument",
    "@id": "https://registry.org/permanentUrlToThisJsonDoc--id for the metadata record",
    "name": "D20220226-T144737.metadata",
    ... other properties of the metadata record here
    "about":
    {
        "@type": "Dataset",
        "@id": "DOI or similar ID for the dataset",
        "name": "name of dataset",
        ... properties of the dataset
        "subjectOf": { "@id": "https://registry.org/permanentUrlToThisJsonDoc" }
    }
 }

I get this, but I find it quite odd that this would be expected / necessary. I mean, no JSON-LD document that is of @type: "Vehicle" is claiming to be a vehicle. This would add a node to the graph about the JSON-LD document itself, which is a bit superfluous.

pbuttigieg commented 7 months ago

We discussed this further today in our tech team meeting, and we also are concerned about the consequences this will have on the graph itself - the DigitalDocument encapsulation is going to add a non-informative cluster of nodes around each informative node that's going to mess up any downstream graph operations. We could of course strip it away, but then this seems redundant and creating more work than is necessary.

I would be very concerned if CDIF is recommending this (noting that no formal recommendations have been made yet, thus it's somewhat wrong to invoke them).

smrgeoinfo commented 7 months ago

"DigitalDocument encapsulation is going to add a non-informative cluster of nodes"

Only if you consider information about the provenance of the metadata record itself as 'non-information', and the ability to make statements (annotation) about the metadata record as superfluous, in which case as you point out the harvester could just take the 'about' part of the metadata record. Can you clarify how statements about the metadata records 'mess up' downstream graph operations--are those operations that brittle?
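The "just take the 'about' part" option can be sketched roughly as follows. This is only an illustration: `unwrap_about` is a hypothetical helper name, not part of any ODIS or CDIF tooling, and it assumes the record has already been parsed into a Python dict with plain schema.org type names.

```python
def unwrap_about(record):
    """Return the node a harvester would index.

    Hypothetical helper: if the record is typed as a DigitalDocument and
    carries an 'about' object, return that object (the dataset description);
    otherwise return the record unchanged.
    """
    types = record.get("@type", [])
    if isinstance(types, str):
        types = [types]  # "@type" may be a single string or a list
    about = record.get("about")
    if "DigitalDocument" in types and isinstance(about, dict):
        return about
    return record


# Example records; the identifiers are the placeholders used in this thread.
wrapped = {
    "@type": "DigitalDocument",
    "@id": "https://registry.org/permanentUrlToThisJsonDoc",
    "about": {
        "@type": "Dataset",
        "@id": "https://doi.org/10.5066/F7VX0DMQ",
        "name": "name of dataset",
    },
}
flat = {"@type": "Dataset", "name": "name of dataset"}

print(unwrap_about(wrapped)["@type"])
print(unwrap_about(flat)["@type"])
```

Under these assumptions both calls yield a Dataset node, so downstream graph operations would see the same shape whether or not the record was wrapped.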

The @id identifies the thing that becomes the subject of RDF triples generated from the JSON-LD. I don't think the "JSON-LD doc is identifying a node", the @id is identifying a node. That node can represent an information object (typically a digital object, could be 'self') or a non-information (physical, abstract...) object.
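
To make the point about @id concrete, here is a toy sketch of how the subject of each triple follows the @id of the node it comes from. It is not a conforming JSON-LD processor: the `flatten` function is hypothetical, it assumes every key resolves against the schema.org @vocab, and it does not distinguish IRIs from literals. The DOI is used as the dataset @id purely for illustration.

```python
from itertools import count

SDO = "https://schema.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DOC = "https://registry.org/permanentUrlToThisJsonDoc"
DOI = "https://doi.org/10.5066/F7VX0DMQ"


def flatten(node, triples, blank_ids):
    """Append (subject, predicate, object) tuples for node; return its subject.

    The subject is the node's "@id", or a generated blank node label
    when no "@id" is present.
    """
    subject = node.get("@id") or f"_:b{next(blank_ids)}"
    for key, value in node.items():
        if key in ("@id", "@context"):
            continue
        values = value if isinstance(value, list) else [value]
        for v in values:
            if key == "@type":
                triples.append((subject, RDF_TYPE, SDO + v))
            elif isinstance(v, dict):
                # Nested node: its own "@id" becomes the subject of its triples.
                triples.append((subject, SDO + key, flatten(v, triples, blank_ids)))
            else:
                triples.append((subject, SDO + key, v))
    return subject


record = {
    "@context": {"@vocab": SDO},
    "@type": "DigitalDocument",
    "@id": DOC,
    "name": "D20220226-T144737.metadata",
    "about": {
        "@type": "Dataset",
        "@id": DOI,
        "name": "name of dataset",
        "subjectOf": {"@id": DOC},
    },
}

triples = []
flatten(record, triples, count())
for triple in triples:
    print(triple)
```

In this sketch, statements about the metadata record have the record's URL as subject, while the dataset's name and subjectOf have the DOI as subject, so the two kinds of statement stay separate in the graph.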