iodepo / odis-arch

Development of the Ocean Data and Information System (ODIS) architecture
https://book.odis.org/

Book: annotated Dataset example comparison to CDIF #472

Open smrgeoinfo opened 2 days ago

smrgeoinfo commented 2 days ago

Comparing the example at https://book.odis.org/thematics/dataset/index.html#id1 with the CDIF recommendations (https://cross-domain-interoperability-framework.github.io/cdifbook/metadata/schemaorgimplementation.html#schema-org-implementation-of-cdif-metadata) and the examples at https://github.com/Cross-Domain-Interoperability-Framework/cdifbook/tree/main/examples

{
    "@context": {
        "@vocab": "https://schema.org/"
    },
    "@type": "Dataset",
    "@id": "https://example.org/permanentUrlToThisJsonDoc",
    "name": "A concise but descriptive name of the dataset",
    "description": "An extended, free-text description of what's in the dataset, who created it, and other attributes",
    "url": "https://urlToTheDatasetOrLandingPage.org/",

Good alignment up to here; CDIF also declares the dcterms: prefix in the @context.
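A minimal sketch of what a combined context might look like, keeping the ODIS `@vocab` and adding a `dcterms` prefix. The namespace IRI is the standard Dublin Core Terms one; whether CDIF examples pair it with `@vocab` or with an explicit `schema:` prefix is an assumption to check against the CDIF examples repository.

```json
{
    "@context": {
        "@vocab": "https://schema.org/",
        "dcterms": "http://purl.org/dc/terms/"
    },
    "@type": "Dataset",
    "@id": "https://example.org/permanentUrlToThisJsonDoc",
    "name": "A concise but descriptive name of the dataset"
}
```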

    "sameAs": [
        "http://alternativeUrlToTheDatasetOrLandingPage.org"
    ],

It's not clear what the point of this is. It looks like an alternate link to the landing page, so it's sameAs the landing page?

    "license": "This work is licensed under a Creative Commons Attribution (CC-BY) 4.0 License",

Good alignment

    "citation": [
        "Citation to other work relevant to this dataset",
        "Citation to other work relevant to this dataset",
        "Citation to other work relevant to this dataset"
    ],

CDIF doesn't include citation in its recommendation. Personally, I'd recommend against using it because it's so frequently misunderstood. In CDIF, if you want to link to 'other work relevant to this dataset', use schema:relatedLink.
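For illustration, a related-work link might be sketched as below. This assumes the `LinkRole` / `linkRelationship` / `target` pattern used in schema.org-based data discovery guidance; the exact shape CDIF expects should be confirmed against the CDIF book, and the `example.org` URLs and text values are hypothetical placeholders.

```json
{
    "@context": {
        "@vocab": "https://schema.org/"
    },
    "@type": "Dataset",
    "@id": "https://example.org/hypotheticalDatasetId",
    "relatedLink": [
        {
            "@type": "LinkRole",
            "linkRelationship": "Describes other work relevant to this dataset",
            "target": {
                "@type": "EntryPoint",
                "name": "Title of the related work",
                "url": "https://example.org/hypotheticalRelatedWork"
            }
        }
    ]
}
```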

"version": "2021-04-24T06:34:56.000Z",

Good alignment

    "keywords": [
        "Keyword 1",
        "Keyword 2",
        "Keyword 3"
    ],

Partial alignment; CDIF recommends using schema:DefinedTerm for keywords drawn from an identifiable controlled vocabulary.
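A sketch of how free-text and controlled-vocabulary keywords could sit side by side. `DefinedTerm`, `identifier`, and `inDefinedTermSet` are standard schema.org names; the vocabulary URLs are hypothetical, and which of these properties CDIF treats as required is not stated here.

```json
{
    "@context": {
        "@vocab": "https://schema.org/"
    },
    "@type": "Dataset",
    "keywords": [
        "Keyword 1",
        {
            "@type": "DefinedTerm",
            "name": "Keyword 2",
            "identifier": "https://example.org/hypotheticalVocabulary/keyword2",
            "inDefinedTermSet": "https://example.org/hypotheticalVocabulary"
        }
    ]
}
```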

"measurementTechnique": "The URL to or text about the methods, technique or technology used to generate this Dataset", 

Not in the CDIF recommendation. For this kind of provenance information in CDIF discovery, the recommendation is to use prov:wasGeneratedBy to link to sensors, instruments, software, and algorithms. For a full description of data creation, use the CDIF data integration profile.
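A rough sketch of the provenance link, assuming a `prov:Activity` node whose `prov:used` entries point at the instrument or software. `prov:wasGeneratedBy`, `prov:Activity`, and `prov:used` are standard PROV-O terms; the nesting shown and the instrument URL are assumptions, not taken from the CDIF text.

```json
{
    "@context": {
        "@vocab": "https://schema.org/",
        "prov": "http://www.w3.org/ns/prov#"
    },
    "@type": "Dataset",
    "@id": "https://example.org/hypotheticalDatasetId",
    "prov:wasGeneratedBy": [
        {
            "@type": "prov:Activity",
            "prov:used": [
                {
                    "@type": "Thing",
                    "name": "Name of the instrument, sensor, software or algorithm",
                    "url": "https://example.org/hypotheticalInstrumentDescriptor"
                }
            ]
        }
    ]
}
```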

    "variableMeasured": [
        {
            "@type": "PropertyValue",
            "name": "Name of a variable in the dataset",
            "description": "Extended description of this variable"
        },
        {
            "@type": "PropertyValue",
            "name": "Name of a variable in the dataset",
            "url": "http://ontology.org/uriToSemanticDescriptorOfThisVariable",
            "description": "Extended description of this variable?"
        },
        {
            "@type": "PropertyValue",
            "name": "SamplingDeviceApertureSurfaceArea",
            "url": "http://ontology.org/uriToSemanticDescriptorOfThisVariable",
            "description": "Extended description of this variable"
        }
    ],

Partial alignment. CDIF also allows schema:StatisticalVariable as a value of schema:variableMeasured. For schema:PropertyValue, the guidance is "Variable must have a name and description, should have a propertyID with URI for the represented concept. The URI in the propertyID provides the semantic linkage for meaning of the variable."
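Following the quoted guidance, the third variable from the example could be sketched with the concept URI carried in `propertyID` instead of `url`. The values are the placeholders from the ODIS example; whether CDIF expects `propertyID` as a single string or an array is not settled by this sketch.

```json
{
    "@context": {
        "@vocab": "https://schema.org/"
    },
    "@type": "Dataset",
    "variableMeasured": [
        {
            "@type": "PropertyValue",
            "name": "SamplingDeviceApertureSurfaceArea",
            "description": "Extended description of this variable",
            "propertyID": "http://ontology.org/uriToSemanticDescriptorOfThisVariable"
        }
    ]
}
```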

    "includedInDataCatalog": {
        "@id": "https://registryOfCatalogs.org/permanentUrlIdentifiyingCatalog",
        "@type": "DataCatalog",
        "url": "https://urlOfDataCatalog.org"
    },

Not in the CDIF Discovery recommendation. Is this supposed to identify the source of the metadata record? If so, it should be in the 'metadata about the metadata' section that CDIF recommends (https://cross-domain-interoperability-framework.github.io/cdifbook/metadata/contentmodel.html#properties-for-metadata-management). Usually the actual dataset that the metadata describes is held in a repository, which is not generally referred to as a 'DataCatalog'.

    "temporalCoverage": "2007/2007",
    "distribution": {
        "@type": "DataDownload",
        "contentUrl": "http://urlToDirectDownloadOfThisDataset.org/",
        "encodingFormat": "text/csv"
    },

Good alignment. CDIF also includes a recommendation for API-based data distribution (https://cross-domain-interoperability-framework.github.io/cdifbook/metadata/schemaorgimplementation.html#service-based-distribution), analogous to dcat:accessService.

    "spatialCoverage": {
        "@type": "Place",
        "geo": {
            "@type": "GeoShape",
            "description": "schema.org expects lat long (Y X) coordinate order",
            "polygon": "10.161667 142.014,18.033833 142.014,18.033833 147.997833,10.161667 147.997833,10.161667 142.014"
        },
        "additionalProperty": {
            "@type": "PropertyValue",
            "propertyID": "https://dbpedia.org/page/Spatial_reference_system",
            "value": "https://www.w3.org/2003/01/geo/wgs84_pos"
        }
    },

CDIF requires a schema:box, schema:line, schema:point, or a named place (Place/name with a string or DefinedTerm). Guidance for box: "For bounding box specification of the spatial extent of resource content. See ESIP SOSO for details. Recommend including only one bounding box; behavior of harvesting clients when multiple geometries are specified is unpredictable". CDIF also provides for optional geographic extents using other, more interoperable geometries; GeoSPARQL is recommended (see Ocean InfoHub). Other geometry schemes might be specified in a specific domain profile, e.g. for atmospheric or subsurface data, or for local coordinate systems.
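As a sketch, the polygon in the example could be reduced to a single bounding box. The values are the minimum and maximum latitude and longitude of the polygon's corners, written in the "south west north east" order that the ESIP SOSO guidance describes for `box`; that ordering is an assumption to verify against SOSO before relying on it.

```json
{
    "@context": {
        "@vocab": "https://schema.org/"
    },
    "@type": "Dataset",
    "spatialCoverage": {
        "@type": "Place",
        "geo": {
            "@type": "GeoShape",
            "box": "10.161667 142.014 18.033833 147.997833"
        }
    }
}
```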

    "provider": [
        {
            "@type": "Organization",
            "legalName": "Legal Name of Organisation which generated the dataset",
            "name": "Other Name of Organisation which generated the dataset",
            "url": "https://organisationWebsite.org/"
        }
    ],

CDIF guidance is that provider is the contact point for the agent responsible for a resource distribution; this is different from 'agent that generated the dataset'.

    "subjectOf": {
        "@type": "Event",
        "description": "Describe the event which is the subject of this dataset. For example, a cruise ID.",
        "name": "Concise and descriptive name of the Event",
        "potentialAction": {
            "@type": "Action",
            "name": "Concise but descriptive name of action that was part of an Event. For example, the name of a CTD cast",
            "agent": [
                "Name or permanent ID of person or thing that performed this action",
                "Name or permanent ID of person or thing that performed this action",
                "Name or permanent ID of person or thing that performed this action"
            ],
            "startTime": "2007-03-11T14:45UTC",
            "endTime": "2007-03-11T15:42UTC",

            "instrument": {
                "@type": "Thing",
                "name": "The name of the instrument used in the action. For example, the specific model of a CTD, a glider, a moored sensor",
                "url": "http://ontology.org/uriToSemanticDescriptorOfThisInstrument",
                "description": "Extended description of the sampling instrument"
            }
        }
    }
}

CDIF uses 'subjectOf' for the graph node with metadata about the metadata record (dateModified, conformsTo, responsible parties...). It's not clear from the example here what the intention is. The schema.org guidance for subjectOf is that its value is "A CreativeWork or Event about this Thing." So this example would appear to document some Event or CreativeWork that is about the described dataset. My suspicion is that it's supposed to document the workflow that created the dataset. CDIF recommends using prov:wasGeneratedBy to document the instruments, sensors, algorithms, software, etc. used to create the dataset, and prov:wasDerivedFrom to document resources (e.g. source datasets) that were used to create the described dataset. CDIF would link to CreativeWorks about the resource using relatedLink.
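A minimal sketch of the metadata-about-metadata usage, assuming the record node is typed as a generic `CreativeWork` (which satisfies the schema.org range quoted above). The node type CDIF actually prescribes, the identifiers, the date, and the profile URI are all hypothetical placeholders; only `dateModified`, `conformsTo`, and a responsible party are drawn from the comment.

```json
{
    "@context": {
        "@vocab": "https://schema.org/",
        "dcterms": "http://purl.org/dc/terms/"
    },
    "@type": "Dataset",
    "@id": "https://example.org/hypotheticalDatasetId",
    "subjectOf": {
        "@type": "CreativeWork",
        "@id": "https://example.org/hypotheticalMetadataRecordId",
        "about": {
            "@id": "https://example.org/hypotheticalDatasetId"
        },
        "dateModified": "2021-04-24",
        "dcterms:conformsTo": {
            "@id": "https://example.org/hypotheticalProfileUri"
        },
        "maintainer": {
            "@type": "Organization",
            "name": "Name of the organisation responsible for the metadata record"
        }
    }
}
```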

smrgeoinfo commented 2 days ago

Done for now