iodepo / odis-arch

Development of the Ocean Data and Information System (ODIS) architecture
https://book.oceaninfohub.org/

Select DwC fields to embed in JSON-LD records #447

Open pbuttigieg opened 6 days ago

pbuttigieg commented 6 days ago

@pieterprovoost

Some non-redundant DwC properties can be of high value for ODIS-level discovery, but @pieterprovoost notes that embedding all DwC fields in full is prohibitive (JSON-LD documents in the gigabyte range would be expected).

In this issue, we'll triage which DwC terms should be embedded in the 'additionalProperty' or 'variableMeasured' schema.org properties.
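To make the target concrete, here is a minimal sketch of how a single DwC term could ride along in a Dataset record as a schema.org PropertyValue. The helper name `dwc_property`, the example dataset `@id`, and the example values are hypothetical; whether 'additionalProperty' or 'variableMeasured' is the right carrier for a given term is exactly what this issue needs to settle.

```python
import json

DWC_NS = "http://rs.tdwg.org/dwc/terms/"  # Darwin Core terms namespace


def dwc_property(term, value):
    """Wrap one DwC term/value pair as a schema.org PropertyValue (hypothetical helper)."""
    return {
        "@type": "PropertyValue",
        "propertyID": DWC_NS + term,
        "name": "dwc:" + term,
        "value": value,
    }


dataset = {
    "@context": {"@vocab": "https://schema.org/"},
    "@type": "Dataset",
    "@id": "https://example.org/dataset/123",  # placeholder identifier
    "name": "Example occurrence dataset",
    "additionalProperty": [
        dwc_property("basisOfRecord", "HumanObservation"),
        dwc_property("samplingProtocol", "bottom trawl"),
    ],
}

print(json.dumps(dataset, indent=2))
```

Keeping the full term IRI in `propertyID` means a harvester can still recognise the DwC origin of a value even when the human-readable `name` changes.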

pbuttigieg commented 6 days ago

For reference

Classes

dwc:Dataset | dwc:Event | dwc:EventAttribute | dwc:EventMeasurement | dwc:FossilSpecimen | dwc:GeologicalContext | dwc:HumanObservation | dwc:Identification | dwc:LivingSpecimen | dcterms:Location | dwc:MachineObservation | dwc:MaterialCitation | dwc:MaterialEntity | dwc:MaterialSample | dwc:MeasurementOrFact | dwc:Occurrence | dwc:OccurrenceMeasurement | dwc:Organism | dwc:PreservedSpecimen | dwc:ResourceRelationship | dwc:Sample | dwc:SampleAttribute | dwc:SamplingEvent | dwc:SamplingLocation | dwc:Taxon

Record level

dwc:accordingTo | dwc:accuracy | dwc:basisOfRecord | dwc:collectionCode | dwc:collectionID | dwc:dataGeneralizations | dwc:datasetID | dwc:datasetName | dwc:DwCType | dwc:dynamicProperties | dwc:Generalizations | dwc:informationWithheld | dwc:institutionCode | dwc:institutionID | dwc:ownerInstitutionCode

Dublin Core legacy namespace

dc:language | dc:type

Dublin Core terms namespace

dcterms:accessRights | dcterms:bibliographicCitation | dcterms:language | dcterms:license | dcterms:modified | dcterms:references | dcterms:rights | dcterms:rightsHolder | dcterms:type

Occurrence

dwc:associatedMedia | dwc:associatedOccurrences | dwc:associatedReferences | dwc:associatedTaxa | dwc:behavior | dwc:caste | dwc:catalogNumber | dwc:CatalogNumberNumeric | dwc:degreeOfEstablishment | dwc:establishmentMeans | dwc:georeferenceVerificationStatus | dwc:individualCount | dwc:individualID | dwc:lifeStage | dwc:occurrenceAttributes | dwc:occurrenceDetails | dwc:occurrenceID | dwc:occurrenceRemarks | dwc:occurrenceStatus | dwc:organismQuantity | dwc:organismQuantityType | dwc:otherCatalogNumbers | dwc:pathway | dwc:recordedBy | dwc:recordedByID | dwc:recordNumber | dwc:reproductiveCondition | dwc:sex | dwc:vitality

Organism

dwc:associatedOrganisms | dwc:organismID | dwc:organismName | dwc:organismRemarks | dwc:organismScope | dwc:previousIdentifications

Material Entity

dwc:associatedSequences | dwc:disposition | dwc:materialEntityID | dwc:materialEntityRemarks | dwc:preparations | dwc:verbatimLabel

Material Sample

dwc:materialSampleID

Event

dwc:day | dwc:EarliestDateCollected | dwc:endDayOfYear | dwc:EndTimeOfDay | dwc:eventAttributes | dwc:eventDate | dwc:eventID | dwc:eventRemarks | dwc:eventTime | dwc:eventType | dwc:fieldNotes | dwc:fieldNumber | dwc:habitat | dwc:LatestDateCollected | dwc:month | dwc:parentEventID | dwc:sampleSizeUnit | dwc:sampleSizeValue | dwc:samplingEffort | dwc:samplingProtocol | dwc:startDayOfYear | dwc:StartTimeOfDay | dwc:verbatimEventDate | dwc:year

Location

dwc:continent | dwc:coordinatePrecision | dwc:coordinateUncertaintyInMeters | dwc:country | dwc:countryCode | dwc:county | dwc:decimalLatitude | dwc:decimalLongitude | dwc:footprintSpatialFit | dwc:footprintSRS | dwc:footprintWKT | dwc:geodeticDatum | dwc:georeferencedBy | dwc:georeferencedDate | dwc:georeferenceProtocol | dwc:georeferenceRemarks | dwc:georeferenceSources | dwc:higherGeography | dwc:higherGeographyID | dwc:island | dwc:islandGroup | dwc:locality | dwc:locationAccordingTo | dwc:locationAttributes | dwc:locationID | dwc:locationRemarks | dwc:maximumDepthInMeters | dwc:maximumDistanceAboveSurfaceInMeters | dwc:maximumElevationInMeters | dwc:minimumDepthInMeters | dwc:minimumDistanceAboveSurfaceInMeters | dwc:minimumElevationInMeters | dwc:municipality | dwc:pointRadiusSpatialFit | dwc:SamplingLocationID | dwc:SamplingLocationRemarks | dwc:stateProvince | dwc:verbatimCoordinates | dwc:verbatimCoordinateSystem | dwc:verbatimDepth | dwc:verbatimElevation | dwc:verbatimLatitude | dwc:verbatimLocality | dwc:verbatimLongitude | dwc:verbatimSRS | dwc:verticalDatum | dwc:waterBody

Geological Context

dwc:bed | dwc:earliestAgeOrLowestStage | dwc:earliestEonOrLowestEonothem | dwc:earliestEpochOrLowestSeries | dwc:earliestEraOrLowestErathem | dwc:earliestPeriodOrLowestSystem | dwc:formation | dwc:geologicalContextID | dwc:group | dwc:highestBiostratigraphicZone | dwc:latestAgeOrHighestStage | dwc:latestEonOrHighestEonothem | dwc:latestEpochOrHighestSeries | dwc:latestEraOrHighestErathem | dwc:latestPeriodOrHighestSystem | dwc:lithostratigraphicTerms | dwc:lowestBiostratigraphicZone | dwc:member

Identification

dwc:dateIdentified | dwc:identificationAttributes | dwc:identificationID | dwc:identificationQualifier | dwc:identificationReferences | dwc:identificationRemarks | dwc:identificationVerificationStatus | dwc:identifiedBy | dwc:identifiedByID | dwc:PreviousIdentifications | dwc:typeStatus | dwc:verbatimIdentification

Taxon

dwc:acceptedNameUsage | dwc:acceptedNameUsageID | dwc:acceptedScientificName | dwc:acceptedScientificNameID | dwc:AcceptedTaxon | dwc:AcceptedTaxonID | dwc:acceptedTaxonID | dwc:acceptedTaxonName | dwc:acceptedTaxonNameID | dwc:basionym | dwc:basionymID | dwc:binomial | dwc:class | dwc:cultivarEpithet | dwc:family | dwc:genericName | dwc:genus | dwc:higherClassification | dwc:HigherTaxon | dwc:higherTaxonconceptID | dwc:HigherTaxonID | dwc:higherTaxonName | dwc:higherTaxonNameID | dwc:infragenericEpithet | dwc:infraspecificEpithet | dwc:kingdom | dwc:nameAccordingTo | dwc:nameAccordingToID | dwc:namePublicationID | dwc:namePublishedIn | dwc:namePublishedInID | dwc:namePublishedInYear | dwc:nomenclaturalCode | dwc:nomenclaturalStatus | dwc:order | dwc:originalNameUsage | dwc:originalNameUsageID | dwc:parentNameUsage | dwc:parentNameUsageID | dwc:phylum | dwc:scientificName | dwc:scientificNameAuthorship | dwc:scientificNameID | dwc:scientificNameRank | dwc:specificEpithet | dwc:subfamily | dwc:subgenus | dwc:subtribe | dwc:superfamily | dwc:taxonAccordingTo | dwc:taxonAttributes | dwc:taxonConceptID | dwc:TaxonID | dwc:taxonID | dwc:taxonNameID | dwc:taxonomicStatus | dwc:taxonRank | dwc:taxonRemarks | dwc:tribe | dwc:verbatimScientificNameRank | dwc:verbatimTaxonRank | dwc:vernacularName

Measurement or Fact

dwc:measurementAccuracy | dwc:measurementDeterminedBy | dwc:measurementDeterminedDate | dwc:measurementID | dwc:measurementMethod | dwc:measurementRemarks | dwc:measurementType | dwc:measurementUnit | dwc:measurementValue | dwc:parentMeasurementID

Resource Relationship

dwc:RelatedBasisOfRecord | dwc:relatedResourceID | dwc:relatedResourceType | dwc:relationshipAccordingTo | dwc:relationshipEstablishedDate | dwc:relationshipOfResource | dwc:relationshipOfResourceID | dwc:relationshipRemarks | dwc:resourceID | dwc:resourceRelationshipID

IRI-value terms

dwciri:behavior | dwciri:caste | dwciri:dataGeneralizations | dwciri:degreeOfEstablishment | dwciri:disposition | dwciri:earliestGeochronologicalEra | dwciri:establishmentMeans | dwciri:eventType | dwciri:fieldNotes | dwciri:fieldNumber | dwciri:footprintSRS | dwciri:footprintWKT | dwciri:fromLithostratigraphicUnit | dwciri:geodeticDatum | dwciri:georeferencedBy | dwciri:georeferenceProtocol | dwciri:georeferenceSources | dwciri:georeferenceVerificationStatus | dwciri:habitat | dwciri:identificationQualifier | dwciri:identificationVerificationStatus | dwciri:identifiedBy | dwciri:inCollection | dwciri:inDataset | dwciri:inDescribedPlace | dwciri:informationWithheld | dwciri:latestGeochronologicalEra | dwciri:lifeStage | dwciri:locationAccordingTo | dwciri:measurementDeterminedBy | dwciri:measurementMethod | dwciri:measurementType | dwciri:measurementUnit | dwciri:measurementValue | dwciri:occurrenceStatus | dwciri:organismQuantityType | dwciri:pathway | dwciri:preparations | dwciri:recordedBy | dwciri:recordNumber | dwciri:reproductiveCondition | dwciri:sampleSizeUnit | dwciri:samplingProtocol | dwciri:sex | dwciri:toTaxon | dwciri:typeStatus | dwciri:verbatimCoordinateSystem | dwciri:verbatimSRS | dwciri:verticalDatum | dwciri:vitality

pbuttigieg commented 6 days ago

@pieterprovoost here's a first triage from me.

Notes

Add to embedding in ODIS records (for taxonomic levels, @pieterprovoost noted these may only go down to order or family so as not to flood the metadata; the rest would be available in the OBIS records):

Map to schema.org properties

Spatial mapping to GeoJSON and/or schema.org spatial properties in their stanzas may be a bit involved, but worth it (I assume many of these are already mapped):

pieterprovoost commented 6 days ago

Most of these fields will have high cardinality, how do you envision we handle this in metadata documents?

pbuttigieg commented 6 days ago

"Most of these fields will have high cardinality, how do you envision we handle this in metadata documents?"

In the sense that a Dataset can be about potentially thousands of Taxa? Aggregation at higher ranks I think.
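A rough sketch of what aggregation at higher ranks could look like, assuming occurrence records arrive as dictionaries keyed by bare DwC term names. The function name `aggregate_taxa` and the sample records are illustrative only; representing the result as schema.org Taxon entries under 'about' is one option, not an agreed mapping.

```python
def aggregate_taxa(records, rank="family"):
    """Return the sorted distinct values of one higher rank across records."""
    return sorted({r[rank] for r in records if r.get(rank)})


# Illustrative occurrence records using DwC taxon terms as keys
records = [
    {"scientificName": "Gadus morhua", "family": "Gadidae", "order": "Gadiformes"},
    {"scientificName": "Melanogrammus aeglefinus", "family": "Gadidae", "order": "Gadiformes"},
    {"scientificName": "Clupea harengus", "family": "Clupeidae", "order": "Clupeiformes"},
]

families = aggregate_taxa(records, rank="family")

# One possible Dataset-level representation: a short array of Taxon nodes
about = [{"@type": "Taxon", "name": name, "taxonRank": "family"} for name in families]
print(about)
```

Three species collapse to two families here; on a real dataset with thousands of taxa the reduction would be much larger, which is the point of stopping at order or family.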

For Events, maybe taking the extreme values of space and time and creating an inclusive envelope (a bounding box plus a time range) to push to the Dataset metadata.
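As a sketch of that idea, the function below collapses a list of events into one bounding box and one time range. It assumes every event carries dwc:decimalLatitude, dwc:decimalLongitude and a single-day ISO 8601 dwc:eventDate (so that sorting the strings sorts them chronologically), and that the data do not cross the antimeridian. The function name `event_envelope` is hypothetical, and the "south west north east" ordering of the box follows common schema.org practice rather than anything decided in this thread.

```python
def event_envelope(events):
    """Collapse many events into one spatial box and one temporal range."""
    lats = [e["decimalLatitude"] for e in events]
    lons = [e["decimalLongitude"] for e in events]
    dates = sorted(e["eventDate"] for e in events)  # valid for YYYY-MM-DD strings only
    box = f"{min(lats)} {min(lons)} {max(lats)} {max(lons)}"
    return {
        "spatialCoverage": {
            "@type": "Place",
            "geo": {"@type": "GeoShape", "box": box},
        },
        "temporalCoverage": f"{dates[0]}/{dates[-1]}",
    }


# Illustrative events
events = [
    {"decimalLatitude": 51.2, "decimalLongitude": 2.9, "eventDate": "2019-03-01"},
    {"decimalLatitude": 51.5, "decimalLongitude": 3.4, "eventDate": "2021-07-15"},
    {"decimalLatitude": 51.3, "decimalLongitude": 3.1, "eventDate": "2020-01-20"},
]

envelope = event_envelope(events)
print(envelope)
```

Real dwc:eventDate values can also be intervals or partial dates, so a production version would need to parse them rather than sort strings.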

For things that hang off Occurrence, like dwc:habitat, that's trickier: arrays in Dataset properties like 'about' come to mind, but this may be one too many jumps.

If OBIS eventually releases truncated metadata about the other types (Events, Taxa, maybe Occurrences for specific species [e.g. of concern, keystones, invasives]) this would of course be easier from the Dataset metadata (via @id referencing). Maybe that can wait for that stage.

These are fields that I think would be useful for ODIS-level discovery of OBIS resources - if adding them is prohibitively complex or would put prohibitive demands on the systems involved, we can mark them for later consideration.

Could you check mark the terms above that you think are the most feasible to add now? We can discuss how to add some high-value ones that are harder in a meeting perhaps.

pbuttigieg commented 6 days ago

And I'm quite sure that some of the DwC value syntax will conflict with schema.org / OGC constraints - that's important to note, even if those properties don't make it into the JSON-LD/schema.org products. Those conflicts are a basis to trigger later alignment of the standards themselves, hopefully.
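One candidate mismatch, offered as an illustration rather than a confirmed finding from this triage: dwc:footprintWKT values are usually written in x y (longitude latitude) order with comma-separated points, while schema.org GeoShape polygons are conventionally a single space-delimited string of latitude longitude pairs. The sketch below converts between the two for the simplest case only (a POLYGON with one ring); the function name is hypothetical, and it assumes the WKT really is in longitude latitude order.

```python
import re


def wkt_polygon_to_schema_polygon(wkt):
    """Convert a single-ring WKT POLYGON (lon lat) to a schema.org polygon string (lat lon)."""
    match = re.fullmatch(r"\s*POLYGON\s*\(\(\s*(.*?)\s*\)\)\s*", wkt, flags=re.IGNORECASE)
    if match is None:
        raise ValueError("only single-ring POLYGON WKT is handled in this sketch")
    points = []
    for pair in match.group(1).split(","):
        lon, lat = pair.split()  # raises ValueError for holes or Z/M coordinates
        points.append(f"{lat} {lon}")
    return " ".join(points)


footprint = "POLYGON ((2.9 51.2, 3.4 51.2, 3.4 51.5, 2.9 51.5, 2.9 51.2))"
polygon = wkt_polygon_to_schema_polygon(footprint)
print(polygon)
```

Recording each such conversion rule alongside the triaged terms would give a concrete list to bring to any later alignment discussion between the standards.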