iodepo / odis-arch

Development of the Ocean Data and Information System (ODIS) architecture

Select DwC fields to embed in JSON-LD records #447

Open pbuttigieg opened 6 days ago

pbuttigieg commented 6 days ago


Some non-redundant DwC properties can be of high value for ODIS-level discovery, but @pieterprovoost notes that embedding all DwC fields would be prohibitive (GB-sized JSON-LD documents would be expected).

In this issue, we'll triage which DwC terms should be embedded in the 'additionalProperty' or 'variableMeasured' schema properties.
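
A minimal sketch of what such an embedding might look like, assuming a schema.org Dataset record that carries each DwC term as a PropertyValue node. The helper name `dwc_property`, the dataset name, and the terms and values chosen are illustrative assumptions, not decisions taken in this issue:

```python
import json

# Darwin Core terms namespace
DWC_NS = "http://rs.tdwg.org/dwc/terms/"


def dwc_property(term, value):
    """Wrap one DwC term/value pair as a schema.org PropertyValue (hypothetical helper)."""
    return {
        "@type": "PropertyValue",
        "propertyID": DWC_NS + term,
        "name": "dwc:" + term,
        "value": value,
    }


record = {
    "@context": {"@vocab": "https://schema.org/"},
    "@type": "Dataset",
    "name": "Example OBIS dataset",  # placeholder, not a real dataset
    # Which terms belong under which property is what this issue triages;
    # the split below is only an assumption for illustration.
    "additionalProperty": [
        dwc_property("basisOfRecord", "HumanObservation"),
        dwc_property("samplingProtocol", "bottom trawl"),
    ],
    "variableMeasured": [
        dwc_property("measurementType", "water temperature"),
    ],
}

print(json.dumps(record, indent=2))
```

Keeping the full term IRI in `propertyID` would let a harvester recognise the DwC origin of each value without parsing the `dwc:` prefix in `name`.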

pbuttigieg commented 6 days ago

For reference

Classes


dwc:Dataset | dwc:Event | dwc:EventAttribute | dwc:EventMeasurement | dwc:FossilSpecimen | dwc:GeologicalContext | dwc:HumanObservation | dwc:Identification | dwc:LivingSpecimen | dcterms:Location | dwc:MachineObservation | dwc:MaterialCitation | dwc:MaterialEntity | dwc:MaterialSample | dwc:MeasurementOrFact | dwc:Occurrence | dwc:OccurrenceMeasurement | dwc:Organism | dwc:PreservedSpecimen | dwc:ResourceRelationship | dwc:Sample | dwc:SampleAttribute | dwc:SamplingEvent | dwc:SamplingLocation | dwc:Taxon

Record level

dwc:accordingTo | dwc:accuracy | dwc:basisOfRecord | dwc:collectionCode | dwc:collectionID | dwc:dataGeneralizations | dwc:datasetID | dwc:datasetName | dwc:DwCType | dwc:dynamicProperties | dwc:Generalizations | dwc:informationWithheld | dwc:institutionCode | dwc:institutionID | dwc:ownerInstitutionCode

Dublin Core legacy namespace

dc:language | dc:type

Dublin Core terms namespace

dcterms:accessRights | dcterms:bibliographicCitation | dcterms:language | dcterms:license | dcterms:modified | dcterms:references | dcterms:rights | dcterms:rightsHolder | dcterms:type

Occurrence


dwc:associatedMedia | dwc:associatedOccurrences | dwc:associatedReferences | dwc:associatedTaxa | dwc:behavior | dwc:caste | dwc:catalogNumber | dwc:CatalogNumberNumeric | dwc:degreeOfEstablishment | dwc:establishmentMeans | dwc:georeferenceVerificationStatus | dwc:individualCount | dwc:individualID | dwc:lifeStage | dwc:occurrenceAttributes | dwc:occurrenceDetails | dwc:occurrenceID | dwc:occurrenceRemarks | dwc:occurrenceStatus | dwc:organismQuantity | dwc:organismQuantityType | dwc:otherCatalogNumbers | dwc:pathway | dwc:recordedBy | dwc:recordedByID | dwc:recordNumber | dwc:reproductiveCondition | dwc:sex | dwc:vitality

Organism


dwc:associatedOrganisms | dwc:organismID | dwc:organismName | dwc:organismRemarks | dwc:organismScope | dwc:previousIdentifications

Material Entity

dwc:associatedSequences | dwc:disposition | dwc:materialEntityID | dwc:materialEntityRemarks | dwc:preparations | dwc:verbatimLabel

Material Sample

Event



dwc:day | dwc:EarliestDateCollected | dwc:endDayOfYear | dwc:EndTimeOfDay | dwc:eventAttributes | dwc:eventDate | dwc:eventID | dwc:eventRemarks | dwc:eventTime | dwc:eventType | dwc:fieldNotes | dwc:fieldNumber | dwc:habitat | dwc:LatestDateCollected | dwc:month | dwc:parentEventID | dwc:sampleSizeUnit | dwc:sampleSizeValue | dwc:samplingEffort | dwc:samplingProtocol | dwc:startDayOfYear | dwc:StartTimeOfDay | dwc:verbatimEventDate | dwc:year

Location


dwc:continent | dwc:coordinatePrecision | dwc:coordinateUncertaintyInMeters | dwc:country | dwc:countryCode | dwc:county | dwc:decimalLatitude | dwc:decimalLongitude | dwc:footprintSpatialFit | dwc:footprintSRS | dwc:footprintWKT | dwc:geodeticDatum | dwc:georeferencedBy | dwc:georeferencedDate | dwc:georeferenceProtocol | dwc:georeferenceRemarks | dwc:georeferenceSources | dwc:higherGeography | dwc:higherGeographyID | dwc:island | dwc:islandGroup | dwc:locality | dwc:locationAccordingTo | dwc:locationAttributes | dwc:locationID | dwc:locationRemarks | dwc:maximumDepthInMeters | dwc:maximumDistanceAboveSurfaceInMeters | dwc:maximumElevationInMeters | dwc:minimumDepthInMeters | dwc:minimumDistanceAboveSurfaceInMeters | dwc:minimumElevationInMeters | dwc:municipality | dwc:pointRadiusSpatialFit | dwc:SamplingLocationID | dwc:SamplingLocationRemarks | dwc:stateProvince | dwc:verbatimCoordinates | dwc:verbatimCoordinateSystem | dwc:verbatimDepth | dwc:verbatimElevation | dwc:verbatimLatitude | dwc:verbatimLocality | dwc:verbatimLongitude | dwc:verbatimSRS | dwc:verticalDatum | dwc:waterBody

Geological Context

dwc:bed | dwc:earliestAgeOrLowestStage | dwc:earliestEonOrLowestEonothem | dwc:earliestEpochOrLowestSeries | dwc:earliestEraOrLowestErathem | dwc:earliestPeriodOrLowestSystem | dwc:formation | dwc:geologicalContextID | dwc:group | dwc:highestBiostratigraphicZone | dwc:latestAgeOrHighestStage | dwc:latestEonOrHighestEonothem | dwc:latestEpochOrHighestSeries | dwc:latestEraOrHighestErathem | dwc:latestPeriodOrHighestSystem | dwc:lithostratigraphicTerms | dwc:lowestBiostratigraphicZone | dwc:member

Identification


dwc:dateIdentified | dwc:identificationAttributes | dwc:identificationID | dwc:identificationQualifier | dwc:identificationReferences | dwc:identificationRemarks | dwc:identificationVerificationStatus | dwc:identifiedBy | dwc:identifiedByID | dwc:PreviousIdentifications | dwc:typeStatus | dwc:verbatimIdentification

Taxon


dwc:acceptedNameUsage | dwc:acceptedNameUsageID | dwc:acceptedScientificName | dwc:acceptedScientificNameID | dwc:AcceptedTaxon | dwc:AcceptedTaxonID | dwc:acceptedTaxonID | dwc:acceptedTaxonName | dwc:acceptedTaxonNameID | dwc:basionym | dwc:basionymID | dwc:binomial | dwc:class | dwc:cultivarEpithet | dwc:family | dwc:genericName | dwc:genus | dwc:higherClassification | dwc:HigherTaxon | dwc:higherTaxonconceptID | dwc:HigherTaxonID | dwc:higherTaxonName | dwc:higherTaxonNameID | dwc:infragenericEpithet | dwc:infraspecificEpithet | dwc:kingdom | dwc:nameAccordingTo | dwc:nameAccordingToID | dwc:namePublicationID | dwc:namePublishedIn | dwc:namePublishedInID | dwc:namePublishedInYear | dwc:nomenclaturalCode | dwc:nomenclaturalStatus | dwc:order | dwc:originalNameUsage | dwc:originalNameUsageID | dwc:parentNameUsage | dwc:parentNameUsageID | dwc:phylum | dwc:scientificName | dwc:scientificNameAuthorship | dwc:scientificNameID | dwc:scientificNameRank | dwc:specificEpithet | dwc:subfamily | dwc:subgenus | dwc:subtribe | dwc:superfamily | dwc:taxonAccordingTo | dwc:taxonAttributes | dwc:taxonConceptID | dwc:TaxonID | dwc:taxonID | dwc:taxonNameID | dwc:taxonomicStatus | dwc:taxonRank | dwc:taxonRemarks | dwc:tribe | dwc:verbatimScientificNameRank | dwc:verbatimTaxonRank | dwc:vernacularName

Measurement or Fact

dwc:measurementAccuracy | dwc:measurementDeterminedBy | dwc:measurementDeterminedDate | dwc:measurementID | dwc:measurementMethod | dwc:measurementRemarks | dwc:measurementType | dwc:measurementUnit | dwc:measurementValue | dwc:parentMeasurementID

Resource Relationship

dwc:RelatedBasisOfRecord | dwc:relatedResourceID | dwc:relatedResourceType | dwc:relationshipAccordingTo | dwc:relationshipEstablishedDate | dwc:relationshipOfResource | dwc:relationshipOfResourceID | dwc:relationshipRemarks | dwc:resourceID | dwc:resourceRelationshipID

IRI-value terms

dwciri:behavior | dwciri:caste | dwciri:dataGeneralizations | dwciri:degreeOfEstablishment | dwciri:disposition | dwciri:earliestGeochronologicalEra | dwciri:establishmentMeans | dwciri:eventType | dwciri:fieldNotes | dwciri:fieldNumber | dwciri:footprintSRS | dwciri:footprintWKT | dwciri:fromLithostratigraphicUnit | dwciri:geodeticDatum | dwciri:georeferencedBy | dwciri:georeferenceProtocol | dwciri:georeferenceSources | dwciri:georeferenceVerificationStatus | dwciri:habitat | dwciri:identificationQualifier | dwciri:identificationVerificationStatus | dwciri:identifiedBy | dwciri:inCollection | dwciri:inDataset | dwciri:inDescribedPlace | dwciri:informationWithheld | dwciri:latestGeochronologicalEra | dwciri:lifeStage | dwciri:locationAccordingTo | dwciri:measurementDeterminedBy | dwciri:measurementMethod | dwciri:measurementType | dwciri:measurementUnit | dwciri:measurementValue | dwciri:occurrenceStatus | dwciri:organismQuantityType | dwciri:pathway | dwciri:preparations | dwciri:recordedBy | dwciri:recordNumber | dwciri:reproductiveCondition | dwciri:sampleSizeUnit | dwciri:samplingProtocol | dwciri:sex | dwciri:toTaxon | dwciri:typeStatus | dwciri:verbatimCoordinateSystem | dwciri:verbatimSRS | dwciri:verticalDatum | dwciri:vitality

pbuttigieg commented 6 days ago

@pieterprovoost here's a first triage from me.


Add to embedding in ODIS records (for taxonomic levels, @pieterprovoost noted these may only go down to order or family so as not to flood the metadata; the rest would be available in the OBIS records):

Map to properties

Spatial mapping to GeoJSON and/or spatial properties in their stanzas; this may be a bit involved, but worth it (I assume many of these are already mapped):

pieterprovoost commented 6 days ago

Most of these fields will have high cardinality, how do you envision we handle this in metadata documents?

pbuttigieg commented 6 days ago

> Most of these fields will have high cardinality, how do you envision we handle this in metadata documents?

In the sense that a Dataset can be about potentially thousands of Taxa? Aggregation at higher ranks I think.
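
As a rough sketch of aggregation at higher ranks, assuming occurrence records are available as dictionaries keyed by DwC term names: the sample records and the function names `aggregate_rank` and `taxa_for_dataset` are hypothetical, and emitting schema.org Taxon nodes is only one possible output shape.

```python
from collections import Counter

# Hypothetical occurrence records keyed by DwC term names
occurrences = [
    {"scientificName": "Gadus morhua", "family": "Gadidae", "order": "Gadiformes"},
    {"scientificName": "Melanogrammus aeglefinus", "family": "Gadidae", "order": "Gadiformes"},
    {"scientificName": "Clupea harengus", "family": "Clupeidae", "order": "Clupeiformes"},
]


def aggregate_rank(records, rank):
    """Count occurrences per value of one higher rank, skipping records that lack it."""
    return Counter(r[rank] for r in records if r.get(rank))


def taxa_for_dataset(records, rank="family"):
    """Build one schema.org Taxon node per distinct value at the chosen rank."""
    return [
        {"@type": "Taxon", "taxonRank": rank, "name": name}
        for name in sorted(aggregate_rank(records, rank))
    ]


print(aggregate_rank(occurrences, "family"))
print(taxa_for_dataset(occurrences, rank="order"))
```

Thousands of species-level names collapse into a much shorter list this way, at the cost of pushing species-level discovery back to the OBIS records.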

For Events, we could take the extreme values in space and time and create an inclusive envelope (a bounding box plus a date range) to push to the Dataset metadata.
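
One way this could be sketched, assuming each Event carries dwc:decimalLatitude, dwc:decimalLongitude and a single-day ISO 8601 dwc:eventDate. The sample events and the function name `dataset_extent` are hypothetical, and the "south west north east" ordering of the box string is an assumption about the convention the consuming system expects:

```python
# Hypothetical Event records keyed by DwC term names
events = [
    {"decimalLatitude": 54.1, "decimalLongitude": 7.9, "eventDate": "2019-06-03"},
    {"decimalLatitude": 58.7, "decimalLongitude": 3.2, "eventDate": "2021-09-15"},
    {"decimalLatitude": 56.0, "decimalLongitude": 5.5, "eventDate": "2020-01-20"},
]


def dataset_extent(events):
    """Collapse many Events into one bounding box and one date interval."""
    lats = [e["decimalLatitude"] for e in events]
    lons = [e["decimalLongitude"] for e in events]
    # Full ISO 8601 dates (YYYY-MM-DD) sort correctly as plain strings.
    dates = sorted(e["eventDate"] for e in events)
    # Naive min/max on longitude is wrong for datasets crossing the antimeridian.
    box = f"{min(lats)} {min(lons)} {max(lats)} {max(lons)}"
    return {
        "spatialCoverage": {
            "@type": "Place",
            "geo": {"@type": "GeoShape", "box": box},
        },
        "temporalCoverage": f"{dates[0]}/{dates[-1]}",
    }


print(dataset_extent(events))
```

Real dwc:eventDate values may also be intervals or partial dates, which this sketch does not handle.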

For things that hang off of Occurrence, like dwc:habitat, that's trickier - arrays in Dataset properties like 'about' come to mind, but this may be one too many jumps.

If OBIS eventually releases truncated metadata about the other types (Events, Taxa, maybe Occurrences for specific species [e.g. of concern, keystones, invasives]) this would of course be easier from the Dataset metadata (via @id referencing). Maybe that can wait for that stage.
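
To illustrate what @id referencing from the Dataset metadata could look like, here is a small sketch in which a Dataset points at a separately described Taxon node. The example.org IRIs, the node contents, and the helper `resolve` are all hypothetical; no such OBIS records are assumed to exist yet:

```python
import json

# Hypothetical IRIs, for illustration only
DATASET_ID = "https://example.org/dataset/abc"
TAXON_ID = "https://example.org/taxon/gadidae"

graph = {
    "@context": {"@vocab": "https://schema.org/"},
    "@graph": [
        {
            "@id": DATASET_ID,
            "@type": "Dataset",
            "name": "Example dataset",
            # The Dataset only carries a pointer; the detail lives in the Taxon node.
            "about": [{"@id": TAXON_ID}],
        },
        {"@id": TAXON_ID, "@type": "Taxon", "name": "Gadidae", "taxonRank": "family"},
    ],
}


def resolve(doc, node_id):
    """Return the node in @graph with the given @id, or None if it is absent."""
    return next((n for n in doc["@graph"] if n.get("@id") == node_id), None)


dataset = resolve(graph, DATASET_ID)
for ref in dataset["about"]:
    print(json.dumps(resolve(graph, ref["@id"])))
```

The Dataset record stays small because it carries only references, while the referenced nodes can be published and updated on their own schedule.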

These are fields that I think would be useful for ODIS-level discovery of OBIS resources - if adding them is prohibitively complex or would put prohibitive demands on the systems involved, we can mark them for later consideration.

Could you check mark the terms above that you think are the most feasible to add now? We can discuss how to add some high-value ones that are harder in a meeting perhaps.

pbuttigieg commented 6 days ago

And I'm quite sure that some of the DwC value syntax will conflict with / OGC constraints - that's important to note, even if those properties don't make it into the JSON-LD/ products. Those are a basis to trigger later alignment of the standards themselves, hopefully.