opencadc / caom2

Common Archive Observation Model
GNU Affero General Public License v3.0

Observation.observationID and Plane.productID are too restrictive #180

Open pdowler opened 2 weeks ago

pdowler commented 2 weeks ago

In the code (Java and Python), these strings are restricted to being "valid path components".

These fields are used to generate several URIs:

ObservationURI of the form caom:{collection}/{observationID} for use as a reference in DerivedObservation.members

PlaneURI of the form caom:{collection}/{observationID}/{productID} for use as a reference in Plane.provenance.inputs
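As a minimal sketch of the two forms above (not the actual caom2 library code; the function names and the example values are hypothetical):

```python
def observation_uri(collection: str, observation_id: str) -> str:
    """Build a URI of the form caom:{collection}/{observationID}."""
    return f"caom:{collection}/{observation_id}"


def plane_uri(collection: str, observation_id: str, product_id: str) -> str:
    """Build a URI of the form caom:{collection}/{observationID}/{productID}."""
    return f"{observation_uri(collection, observation_id)}/{product_id}"


# Hypothetical identifiers, for illustration only.
print(observation_uri("CFHT", "12345"))      # caom:CFHT/12345
print(plane_uri("CFHT", "12345", "12345p"))  # caom:CFHT/12345/12345p
```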

It is likely that Plane.creatorID is also assigned by using these values.

Although not part of the model, the CADC implementation (at least) uses these fields to generate Plane.publisherID, which is the primary external reference to a Plane (product) and is used as the input ID by the caom DataLink service.

pdowler commented 2 weeks ago

The restriction may be too limiting for users who have existing identifiers with more structure (e.g. multiple path components) that they need to capture. This could in principle also be an issue for collection names (e.g. Survey/DRi vs Survey/DRj).

see #170

The purpose of the restriction was to enable code to convert the fields to a URI and parse it back into the individual field values. That is currently possible because ObservationURI has exactly 2 components and PlaneURI has exactly 3 components. It would not be possible to parse the URI and extract the field values if any component other than the productID could have multiple path components.
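To illustrate the round trip, here is a rough sketch of a parser that relies on the "exactly 3 components" rule. This is an assumption about how such parsing could look, not the existing Java or Python implementation; `parse_plane_uri` and the example identifiers are hypothetical.

```python
from typing import Tuple


def parse_plane_uri(uri: str) -> Tuple[str, str, str]:
    """Split caom:{collection}/{observationID}/{productID} into its fields.

    Only works if every field is a single path component.
    """
    scheme, sep, path = uri.partition(":")
    if scheme != "caom" or not sep:
        raise ValueError(f"not a caom URI: {uri}")
    parts = path.split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected exactly 3 path components: {uri}")
    return parts[0], parts[1], parts[2]


# Round trip works with single-component fields.
print(parse_plane_uri("caom:CFHT/12345/12345p"))

# With extra structure there is no way to tell which field owns which "/".
try:
    parse_plane_uri("caom:Survey/DR1/obs1/prod1")
except ValueError as err:
    print("ambiguous:", err)
```

In the second call a generic parser cannot know whether the collection is Survey or Survey/DR1, which is the ambiguity the restriction was meant to avoid.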

Current code only uses the restrictions to implement validation of members and inputs and to assign consistent values in the observation and plane tables to support joins via these relationships. It would be feasible to lift the restriction and make "valid URI content" a metadata curation issue.

pdowler commented 2 weeks ago

Currently Observation.collection, Observation.observationID, and Plane.productID are the fields in the model. It would be more explicit to drop those and have the model include the URIs directly:

Observation.uri 
Plane.uri

I would like to retain the basic form (caom:{collection}/...) so the scheme means "CAOM entity" and a prefix on the Observation.uri is a namespace that can be used to reference collections (usage: permissions, metadata-sync). Being able to reliably extract the collection name from these URIs also means that data providers can consistently generate their own publisherID (see below).

Plane.creatorID already exists and should be an ivorn (ivo://{auth}/{collection}?...)

Data providers still need to inject their own Plane.publisherID (same as creatorID for the original publisher) and that needs to be clarified/documented (but probably in a CAOM+TAP standard). In the current style of usage, a data provider would register a "data collection" in the IVOA registry with a resource identifier like ivo://{authority}/{collection}, e.g. ivo://cadc.nrc.ca/CFHT. From there, a publisherID (of some "data" from that collection) can be created by appending a query string with the logical Plane identifier. Current practice at CADC is to construct the publisherID as ivo://{authority}/{collection}?{Observation.observationID}/{Plane.productID}. As long as the collection name can be unambiguously extracted from the ObservationURI (and PlaneURI), this is easy to do. More structure (path components in the observationID and/or productID) would be OK: those would become opaque strings that could be read (by a human) but not parsed by any generic code.
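A minimal sketch of that construction, assuming the collection is always the first path component of the caom URI and everything after it is treated as opaque. The function name `publisher_id` and the example identifiers are hypothetical; this is not the CADC code.

```python
def publisher_id(authority: str, plane_uri: str) -> str:
    """Derive ivo://{authority}/{collection}?{rest} from a caom plane URI.

    Assumption: the collection is the first path component; the remainder
    (observationID/productID, possibly with more structure) is kept as-is.
    """
    scheme, sep, path = plane_uri.partition(":")
    if scheme != "caom" or not sep:
        raise ValueError(f"not a caom URI: {plane_uri}")
    collection, sep, rest = path.partition("/")
    if not collection or not rest:
        raise ValueError(f"cannot extract collection from: {plane_uri}")
    return f"ivo://{authority}/{collection}?{rest}"


print(publisher_id("cadc.nrc.ca", "caom:CFHT/12345/12345p"))
# Extra path components after the collection pass through untouched.
print(publisher_id("cadc.nrc.ca", "caom:CFHT/2024/night1/12345/12345p"))
```

Note that this only works while the collection name itself has no "/" in it; a collection like Survey/DRi would need some other rule to find where the namespace ends.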

pdowler commented 1 week ago

For validation purposes, it would be good to require that Observation.uri plus a separator (/) character is a prefix of Plane.uri. Anything else is probably a mistake (caused by a bug).
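Something like the following would do it (a sketch only; the function name is hypothetical and where the check would live in the library is not decided here):

```python
def is_plane_uri_consistent(observation_uri: str, plane_uri: str) -> bool:
    """Check that Observation.uri + "/" is a proper prefix of Plane.uri."""
    prefix = observation_uri + "/"
    # Require something after the separator so "caom:C/obs/" alone fails.
    return plane_uri.startswith(prefix) and len(plane_uri) > len(prefix)


print(is_plane_uri_consistent("caom:CFHT/12345", "caom:CFHT/12345/12345p"))   # True
print(is_plane_uri_consistent("caom:CFHT/12345", "caom:CFHT/123456/12345p"))  # False
```

Including the separator in the prefix is what stops 12345 from matching 123456 in the second case.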