opencadc / CAOM

Common Archive Observation Model
Creative Commons Attribution Share Alike 4.0 International
1 stars 2 forks source link

Observation.observationID and Plane.productID are too restrictive #3

Open pdowler opened 2 months ago

pdowler commented 2 months ago

In the code (java and python) these strings are restricted to being "valid path components".

These fields are used to generate several URIs:

ObservationURI of the form caom:{collection}/{observationID} for use as a reference in DerivedObservation.members

PlaneURI of the form caom:{collection}/{observationID}/{productID} for use as a reference in Plane.provenance.inputs

It is likely that Plane.creatorID is also assigned by using these values.

Although not part of the model, the CADC implementation (at least) uses these fields to generate Plane.publisherID that is the primary external reference to a Plane (product) and used as the input ID by the caom DataLink service.

pdowler commented 2 months ago

The restriction may be too limiting for users who have existing identifiers with more structure (eg. multiple path components) that they need to capture. This could in principle be an issue for collection names (eg Survey/DRi vs Survey/DRj).

see opencadc/CAOM#12

The purpose of the restriction was to be able to enable code to convert the fields to a URI and parse it back to the individual field values. That is currently possible because ObservationURI has exactly 2 components and PlaneURI has exactly 3 components. It would not be possible to parse and extract if components but the productID can have multiple path components.

Current code only uses the restrictions to implement validation of members and inputs and to assign consistent values in the observation and plane tables to support joins via these relationships. It would feasible to lift the restriction and make "valid URI content" be a metadata curation issue.

pdowler commented 2 months ago

Currently Observation.collection, Observation.observationID, and Plane.productID are the fields in the model. It would be more explicit to drop those and have the model include the URIs directly:

Observation.uri 
Plane.uri

I would like to retain the basic form (caom:{collection}/...) so the scheme means "CAOM entity" and a prefix on the Observation.uri is a namespace that can be used to reference collections (usage: permissions, metadata-sync). Being able to reliably extract the collection name from these URIs also means that data providers can consistently generate their own publisherID (see below).

Plane.creatorID already exists and should be an ivorn (ivo://{auth}/{collection}?...)

Data providers still need to inject their own Plane.publisherID (same as creatorID for original publisher) and that needs to be clarified/documented (but probably in a CAOM+TAP standard). In the current style of usage, a data provider would register a "data collection" in the IVOA registry with a resource identifier like ivo://{authority}/{collection} eg ivo://cadc.nrc.ca/CFHT. From there, a publisherID (of some "data" from that collection can be created by appending a query string with the logical Plane identifier. Current practice at CADC is to construct the publisherID as ivo://{authority}/{collection}?{Observation.observationID}/{Plane.productID}. As long as the collection name can be unambiguously extracted from the ObservationURI (and PlaneURI) then this is easy to do. More structure (path components in the observationID and/or productID) would be OK - those would become opaque strings that could be read (by a human) but not parsed by any generic code.

pdowler commented 2 months ago

For validation purposes, it would be good to require that Observation.uri plus a separator (/) character is a prefix of Plane.uri. Anything else is probably a mistake (caused by a bug).

pdowler commented 4 weeks ago

current draft solution:

keep Observation.collection replace Observation.observationID with Observation.uri, existing usage of caom:{collection}/{observationID} is OK replace Plane.productID with Plane.uri, existing usage like caom:{collection}/{observationID}/{productID} is OK remove Plane.creatorID (redundant)

DerivedObservation.members references use Observation.uri values Plane.provenance.inputs references use Plane.uri values

Encourage/require "local" reference style using the caom URI scheme as above?

The publisherID (of a Plane) is not part of the model: it is a concept for the relational mapping and implementors need to generate their own values. It is probably of the form ivo://{authority}/{collection}?{observationID}/{productID}. Code to generate publisherID values from Plane.uri and Observation.collection is feasible if the latter is limited to "single-path-component" like it is in CAOM-2.4; restrictions on number of path components in the rest of the URI could be removed

pdowler commented 4 weeks ago

This issue addresses the topic in #12

pdowler commented 4 weeks ago

With respect to one point mentioned in #12: spaces in identifiers that are likely to be used in APIs are quite dangerous, which is the primary reason for the restrictions in CAOM-2.4 and earlier. Even with the above changes, using spaces in the Observation.uri or the Plane.uri are problematic at best because space needs to be encoded/decoded carefully just to store the value in a URI.

used in APIs: in http params you should encode/decode anyway so it is fine, but if the URI or a prtion of it goes into the path then of a REST API call then it's a mess.

There is already considerable use of + in the names of things in astronomy and that includes these URIs; + does not need to be encoded to use it in a URI. And someone thought it was a good idea to use + as an alternate encoding for space in URL-encoding so adding any encoding/decoding requirement to the model/data structure looks like a nightmare.