opencadc / CAOM

Common Archive Observation Model
Creative Commons Attribution Share Alike 4.0 International
1 stars 2 forks source link

python module validation of observation_id field - too stringent? #12

Closed alaity47 closed 4 weeks ago

alaity47 commented 1 year ago

Several of our collections at IRSA have observationid values that include spaces or slashes, and we're running into the following error trying to export them with the python caom2 library:

invalid observation_id: may not contain space ( ), slash (/), escape (\), or percent (%)

I reached out to @pdowler, who said this is to ensure that observation_id does not contain any characters aren't URI-safe, and suggested I open an issue here for discussion.

We don't use that field for URI population and haven't been able to find any documentation indicating that the observation_id string has any restrictions.

Is the caom2 module possibly being too stringent in how it validates this column?

(FYI we use CAOM 2.3, and the python module caom2 2.3.8.4)

pdowler commented 1 year ago

see https://www.opencadc.org/caom2/#ObservationURI and https://www.opencadc.org/caom2/#PlaneURI

Those two types specify the structure of those URIs, but since they do so in reference to other fields they only loosely imply that (i) those field values must be valid to use in URIs and (ii) additional path separators (/) are not allowed.

This is admittedly a poorly documented part of the model. Since URIs are used internally and also services are expected to be able to provide ivoa publisher ID values based on these same strings, the restriction on chars that cannot be used in URIs (eg space) is pretty much impossible to remove.

pdowler commented 1 year ago

Since ObservationURI contains {collection} and {observationID} (one /) and PlaneURI contains that plus {productID}, the URIs can currently be validated based on number of path components to make sure s/w is using the right one in the right place... we could think about allowing additional separators (/) but that would weaken the validation of those two types of URI.

pdowler commented 1 year ago

@alaity47 Can you provide some concrete use case/example of observationID and/or productID values? Where they come from, who decides what those strings are, etc.

alaity47 commented 1 year ago

The format/content of observationid for us varies by mission. We only have a couple of collections where a space or a slash can be in there, and it's because we're either using part or all of a filename path, or some kind of target/object name.

Some examples:

allsky/970607n/s043
BD+62 1644
HR 7042

I completely understand enforcing these restrictions on URIs, but the way we've implemented things here there's no direct relationship between observation_id and any URIs. To me the documentation reads as if that field can be any string (https://www.opencadc.org/caom2/2.3/#Observation.observationID)

dr-rodriguez commented 1 year ago

At MAST, the observation_id is typically the "fileroot" of the primary science file. For example, if the file is ipppssoot_drz.fits, the observation_id would be ipppssoot. I know we use URI parsing logic for the artifact URI and would not be surprised if we also have some for the observation URI. Many times that parsing logic is very simple and just splits by "/" so having additional ones would likely break things on our end. Regardless, because our observation IDs are derived from filenames, we generally have the restriction that they would be valid filenames in the first place and so "/" would not be allowed. Spaces in principle should have no issues, but for us we use underscore and dashes to split up parts of a fileroot (eg, hlsp_k2sc_k2_llc_211804118-c05_kepler_v2lc). Could you consider replacing / or spaces with or - to work around these limitations?

pdowler commented 4 weeks ago

closing in favour of #3