opencadc / caom2

Common Archive Observation Model

uniqueness of Artifact.uri #181

Open pdowler opened 2 weeks ago

pdowler commented 2 weeks ago

The Artifact.uri field is a reference to an externally stored object (usually a file, but could be a database table or maybe a directory in VOSpace holding multiple files).

It was intended that Artifact.uri is unique -- no two artifacts have the same URI -- but the implementations did not fully enforce this and usage is not consistent with the original intent.

detail:

pdowler commented 2 weeks ago

examples that abuse intended Artifact.uri uniqueness:

At CADC, many observations include a raw Plane and a calibrated Plane; previews are generated from the calibrated data but are assigned (added as Artifacts) to both Planes... for the raw Plane, the preview does convey what that data could look like (after calibration) but it doesn't convey what it looks like as-is.

So it is a preview, but...

dr-rodriguez commented 2 weeks ago

At MAST, observations from some missions like JWST include artifacts that are shared between multiple observations. These are things like guide star files, association tables, and association pool files. They are auxiliary products that get produced by the pipeline but don't belong to a single observation. As such, we list them for each observation that requires them, as we would otherwise need to have a concept of an orphaned artifact (i.e., a file with no observation/plane).

During our meeting, we discussed the potential of having more complex observations that store all of their artifacts (e.g., all extracted spectra). This would be a significant change, both in our code base and potentially in the conceptual understanding of our users (depending on how we present that). We may explore that at some point, but for now it's not in the cards.

My recommendation is that artifact.uri be unique within an observation, but that multiple instances of the same artifact uri can be shared across observations.
As a side note, we would like to explore having the same UUID for artifacts with the same uri, but doing so may require changes on our database backend: at the moment, even though files may share the same artifact uri, they get different UUIDs.
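
A minimal sketch of the rule recommended above (a URI is unique within an observation but may be shared across observations), written in Python. The `Artifact`, `Plane`, and `Observation` classes are simplified stand-ins invented for illustration, not the actual caom2 library types.

```python
from collections import Counter
from dataclasses import dataclass, field


# Hypothetical, simplified stand-ins for the CAOM entities.
@dataclass
class Artifact:
    uri: str
    product_type: str = "this"


@dataclass
class Plane:
    product_id: str
    artifacts: list = field(default_factory=list)


@dataclass
class Observation:
    observation_id: str
    planes: list = field(default_factory=list)


def duplicate_uris_within(observation):
    """Return the sorted URIs that occur more than once in one observation."""
    counts = Counter(
        artifact.uri
        for plane in observation.planes
        for artifact in plane.artifacts
    )
    return sorted(uri for uri, n in counts.items() if n > 1)
```

Because the check only looks at one observation at a time, a guide star file listed in two different observations passes, while the same URI appearing in two planes of one observation is reported.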

DaftPict commented 2 weeks ago

A MAST example is that of Guide Star files, which are obtained for an entire telescope pointing (HST & JWST) and so are associated with multiple observations. Each observation is processed individually into the CAOM XML file, so each GS file becomes a non-unique artifact.uri in every observation. They do, however, have unique UUIDs.

pdowler commented 2 weeks ago

The "shared preview" usage in CADC would violate "unique within an observation". If preview is supposed to be a quick visual way to "examine the content" then that usage seems ok; if it was restricted to "examine the quality" then maybe the shared preview is more questionable, but I don't think we can feasibly limit the meaning of preview like that, especially when we define Artifact.productType to be the terms of the DataLink semantics vocabulary.

pdowler commented 2 weeks ago

I am thinking in this direction:

Artifact.uri should be globally unique for productType == this: a primary file should only belong to a single plane in a single observation. All other productType(s) are references and URIs can be used in multiple planes/observations.

Obviously, having two of the same artifact in the same plane is still not allowed (surely a bug) and the current code prevents it. Code validation could check for duplicate "this" artifacts in an observation, but the check vs other observations would require a unique index in a database. This is implementable in PostgreSQL because one can define an index with a "where" clause, eg

create unique index artifact_this on caom2.Artifact(uri) where productType='this'

but such a complex rule would potentially be problematic in other DB servers.
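
As one data point, SQLite also accepts a "where" clause on an index, so the behaviour of the rule can be tried out with Python's built-in `sqlite3` module. This is only a sketch: the two-column table is a cut-down stand-in for the real caom2.Artifact table (no schema prefix, no other columns), `try_insert` is a helper made up for the demo, and the URIs used with it are invented. It says nothing about how other DB servers would handle the rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Cut-down stand-in for the Artifact table (assumption: only two columns).
conn.execute(
    "create table Artifact (uri text not null, productType text not null)"
)

# Partial unique index: uniqueness is enforced only for 'this' artifacts.
conn.execute(
    "create unique index artifact_this on Artifact(uri) "
    "where productType = 'this'"
)


def try_insert(uri, product_type):
    """Insert one row; return False if the unique index rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "insert into Artifact (uri, productType) values (?, ?)",
                (uri, product_type),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

With this index, a second 'this' artifact with an existing URI is rejected, while the same preview or auxiliary URI can be inserted any number of times.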

pdowler commented 2 weeks ago

aside about Artifact.id (uuid): these denote a single entity (row in database) so they have to be different due to the type of arrow (composition) in the Plane->Artifact relationship. To have multiple planes refer to the same Artifact, the relation would have to be a reference and Artifact(s) would not be part of the Observation: they would be separate entities that have to be persisted and managed independently. I think that would really break the core concept that an Observation is a single self-contained entity (and fairly denormalised to accomplish that) that can be curated and synced.

I think being more clear that the URI in Artifact is a reference to an external resource is sufficient. Of course, if you have multiple artifacts with the same URI you have more complex work to maintain the other Artifact metadata (contentLength, contentChecksum).
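
To illustrate the extra maintenance work, here is a rough Python sketch of a consistency check over artifacts that share a URI. `ArtifactRecord` and `inconsistent_uris` are names invented for this example; only the contentLength and contentChecksum fields come from the model.

```python
from collections import defaultdict
from dataclasses import dataclass


# Hypothetical flat record holding the Artifact fields relevant here.
@dataclass(frozen=True)
class ArtifactRecord:
    uri: str
    content_length: int
    content_checksum: str


def inconsistent_uris(records):
    """Return sorted URIs whose copies disagree on length or checksum."""
    seen = defaultdict(set)
    for record in records:
        seen[record.uri].add((record.content_length, record.content_checksum))
    return sorted(uri for uri, metadata in seen.items() if len(metadata) > 1)
```

Any URI returned by this check has at least two artifacts whose stored metadata no longer agree, which suggests that one of them was not updated when the external file changed.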