spacetelescope / rad

Nancy Grace Roman Space Telescope shared attributes for processing and archive
https://rad.readthedocs.io/

Investigate methods for versioning RAD schemas. #347

Closed stscijgbot-rstdms closed 7 months ago

stscijgbot-rstdms commented 9 months ago

Issue RAD-142 was created on JIRA by William Jamieson:

As JWST has gone into actual use, changes to the schemas defining its data models have become necessary to adapt them to the realities of what is required. Often these changes have been backwards incompatible, which creates problems when data files created before the changes are opened or handled with later versions of the code. This greatly complicates working with older data files while using newer versions of the code.

Roman is likely going to encounter similar challenges, necessitating changes to the schemas which will not be compatible with earlier versions of the code. Thus we should investigate methods/strategies for dealing with these issues now, before they become more difficult to overcome. In particular, we should have some sort of versioning plan that enables us to do two things:

1. Be able to at least open existing/older data files with newer versions of the code. This will at least enable users to do some things with those files without having to deal with multiple conflicting versions of the code.

2. Have a mechanism in place to open and "update" older data files to newer versions of those data files whenever possible (or at least to the extent that it is practical).

To address both of these, we need to employ a strategy which realistically enables us to version the schemas and deal with the consequences resulting from the changes.

stscijgbot-rstdms commented 9 months ago

Comment by William Jamieson on JIRA:

Versioning of the RAD schemas, and consequently the data models built to support them, is a process that can quickly spiral into large amounts of interconnected complexity. Thus the strategy for managing schema versions needs to limit the creation of interconnections among versions AND clearly delineate versions from one another.

A cohesive strategy for schema versioning will naturally involve how we handle the data model code moving forward. In particular, roman_datamodels needs some mechanism for opening files made under previous schema versions. Thus schema versioning will be a necessary consideration of any work for RAD-141 (automatic code generation from the schemas).

To start with schema versioning, we first should note some important assumptions that are being made.

1. Once a schema version is "published" (released for production), it for all practical purposes needs to become static. This means that no updates should be made to published schemas unless absolutely necessary.

2. A distinction between "published" and in-development schemas needs to be in place.

3. RAD will retain a static version of all "published" schemas.

In order to limit the possible backwards compatibility issues inherent in assumptions 1 and 3, my first proposal is that RAD should march all the schema versions forward at the same time and not allow any cross referencing between version sets. By this I mean that if, say, the aperture-1.0.0 schema needs any change once it is published, then all the surrounding schemas need to be bumped to a new version at the same time. Moreover, when they are bumped it should be done via a copy. My reasoning here is that if we start mixing versions together, it becomes very easy to accidentally create bugs in opening/validating existing files when alterations are made, unless we are extremely careful in our versioning scheme. Moreover, by having distinct, fully realized sets of versions, if something like RAD-141 is employed we can auto generate wholly independent sets of data model code for each total-version of RAD. By doing it this way we can theoretically ensure that we can at the very least open any existing file, as changes to the data models code will not directly affect any old versions of the schemas (I will detail this further in a little bit). The move to always copying schemas (even when it produces nearly identical schemas) is specifically intended to make it possible to have entirely independent but co-existing mechanisms for opening any file written under a previously published schema.
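To make the "no cross referencing between version sets" rule concrete, here is a minimal sketch of a check that could run over one version set. It assumes each set is available as a mapping from schema URI to parsed schema, and that references are plain "$ref" strings naming a whole schema. The URIs, schema contents, and function names are hypothetical, and references carrying a fragment (for example "#/definitions/...") would need extra handling.

```python
from typing import Any, Iterator


def iter_refs(node: Any) -> Iterator[str]:
    """Yield every "$ref" string found anywhere inside a parsed schema."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "$ref" and isinstance(value, str):
                yield value
            else:
                yield from iter_refs(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_refs(item)


def find_cross_references(version_set: dict[str, dict]) -> list[tuple[str, str]]:
    """Return (schema_id, ref) pairs whose ref is not part of the same version set."""
    known_ids = set(version_set)
    return [
        (schema_id, ref)
        for schema_id, schema in version_set.items()
        for ref in iter_refs(schema)
        if ref not in known_ids
    ]


# Hypothetical, heavily trimmed schemas; the URIs are illustrative only.
BASE = "asdf://example.org/schemas/"

# A self-contained set: every reference points at a schema in the same set.
set_1_1_0 = {
    BASE + "aperture-1.1.0": {"type": "object"},
    BASE + "exposure-1.1.0": {
        "type": "object",
        "properties": {"aperture": {"$ref": BASE + "aperture-1.1.0"}},
    },
}

# A mixed set: exposure-1.1.0 still points at the older aperture-1.0.0.
mixed_set = {
    BASE + "aperture-1.1.0": {"type": "object"},
    BASE + "exposure-1.1.0": {
        "type": "object",
        "properties": {"aperture": {"$ref": BASE + "aperture-1.0.0"}},
    },
}

clean = find_cross_references(set_1_1_0)
leaks = find_cross_references(mixed_set)
```

A check along these lines could run in CI for every published set, so that a copied schema that still points at the previous version is caught before release.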

In an effort to learn from some of the complexities of maintaining the ASDF-standard (and related) schemas, I think RAD should explicitly separate published versions and in-development versions. By this I mean that instead of having every version of every schema next to each other in the same directory structure, I propose altering the structure of RAD so that instead of the schemas being in src/rad/resources/<schemas or manifests>, we organize them as src/rad/resources/<version_directory>/<schemas or manifests>. By doing this it becomes much easier for developers (and end users) to separate out the cohesive set of schemas representing each version of RAD. Note that this is only practical if we employ the notion of entirely independent sets of schemas for each RAD version. Indeed, it might be worth considering dropping the -a.b.c from the end of the schema names and using <version_directory> to indicate the version instead; however, this would differ from how ASDF currently recommends indicating schema versions. In any case, there can always be a current or dev directory holding the schemas currently under development.
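As a rough illustration of the proposed layout, the sketch below builds a throwaway tree shaped like resources/<version_directory>/<schemas or manifests> and enumerates the schema files in each version directory. It assumes the variant where the -a.b.c suffix is dropped from the file names; the version numbers, file names, and the discover_version_sets helper are all hypothetical and not part of RAD.

```python
import tempfile
from pathlib import Path


def discover_version_sets(resources: Path) -> dict[str, list[str]]:
    """Map each version directory name to the sorted schema file names inside it."""
    version_sets = {}
    for version_dir in sorted(p for p in resources.iterdir() if p.is_dir()):
        schema_dir = version_dir / "schemas"
        version_sets[version_dir.name] = sorted(
            path.name for path in schema_dir.glob("*.yaml")
        )
    return version_sets


# Hypothetical contents of each version set, including an in-development one.
example_tree = {
    "1.0.0": ["aperture.yaml"],
    "1.1.0": ["aperture.yaml", "exposure.yaml"],
    "dev": ["aperture.yaml", "exposure.yaml"],
}

# Build a temporary tree that mimics the proposed layout, then inspect it.
with tempfile.TemporaryDirectory() as tmp:
    resources = Path(tmp) / "resources"
    for version, names in example_tree.items():
        for sub in ("schemas", "manifests"):
            (resources / version / sub).mkdir(parents=True)
        for name in names:
            (resources / version / "schemas" / name).write_text("type: object\n")

    layout = discover_version_sets(resources)
```

With one directory per version, the set of schemas belonging to a given RAD release can be read straight off the file system rather than inferred from the suffix on each file.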

My reasoning for being extremely strict about separating RAD versions into whole, independent sets of schemas is tied to the ideas in RAD-141, automatic code generation from schemas. The strict separation of schema sets for each RAD version will enable us to run an automatic code generator on each set of schemas independently. This means we can get an independently functional set of Python code which at least allows for the opening and writing of the data under each set of schemas. As we noted in RAD-141, it is easy to tie the generated models into ASDF, and ASDF, by the nature of how it functions, will correctly open to or write from the correct set of models. Since published schemas should be fixed once they are published, this separation should ensure that data written under old schemas is still easily accessible.

Distinct-separate versioning also lets us use or emulate the functionality of versionedobj, which is a library that adds some functionality for the independent versioning of Python objects. The main functionality it provides is a mechanism to "upgrade" an old object, in our case data, to a "new" object. It does this by allowing developers to check for/add "conversion" methods from one version to another. Moreover, it provides mechanisms for the automatic "chaining" of these conversions so that old versions can be upgraded through multiple intermediate versions to some newer version.

The distinct-separate versioning allows us to provide a single conversion from a previous version to a newer version. In most cases, such a conversion would simply be 1-to-1, as the object data is exactly identical. Only when actual changes are made will we need to write any true conversion utilities. We may be able to do this transparently to users, meaning they might be able to use their existing versions of data without having to reacquire new data files from the archive. Note that there may be cases where no direct upgrade path is possible, in which case we will have to document how to handle these cases.
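The chaining idea can be sketched in plain Python without depending on versionedobj. This is not that library's API; it is only an emulation of the concept, assuming the data can be treated as a dict, that published versions form a simple ordered list, and that a missing conversion between adjacent versions means the data is unchanged. The version numbers, key names, and helper names are hypothetical.

```python
from typing import Callable

Converter = Callable[[dict], dict]

# Published versions in release order (hypothetical version numbers).
VERSIONS = ["1.0.0", "1.1.0", "1.2.0"]

# Only steps that actually change the data need an entry here.
_CONVERTERS: dict[tuple[str, str], Converter] = {}


def converter(old: str, new: str) -> Callable[[Converter], Converter]:
    """Register a function that upgrades data from ``old`` to the adjacent ``new``."""

    def register(func: Converter) -> Converter:
        _CONVERTERS[(old, new)] = func
        return func

    return register


def upgrade(data: dict, start: str, target: str) -> dict:
    """Upgrade ``data`` from ``start`` to ``target`` by chaining adjacent steps."""
    i, j = VERSIONS.index(start), VERSIONS.index(target)
    if i > j:
        raise ValueError(f"cannot downgrade from {start} to {target}")
    result = dict(data)  # shallow copy so the caller's dict is not modified
    for old, new in zip(VERSIONS[i:j], VERSIONS[i + 1 : j + 1]):
        step = _CONVERTERS.get((old, new))
        if step is not None:  # no registered step means a 1-to-1 conversion
            result = step(result)
    return result


@converter("1.1.0", "1.2.0")
def _rename_exposure_time(data: dict) -> dict:
    # Hypothetical change: a key was renamed in 1.2.0.
    updated = dict(data)
    updated["exposure_time"] = updated.pop("exptime")
    return updated


old_data = {"exptime": 30.0, "detector": "WFI01"}
new_data = upgrade(old_data, "1.0.0", "1.2.0")
```

Here the step from 1.0.0 to 1.1.0 has no registered converter, so it passes the data through unchanged, and only the 1.1.0 to 1.2.0 step does real work. A case with no possible upgrade path could be modeled by a converter that raises an error pointing to the relevant documentation.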

Finally, to make things simpler on ourselves, I propose that the version number for each successive set of schemas be the published version of RAD. That is, if we publish RAD 1.1.1, then it should have a published set of schemas under the version 1.1.1.

stscijgbot-rstdms commented 7 months ago

Comment by William Jamieson on JIRA:

Since RAD-141 has been closed, the ideas put forth in this ticket are no longer completely relevant. It appears that separate work has taken place outside this ticket, see https://github.com/spacetelescope/rad/pull/359. So I am closing this one.