Implement data format versioning (XCA, loader, B/E, F/E)

mwinokan commented 2 days ago

@mwinokan and @phraenquex discussed that the fluid data formats for XCA, loader, and f/e are causing headaches for users and those helping to debug alignments.

The following solution is proposed:

The data format is tagged across the pipeline. What we have now is v1, and anything that follows must increment the version and document any breaking changes.
A new XCA prep command creates a new working directory within which upload_1 and subsequent subdirectories will reside. The name of the directory will also indicate the data format.
Any collation or alignment will happen in the new directory and, crucially, will not proceed if the data format version differs.
In the case that the version differs and a new upload_1 is needed, we should encourage users to upload the new tarball to a separate TAS so that snapshots of the old target are not broken. The data format version could be stored on the Target model.
On LHS upload there should be an option to link to an existing target for superseding. This will allow the f/e to show a warning on the old target indicating that new data is available elsewhere.

@phraenquex adds: the user should not have to dig to find out if the format is compatible or not

mwinokan commented 2 days ago

@tdudgeon agrees that this sounds sensible but adds that database migrations will make the data in Fragalysis compatible between versions. Identifying and documenting the breaking changes in XCA and the loader is still key.

@tdudgeon suggests an is_compatible_with function, maintained in XCA, to work out whether incremental uploads can proceed with legacy-formatted data, without needing to complete alignment and upload to spot issues.

Determining whether changes are breaking will be non-trivial, but unit testing (#1588) will help.

@ConorFWild suggests having a table in the source code documenting when breaking changes occur. The collator can then compare the version in preceding uploads against the table to see whether they are compatible. This empirical approach means the table can easily be patched if new breaking changes are identified that didn't appear in test data.
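A minimal sketch of how the table-based check could look, assuming versions are written as major.minor strings. The table entries below are placeholders, not real XCA breaking changes, and the signature of is_compatible_with is an assumption (only the function name comes from this thread).

```python
from typing import Dict, Tuple

Version = Tuple[int, int]

# Hypothetical table: version that introduced a breaking change -> short note.
# These entries are placeholders for illustration only.
BREAKING_CHANGES: Dict[Version, str] = {
    (1, 2): "placeholder: breaking change identified after release",
    (2, 0): "placeholder: planned format change",
}


def parse_version(text: str) -> Version:
    """Turn a 'major.minor' string into a comparable (major, minor) tuple."""
    major, minor = text.split(".")
    return int(major), int(minor)


def is_compatible_with(previous: str, current: str) -> bool:
    """Return True if no tabulated breaking change separates the two versions."""
    old, new = sorted((parse_version(previous), parse_version(current)))
    # A breaking change at version v affects any pair with old < v <= new.
    return not any(old < v <= new for v in BREAKING_CHANGES)
```

With this layout, patching in a newly discovered breaking change is a one-line addition to the table, and the check needs no other changes.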

phraenquex commented 2 days ago

XCA needs to create an upload directory hierarchy:

 uploads / upload-dfv1 / upload_1
                       / upload_2
                       / upload_3
         / upload-dfv2 / upload_1
                       / upload_2
         / upload-dfv3 / upload_1
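
For illustration, the hierarchy above could be built with a small path helper like the one below. The function names and the integer data format version are assumptions, not existing XCA code.

```python
from pathlib import Path


def upload_dir(root: Path, data_format_version: int, upload_number: int) -> Path:
    """Path of one upload, e.g. uploads/upload-dfv2/upload_1 under root."""
    return (
        root
        / "uploads"
        / f"upload-dfv{data_format_version}"
        / f"upload_{upload_number}"
    )


def next_upload_dir(root: Path, data_format_version: int) -> Path:
    """Path of the next unused upload_N directory for the given format version."""
    parent = root / "uploads" / f"upload-dfv{data_format_version}"
    numbers = []
    if parent.is_dir():
        for entry in parent.glob("upload_*"):
            suffix = entry.name.split("_", 1)[1]
            # Ignore anything that is not a directory named upload_<integer>.
            if entry.is_dir() and suffix.isdigit():
                numbers.append(int(suffix))
    return upload_dir(root, data_format_version, max(numbers, default=0) + 1)
```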

tdudgeon commented 3 hours ago

Initial commit implementing basic data model versioning in collator. https://github.com/xchem/xchem-align/commit/c263cc7eab253ac56e89a51519321475325b4e58

There is a baked-in major.minor version number (currently 1.0). Minor version number differences are treated as OK; major version number differences are errors and the collator fails.
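
A rough sketch of that rule, for readers who have not looked at the commit. This is not the code from the commit; the constant, exception, and function names are made up for illustration.

```python
# Value quoted in this thread; the constant name is hypothetical.
DATA_FORMAT_VERSION = "1.0"


class DataFormatVersionError(Exception):
    """Raised when a previous upload has a different major version."""


def major_of(version: str) -> int:
    """Extract the major part of a 'major.minor' version string."""
    return int(version.split(".")[0])


def check_version(previous: str, current: str = DATA_FORMAT_VERSION) -> None:
    """Accept minor differences, fail on a major version difference."""
    if major_of(previous) != major_of(current):
        raise DataFormatVersionError(
            f"previous upload has data format {previous}, "
            f"which is incompatible with {current}"
        )
```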

This has been deployed to the XCA staging environment.

Some aspects to consider:

Currently, previous uploads without a version have to be accepted, as they might or might not be compatible, so a warning is issued. Once we change XCA so that we can be sure that uploads without a version are incompatible, we can change that to an error.
We might want to synchronise the data model version with the GitHub tags. This needs a bit of thought.

Still to be done is any automagic to migrate the directory structure. The proposed directory structure is almost certainly not what users currently have, so I'm not sure whether we need to make them update it to be compatible, or try to migrate automatically (which might be complex and unreliable).

I propose to create a new migrate tool that does any migrations of the directory structure. XCA would use the directory named upload-current. If the data model version changes (e.g. from 1.3 to 2.0), that dir is renamed to upload-v1.3, a new upload-current dir is created, and the config.yaml and assemblies.yaml files are copied to the new dir.

This way, when the data model major version changes, the user will be told to run the migrate tool (the command to run is shown and can be C&Ped), so that the user makes a conscious decision to do this (avoiding the risk of automatically screwing things up). The user will also be told what has happened and that they might need to update config.yaml and assemblies.yaml. The collator command they need to run is still the same as before, as it still uses the upload-current dir.
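
To make the proposal concrete, the migrate step could look roughly like this. It is a sketch only: the function name and arguments are assumptions, and the real tool may need to handle more cases.

```python
import shutil
from pathlib import Path

# Files carried over to the new directory, as named in the proposal.
CARRIED_OVER = ("config.yaml", "assemblies.yaml")


def migrate(base_dir: Path, old_version: str) -> Path:
    """Archive upload-current as upload-v<old_version> and start a fresh one.

    Returns the path of the archived directory.
    """
    current = base_dir / "upload-current"
    archived = base_dir / f"upload-v{old_version}"
    if not current.is_dir():
        raise FileNotFoundError(f"{current} does not exist")
    if archived.exists():
        # Refuse to overwrite an earlier archive.
        raise FileExistsError(f"{archived} already exists")

    current.rename(archived)
    current.mkdir()
    for name in CARRIED_OVER:
        source = archived / name
        if source.is_file():
            shutil.copy2(source, current / name)
    return archived
```

Because the function refuses to run when the archive directory already exists, running it twice by accident should not clobber earlier uploads.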

m2ms / fragalysis-frontend

Implement data format versioning (XCA, loader, B/E, F/E) #1592