spacepy / dbprocessing

Automated processing controller for heliophysics data

Document interface version requirements #62

Open jtniehof opened 3 years ago

jtniehof commented 3 years ago

I mostly want to capture a conversation I had with @balarsen . The interface version is treated specially because it isn't usually auto-incremented. The only thing we have as a hard requirement is that a particular code will always create output of a defined interface version. We were sort of thinking in terms of a code's interface version usually changing if either its input interface or its output interface changes, but we don't have any hard mapping of a code's interface version to the interface version of its input files.

I'm in a situation where Code A makes product X which code B uses to make product Y. Product X's interface version is going to change, as it's getting more variables (and people shouldn't be looking for those new variables from 2.m.n in old files of version 1.o.p). But code B doesn't touch the new variables, so interface 1 vs. interface 2 of product X doesn't matter to it. Thus I'm going to bump interface version of A, output version of A, interface version of X, and nothing else...but there's no real indication that code B interface 1 can handle input product X of interface version 1 or 2. The chain doesn't really treat them differently.

@balarsen , I could have sworn we had a big writeup of how the versioning scheme worked somewhere, but I'm not finding it in your SOC_Processing_Chain_Setup.docx or the ECT SOC manual that I wrote. I know it didn't go in the original ICD.

Proposed enhancement

Document any requirements or recommendations for how interface versions between products and codes relate to each other.

Alternatives

Continue to hold this in my head....

OS, Python version, and dependency version information:

Linux-4.4.0-98-generic-x86_64-with-Ubuntu-16.04-xenial
sys.version_info(major=2, minor=7, micro=12, releaselevel='final', serial=0)
sqlalchemy=1.0.11

Version of dbprocessing

Current master from github (734f37b1bfb3540f5682edd6dbb2e590eb51a3ff)

Closure condition

This issue should be closed when appropriate documentation is merged.

jtniehof commented 3 years ago

I seem to recall the only times interface versions change are:

  1. Code version changes are always manual, and that includes interface versions.
  2. Files created by a code have an interface version that's set by the code.

Otherwise a bump in interface (major) version of an input file results in just a bump in quality (minor) version of the output file.

I went digging and runMe seems to be the critical point where this is done.

If a file's been processed before and there's now a new version of the code, _codeVerChange will increment the file's version based on the code bump. So if the code interface (major) version has bumped, a new version of an existing output file will have an increment in interface (major) version. This increment is _relative to the existing output_interface_version of the code_, so there may be a case that results in TWO bumps. If code 2.x.y made file 1.a.b and then code is updated to 3.0.0 (and its output interface version bumped to 2), the new file would be 3.0.0. That doesn't seem right. (However the _codeVerChange is only called in the event a filename already exists in the database, and 2.0.0 probably wouldn't exist in that case?)

Here's what appears to happen; this is all in the runMe constructor (start line 356):

  1. The output version of a to-be-created file (output_version) is initialized to code_output_interface_version.0.0 where code_output_interface_version is the output_interface_version from the code table for the code that's going to be run. I'm going to call this OIV.
  2. If the force argument to the runMe constructor is True, then return out, stick with existing version number (OIV.0.0). This is different from the --force argument that gets used to explicitly bump the version (that goes into the version_bump data member.) This force argument is pretty much just used by DBRunner.
  3. Begin a loop:
     a. Break out of the loop if the filename built from the current output_version is not in the database.
     b. If version_bump is set, bump that version and continue the loop (return to a).
     c. Call _codeVerChange. If the code has changed since the file was last built, increment the same portion of output_version and zero any "smaller" ones: if the code had an interface bump and output_version is x.y.z, then output_version becomes x+1.0.0. If anything was incremented, go to the next iteration of the loop.
     d. Call _parentsChanged. If there are new parents (e.g. optionals that have shown up), increment the quality version. Otherwise increment the quality version if any parent had an increase in quality version; else increment the revision version if any parent had an increase in revision version. No increment for a parental increment in interface version. Next iteration of the loop (back to a) if anything was incremented.
     e. Continue to the next iteration of the loop. Note this will be an infinite loop if this point is reached.
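The steps above can be sketched as a small loop; every name here is illustrative, not the actual runMe API, and I raise where step 3e would spin forever:

```python
# Minimal sketch of the version-resolution loop described above.
# All names are hypothetical stand-ins for the dbprocessing internals.
# A version is (interface, quality, revision); indices 0, 1, 2.

def resolve_output_version(oiv, name_exists, version_bump,
                           code_changed, parents_changed, force=False):
    """Pick the output version for a to-be-created file.

    oiv: code's output_interface_version (step 1)
    name_exists(version): True if that filename is already in the db (3a)
    version_bump: component index to bump explicitly, or None (3b)
    code_changed(version): index bumped due to code change, or None (3c)
    parents_changed(version): index bumped due to parent changes, or None (3d)
    """
    version = [oiv, 0, 0]           # step 1: start at OIV.0.0
    if force:                       # step 2: DBRunner path, keep OIV.0.0
        return tuple(version)
    while name_exists(tuple(version)):            # step 3a
        for which in (version_bump,               # 3b
                      code_changed(tuple(version)),    # 3c
                      parents_changed(tuple(version))):  # 3d
            if which is not None:
                version[which] += 1
                # zero any "smaller" components after the bumped one
                version[which + 1:] = [0] * (2 - which)
                break
        else:
            # step 3e: nothing to increment -> the real loop never exits
            raise RuntimeError('no possible increment; would loop forever')
    return tuple(version)
```

Pathology 1 falls straight out of this sketch: with (1,0,0) and (1,2,0) in the database and version_bump set to quality, the loop stops at (1,1,0).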

I believe these are pathologies and I will try to write runMe tests to capture them (and we can define desired behavior):

  1. If OIV.0.0 and OIV.2.0 exist in the database (which is weird but not inherently an inconsistency), and version_bump is set to 1, rule 3a above means the new file will be version OIV.1.0, not OIV.3.0 as I'd expect.
  2. Because the code check is done first, if code 1.0.0 makes output 1.0.0 from input 1.0.0, and then the input is updated to 1.1.0 and the code to 1.0.1, the output will be 1.0.1 (rule 3c supersedes 3d) but should probably be 1.1.0.
  3. If a code interface version is incremented without incrementing its OIV, 3c means that there can be weirdness where files don't match the OIV. So for instance code of version 1.0.0 has OIV 1 and makes an output file version 1.0.0. Then the code is updated to 2.0.0 because of changes to its inputs but OIV remains 1. Rule 1 combined with 3c means that the updated output file would be 2.0.0 not 1.1.0. This also violates the principle that @balarsen and I agree on that interface version increments should be intentional.
  4. If a code version 1.0.0 has OIV 1 and has created a child product version 1.0.0 from parent product 1.0.0, and then parent product 2.0.0 comes along, rules 1 and 3d mean that the new child is 1.0.0. Infinite loop.

I think pathology 1 is minor. I'd like to fix it, at a lower priority. I think the way to do that is to get the latest version of the file using getFilesByProductDate and utc_file_date (or frankly a new function to get a single file by a single date). Then start with output_version = OIV.0.0. If output_version is greater than the current existing version, great! Otherwise set output_version to the current version and do a single increment according to all the rules in 3. The filename would be made after the loop, which would now be an if statement. I think this would be faster...fewer database lookups per pass, one pass instead of multiple. See also fix for 2.

2 is medium-small priority; triggering it requires code changes and data changes at same time. I think this is fixable by changing _parentsChanged and _codeChanged to just report out what changed and then bump whichever is a bigger change between the two.
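A minimal sketch of that fix, assuming the two checks have each been reduced to "which component of the output version would I bump" (None for no change) and that any interface-level parent/code changes have already been mapped down per the fixes for 3 and 4:

```python
# Hypothetical fix for pathology 2: instead of the code check
# superseding the parent check, report both and apply a single bump
# for the more significant change. Indices: 0=interface, 1=quality,
# 2=revision; None means no change.

def combined_bump(version, code_change, parent_change):
    """Bump `version` once, at the more significant of the two changes."""
    changes = [c for c in (code_change, parent_change) if c is not None]
    if not changes:
        return tuple(version)
    which = min(changes)            # smaller index = more significant part
    bumped = list(version)
    bumped[which] += 1
    bumped[which + 1:] = [0] * (2 - which)  # zero the smaller components
    return tuple(bumped)
```

On the pathology-2 scenario (parent quality bump, code revision bump against output 1.0.0) this yields 1.1.0 rather than 1.0.1.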

3 is a bigger deal. I think all that needs to happen there is to update _codeVerChange so that an interface bump in the code results in a quality bump in the output_version.

4 is also a pretty big deal; again what needs to change is that if a parent has an increment in interface version, increment the quality version of the output_version.
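Fixes 3 and 4 share one rule: an interface bump upstream (in the code or in a parent) should land as a quality bump in the output, so that output interface versions only ever change deliberately via the code's OIV. A tiny hypothetical helper (names mine) capturing that mapping:

```python
# Hypothetical remedy for pathologies 3 and 4: translate an interface
# bump in the code or in a parent into a quality bump of the output.
# Indices into an x.y.z version tuple:
INTERFACE, QUALITY, REVISION = 0, 1, 2

def output_bump_for(change):
    """Map the changed component of a code/parent version to the
    component of the output version that should be bumped (None if
    nothing changed)."""
    if change is None:
        return None
    # interface changes upstream demote to a quality bump downstream;
    # quality and revision changes pass through unchanged
    return QUALITY if change == INTERFACE else change
```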

I actually have deployments waiting on 3 and 4, so can probably put something together for them relatively quickly.

balarsen commented 3 years ago

> @balarsen , I could have sworn we had a big writeup of how the versioning scheme worked somewhere, but I'm not finding it in your SOC_Processing_Chain_Setup.docx or the ECT SOC manual that I wrote. I know it didn't go in the original ICD.

You certainly did, I will see if I can dig it up. Likely only on my office desktop.

balarsen commented 3 years ago

This is also a good time for a discussion of what we actually want the versions to do and mean.

The initial intent was: versions are X.Y.Z

With some 20/20 hindsight.

jtniehof commented 3 years ago

On PSP we hide the internal version and the public only sees the release number. I was planning on that for ECT as well but for some reason that didn't happen.

jtniehof commented 3 years ago

For reference of "what we want the versions to do and mean", this is what I have for the ECT release versioning, which isn't exactly how it went. Note this is mostly about the external versioning.

Current SOC scheme

Every file has version x.y.z. With this the SOC can look up the versions of everything that went into the file (and reproduce it from level-zero files and codes). This version is not user friendly: two files made with the same codes but for different days may have different versions due to processing history.

Goals

  1. give users confidence they have the latest data
  2. allow users to report what version they used: enables reproducibility and also makes it easier for users to check caveats associated with a particular data set
  3. minimize confusion!

Agreements

  1. The per-file versioning scheme should remain but should not be the main user-visible versioning.
  2. Regularly perform "releases" of the entire data set with a single version number applied to all files
  3. Changes between releases should be documented on the SOC website, e.g.: "Release 8: new calibration tables for magEIS; changed energy channel assignment for REPT..."
  4. Files in a release should not change after release: e.g. HOPE level 2, 11 November 2012, release 2 will be a particular file in perpetuity. (So updated cals, etc. wait for next release.)
  5. Updated data requires an updated version: e.g. there should not be two files with different contents both labelled HOPE level 2, 11 November 2012, release 3.

Disagreements/open questions

  1. Handling of data before its first release? E.g., with a release in June 2013, what to do with July 2013 data before the September 2013 release? Suggested to add it to the previous release if codes have not changed. Difficult to implement (initial "thrash"). Another option is to have a "latest preliminary" directory.

Proposal

  1. A "latest preliminary" directory holds latest version of data for all days since the release, labelled with x.y.z version
  2. At release time, the entire contents of the "latest" directory are copied to a release directory. The filenames are changed to reflect the release number rather than the x.y.z version.
  3. The SOC already can maintain a list of the x.y.z version for each file in the release
  4. Even if a file has not changed between releases, it will still receive the number of the new release to indicate it is the latest released version. This may lead to some confusion (multiple files with different names and identical contents) but the alternative is releases that don't contain all the data!
  5. If dramatic changes bring released data seriously into doubt, perform an out-of-cycle release, using the normal naming/numbering scheme (no "patchlevels"!)
  6. The final release submitted to NSSDC shall be labelled "final", no release number (assumes no post-mission updates, e.g. HDEE?)
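Proposal steps 2 and 3 can be sketched as a small script; the filename convention (v{x}.{y}.{z} replaced by r{NN}) is my assumption, not the SOC's actual naming:

```python
# Illustrative sketch of proposal steps 2-3: copy the "latest
# preliminary" tree to a release directory, renaming the x.y.z version
# in each filename to the release number, and record the association
# between release filename and internal version.
import re
import shutil
from pathlib import Path

VER_RE = re.compile(r'v(\d+)\.(\d+)\.(\d+)')  # assumed filename convention

def make_release(latest_dir, release_dir, release):
    """Copy files into the release directory; return a mapping from
    release filename to the internal x.y.z version string."""
    dest = Path(release_dir)
    dest.mkdir(parents=True, exist_ok=True)
    mapping = {}
    for src in sorted(Path(latest_dir).iterdir()):
        m = VER_RE.search(src.name)
        if m is None:
            continue  # skip files without an x.y.z version in the name
        newname = VER_RE.sub('r%02d' % release, src.name)
        shutil.copy2(str(src), str(dest / newname))
        mapping[newname] = m.group(0)
    return mapping
```

Per agreement 4, the copied files are then never touched again; per proposal step 4, an unchanged file is still copied into the next release under the new release number.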

Knobs to twiddle

  1. Frequency of release (Probably every six months at least, every two months at most, likely 3 or 4)
  2. Numbering of the releases: sequential? (v01, v02...) Name by date? (June 2013--but then how do we do version numbers in filename?) Ubuntu-style? (v1306 for June 2013 release)
  3. Do we maintain old releases? How long? Under what circumstances?
  4. How much version information do we put in the CDF header? Version of the file? Version of the immediate inputs to that file? Version of EVERYTHING that went in? (Current plan is to store the full command line that generated the file).
  5. Do we keep "latest preliminary" for all data or just for that which has not yet been incorporated into a release?
  6. How do we set up autoplot VAPs? Point at released? At latest?
  7. Store x.y.z version in the CDF metadata so a "release" CDF can be associated with its SOC version? (we do store this association elsewhere)
  8. Store release version in the CDF metadata

jtniehof commented 3 years ago

For PSP, because the data aren't coming down daily (like on ECT) but in chunks of months at a time, we have scheduled releases which get all the newly downlinked data and any updates. To date we've only had one out-of-band release for just updates. The release notes give the basic idea.

The release script has a set of transforms based on SpacePy VarBundle operations that actually create a new file including a subset of the variables, in some cases transformed (e.g. leaving out bad channels). The dbp database structure is the same, with just a single table added, with columns for file ID and release number.
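That association table is simple enough to sketch directly; the table and column names here are my guesses, not the actual dbp schema (shown with sqlite for self-containment, though dbp itself uses SQLAlchemy):

```python
# Sketch of the PSP-style release association: a single table with
# columns for file ID and release number. Names are hypothetical.
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("""CREATE TABLE release (
    file_id INTEGER NOT NULL,
    release_num INTEGER NOT NULL,
    PRIMARY KEY (file_id, release_num))""")
# A file can appear in several releases if it did not change between
# them (per proposal step 4: unchanged files still get the new number).
conn.executemany('INSERT INTO release VALUES (?, ?)',
                 [(101, 1), (102, 1), (101, 2), (103, 2)])
files_in_r2 = [row[0] for row in conn.execute(
    'SELECT file_id FROM release WHERE release_num = ? ORDER BY file_id',
    (2,))]
```

The composite primary key enforces agreement 5's one-file-per-release constraint at the database level.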

In practice I've found having an internal and external version is just about essential; it lets you know what's going on internally, particularly with early analysis before public release, while keeping things simple externally.