microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal

Migrator: Write migrator for `NomAnalysisActivity` documents #2010

Closed (aclum closed this issue 3 weeks ago)

aclum commented 1 month ago

We need a migrator that searches for `NomAnalysisActivity` documents whose `id` does not have a version appended. For each of those records, it should append `.1` to the existing value in the `id` slot and move the original `id` value to `alternative_identifiers`.

Example before:

{
  "_id": {
    "$oid": "649b0095daa6c19f56b2777c"
  },
  "type": "nmdc:NomAnalysisActivity",
  "has_input": [
    "nmdc:dobj-13-hg0x7944"
  ],
  "has_output": [
    "nmdc:dobj-13-48nyp930"
  ],
  "id": "nmdc:wfnom-13-7yf9qj85",
  "ended_at_time": "2021-01-21T23:27:57Z",
  "execution_resource": "EMSL-RZR",
  "git_url": "https://github.com/microbiomedata/enviroMS",
  "started_at_time": "2021-01-21T23:27:57Z",
  "used": "12T_FTICR_B",
  "was_informed_by": "nmdc:omprc-11-3x68c186"
}

Example after:

{
  "_id": {
    "$oid": "649b0095daa6c19f56b2777c"
  },
  "type": "nmdc:NomAnalysisActivity",
  "has_input": [
    "nmdc:dobj-13-hg0x7944"
  ],
  "has_output": [
    "nmdc:dobj-13-48nyp930"
  ],
  "id": "nmdc:wfnom-13-7yf9qj85.1",
  "ended_at_time": "2021-01-21T23:27:57Z",
  "execution_resource": "EMSL-RZR",
  "git_url": "https://github.com/microbiomedata/enviroMS",
  "started_at_time": "2021-01-21T23:27:57Z",
  "used": "12T_FTICR_B",
  "was_informed_by": "nmdc:omprc-11-3x68c186",
  "alternative_identifiers": ["nmdc:wfnom-13-7yf9qj85"]
}
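
The before/after pair above can be sketched as a small Python function. This is only an illustration, not the migrator that was eventually merged: the function name `append_version` is hypothetical, and the rule "a version is a dot followed by digits at the end of the `id`" is an assumption inferred from the example.

```python
import re
from copy import deepcopy

# Assumption: a versioned id ends in a dot followed by one or more digits.
_VERSION_SUFFIX = re.compile(r"\.\d+$")


def append_version(doc: dict) -> dict:
    """Return a copy of `doc`; if its id is unversioned, append ".1" to it
    and record the original id in `alternative_identifiers`."""
    updated = deepcopy(doc)  # leave the caller's document untouched
    original_id = updated["id"]
    if _VERSION_SUFFIX.search(original_id):
        return updated  # already versioned, nothing to change
    updated["id"] = f"{original_id}.1"
    alternatives = updated.setdefault("alternative_identifiers", [])
    if original_id not in alternatives:
        alternatives.append(original_id)
    return updated
```

Working on a copy keeps the function free of side effects, which makes it easier to check against the example documents.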

Example migrators can be found at https://github.com/microbiomedata/nmdc-schema/tree/main/nmdc_schema/migrators

Target completion for this is 6/17. This migrator is needed for the 6/24 release; without it, the records will be invalid because that release will have more stringent pattern matching on IDs. cc @ssarrafan

eecavanna commented 1 month ago

Hi @JamesTessmer, all of the migrators, whether written for the nmdc-schema schema or the berkeley-schema-fy24 schema, can be found in the berkeley-schema-fy24 repository: https://github.com/microbiomedata/berkeley-schema-fy24/tree/main/nmdc_schema/migrators

eecavanna commented 1 month ago

Hi @aclum , I have a question. There are a few places in a migrator where schema version numbers are indicated; for example, each migrator's name has the format migrator_from_{initial_schema_version}_to_{final_schema_version}.py, and each migrator has a variable named _from_version and a variable named _to_version, etc. What are the "from version" and "to version" in this case? In other words, what schema versions will this migrator be used to migrate the database from and to?

eecavanna commented 1 month ago

@JamesTessmer, when the person writing a migrator doesn't know what the specific schema versions will be yet, I usually recommend that they either (a) make up some nonsensical versions (e.g. 0.0.0) and then mention in the PR that they are placeholder versions that will be updated to match the eventual starting/ending schema versions that go along with the migrator; or (b) specify the starting version as the currently released schema version and specify the ending version as some PR number (the number of the schema repository PR that introduced the relevant schema change).

Here's a (hypothetical) example:

migrator_from_10_3_0_to_PR123.py

The version numbers can remain as placeholders until the migrator is in a PR. In other words, they can remain as placeholders while writing and testing the migrator.

aclum commented 1 month ago

It will be from 10.3.0 to whatever version of nmdc-schema is released at the end of June. I propose 10.4.0 unless @turbomam objects.

eecavanna commented 1 month ago

Thanks, @aclum.

FYI @JamesTessmer, when writing the migrator, I recommend naming it migrator_from_10_3_0_to_10_4_0.py and setting [its class variables] _from_version = "10.3.0" and _to_version = "10.4.0". We can go back and edit those things during the PR review phase, if needed.
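
For illustration, the naming convention can be sketched as follows. The helper `migrator_filename` and the bare `Migrator` class are hypothetical stand-ins written for this example; the real migrators in the repository build on shared code that is not reproduced here.

```python
def migrator_filename(from_version: str, to_version: str) -> str:
    """Build a migrator file name, replacing dots in each version with underscores."""

    def slug(version: str) -> str:
        return version.replace(".", "_")

    return f"migrator_from_{slug(from_version)}_to_{slug(to_version)}.py"


class Migrator:
    """Stand-in class that only carries the two version class variables."""

    _from_version = "10.3.0"
    _to_version = "10.4.0"


print(migrator_filename(Migrator._from_version, Migrator._to_version))
# prints: migrator_from_10_3_0_to_10_4_0.py
```

A placeholder ending version such as `PR123` passes through unchanged, giving `migrator_from_10_3_0_to_PR123.py`.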

JamesTessmer commented 3 weeks ago

Added PR for this issue here: https://github.com/microbiomedata/nmdc-schema/pull/2059

@aclum @eecavanna What's the best way to test the migrator before marking the PR as ready for review?

eecavanna commented 3 weeks ago

Hi @JamesTessmer,

The test approach I consider to be the "lowest-hanging fruit" is to run the doctests. You can do that by running `poetry run python -m doctest -v /path/to/the/migrator.py`.
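
As a rough illustration of what the doctest command picks up, here is a small function whose docstring contains interactive examples. The function `has_version_suffix` is hypothetical and is not taken from the actual migrator; the "dot followed by digits" rule is an assumption based on the example IDs in this issue.

```python
import re


def has_version_suffix(identifier: str) -> bool:
    """Report whether an identifier ends in a dot followed by digits.

    >>> has_version_suffix("nmdc:wfnom-13-7yf9qj85")
    False
    >>> has_version_suffix("nmdc:wfnom-13-7yf9qj85.1")
    True
    """
    return re.search(r"\.\d+$", identifier) is not None
```

When a file containing this function is passed to `python -m doctest -v`, each `>>>` line is executed and its result is compared with the line that follows it.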

aclum commented 3 weeks ago

Merged with https://github.com/microbiomedata/nmdc-schema/pull/2059