microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal

Migrator: Update `migrator_from_10_3_0_to_10_4_0.py` so it also updates `was_generated_by` values #2086

Closed eecavanna closed 1 week ago

eecavanna commented 1 week ago

Update migrator_from_10_3_0_to_10_4_0.py as follows.

For each document in the nom_analysis_activity_set collection, check its has_output field, which is a list of the id values of data_object_set documents. For each of those data_object_set documents, if its was_generated_by field (a string) exists and its value doesn't already match .[0-9]+$, update that field so it contains the id of the nom_analysis_activity_set document that was pointing to it.
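To make the intended logic concrete, here is a minimal in-memory sketch of that step. It is not the migrator's actual code: the function name `update_was_generated_by` and the use of plain lists of dicts in place of the Mongo collections are assumptions made for illustration. It also assumes the pattern above is meant to detect a version suffix such as ".1", so the dot is escaped.

```python
import re
from typing import Any, Dict, List

# Assumption: the pattern in the issue is meant to match a version suffix
# like ".1" at the end of an id, so the dot is escaped here.
VERSION_SUFFIX = re.compile(r"\.[0-9]+$")


def update_was_generated_by(
    nom_activities: List[Dict[str, Any]],
    data_objects: List[Dict[str, Any]],
) -> int:
    """Hypothetical helper: repoint `was_generated_by` at the referencing activity.

    Mutates the `data_objects` dicts in place and returns how many were updated.
    """
    # Index the data objects by id so each `has_output` entry is a dict lookup.
    data_objects_by_id = {d["id"]: d for d in data_objects}
    num_updated = 0

    for activity in nom_activities:
        for data_object_id in activity.get("has_output", []):
            data_object = data_objects_by_id.get(data_object_id)
            if data_object is None:
                continue  # `has_output` references a document we don't have

            current_value = data_object.get("was_generated_by")
            if current_value is None:
                continue  # field does not exist, so leave the document alone
            if VERSION_SUFFIX.search(current_value):
                continue  # value already ends with a version suffix

            data_object["was_generated_by"] = activity["id"]
            num_updated += 1

    return num_updated
```

In the real migrator, the same loop would read from and write to the nom_analysis_activity_set and data_object_set collections instead of Python lists.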


~There could be something in the assets folder that checks the referential integrity of this.~

We can use the following mongosh query from @aclum to validate the transformed result (expecting 0 documents to be present in the result).

db.getCollection('data_object_set').aggregate(
  [
    {
      $match: {
        was_generated_by: { $exists: true }
      }
    },
    {
      $match: {
        was_generated_by: {
          $regex: RegExp('nmdc:wfnom')
        }
      }
    },
    {
      $lookup: {
        from: 'nom_analysis_activity_set',
        localField: 'was_generated_by',
        foreignField: 'id',
        as: 'nom_analysis_activity_set'
      }
    },
    {
      $match: {
        nom_analysis_activity_set: { $size: 0 }
      }
    }
  ],
  { maxTimeMS: 60000, allowDiskUse: true }
);
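For testing without a database, the same check can be sketched in plain Python. This is an illustrative equivalent of the query above, not part of the migrator: the function name `find_dangling_data_objects` and the in-memory lists standing in for the two collections are assumptions. As in the query, an empty result means every matching was_generated_by value resolves to an existing nom_analysis_activity_set document.

```python
from typing import Any, Dict, List


def find_dangling_data_objects(
    data_objects: List[Dict[str, Any]],
    nom_activities: List[Dict[str, Any]],
    id_fragment: str = "nmdc:wfnom",
) -> List[Dict[str, Any]]:
    """Hypothetical helper mirroring the aggregation pipeline stage by stage."""
    # Stands in for the $lookup: the set of ids that a reference may resolve to.
    activity_ids = {activity["id"] for activity in nom_activities}

    dangling = []
    for data_object in data_objects:
        value = data_object.get("was_generated_by")
        if not isinstance(value, str):
            continue  # first $match: the field must exist
        if id_fragment not in value:
            continue  # second $match: unanchored match on 'nmdc:wfnom'
        if value not in activity_ids:
            dangling.append(data_object)  # final $match: the lookup found nothing
    return dangling
```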

CC: @JamesTessmer @aclum

eecavanna commented 1 week ago

I ran the above Mongo query against a database containing data migrated using this migrator and got 0 documents as a result, which I believe is what @aclum (who provided the query) said would indicate a correct migration.