`berkeley-schema-fy24`: Implement migrator that merges some collections

microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model

https://microbiomedata.github.io/nmdc-schema/

Creative Commons Zero v1.0 Universal


`berkeley-schema-fy24`: Implement migrator that merges some collections #2045

Closed (eecavanna closed this issue 3 weeks ago)

eecavanna commented 4 weeks ago

Hi @aclum, I created this ticket to represent the task that came up during today's metadata meeting.

It sounded to me like you wanted all of the documents in one collection to be moved to another collection, and to have the first collection be deleted. Is there more to it than that (e.g. modifying fields within documents)?

eecavanna commented 4 weeks ago

Here are links to the schema documentation pages for the classes I think will be involved here.

PlannedProcess has these child classes:

CollectingBiosamplesFromSite - has no child classes
ProtocolExecution - has no child classes
StorageProcess - has no child classes
MaterialProcessing
DataGeneration
WorkflowChain - has no child classes
WorkflowExecution

MaterialProcessing has these child classes:

DataGeneration has these child classes:

WorkflowExecution has these child classes:

eecavanna commented 3 weeks ago

In terms of existing adapter methods, here's what I expect this migrator to do for each of those child classes (written here in pseudocode):

# Move all documents from the "pooling_set" collection into the "material_processing_set" collection,
# then delete the "pooling_set" collection.
self.adapter.do_for_each_document(
    collection_name="pooling_set",
    action=lambda document: self.adapter.insert_document(
        collection_name="material_processing_set", document=document
    ),
)
self.adapter.delete_collection(collection_name="pooling_set")
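
To make the pseudocode above concrete, here is a minimal, runnable sketch. `InMemoryAdapter` is a hypothetical stand-in written only for this example; it is not the adapter class from nmdc-schema, and its method and parameter names are assumptions modeled on the calls in the pseudocode. The document IDs are placeholders.

```python
from typing import Callable, Dict, List


class InMemoryAdapter:
    """Hypothetical stand-in for the migrator's adapter (not the real class)."""

    def __init__(self, database: Dict[str, List[dict]]) -> None:
        self.database = database

    def do_for_each_document(
        self, collection_name: str, action: Callable[[dict], None]
    ) -> None:
        # Iterate over a copy so the action can safely modify the database.
        for document in list(self.database.get(collection_name, [])):
            action(document)

    def insert_document(self, collection_name: str, document: dict) -> None:
        # Create the target collection on first insert.
        self.database.setdefault(collection_name, []).append(document)

    def delete_collection(self, collection_name: str) -> None:
        # Deleting a collection that does not exist is a no-op.
        self.database.pop(collection_name, None)


adapter = InMemoryAdapter(
    {"pooling_set": [{"id": "example:pool-1"}, {"id": "example:pool-2"}]}
)

# Copy each document into the merged collection, then drop the source.
adapter.do_for_each_document(
    collection_name="pooling_set",
    action=lambda document: adapter.insert_document(
        collection_name="material_processing_set", document=document
    ),
)
adapter.delete_collection(collection_name="pooling_set")
```

After this runs, `adapter.database` contains only `material_processing_set`, holding both documents.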

aclum commented 3 weeks ago

DataGeneration subclasses are already in a combined collection, so no action is needed there. All existing collections for children of MaterialProcessing should be combined into a new collection called material_processing_set; the same goes for WorkflowExecution. I'm assuming you want to put this migrator at the end, in which case you can use commit https://github.com/microbiomedata/berkeley-schema-fy24/commit/ca304e47916f9ff2825dcd854a7a936dfdd5b07f to determine what the starting Database slot names are. If it runs earlier, you may need to use the nmdc-schema Database slot names. Note that some of these subclasses never had a collection in Mongo, so the code should be able to handle that.
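
One way to read the requirement above (merge several source collections into one target, and tolerate sources that never existed) is sketched below against a plain dictionary rather than a real Mongo database. The function name `merge_collections` is hypothetical, and the source collection names in the usage are illustrative only; the actual list would come from the Database slot names at the linked commit.

```python
from typing import Dict, Iterable, List

Database = Dict[str, List[dict]]


def merge_collections(
    database: Database, source_names: Iterable[str], target_name: str
) -> int:
    """Move all documents from each source into the target; return the count moved."""
    moved = 0
    target = database.setdefault(target_name, [])
    for name in source_names:
        if name == target_name:
            continue  # Never merge the target into itself.
        # pop() with a default tolerates collections that never existed.
        documents = database.pop(name, [])
        target.extend(documents)
        moved += len(documents)
    return moved


db: Database = {
    "pooling_set": [{"id": "example:1"}],
    "extraction_set": [{"id": "example:2"}, {"id": "example:3"}],
}

# "library_preparation_set" is deliberately absent from db.
moved = merge_collections(
    db,
    ["pooling_set", "extraction_set", "library_preparation_set"],
    "material_processing_set",
)
```

The missing collection is skipped silently here; a real migrator might prefer to log it instead.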

eecavanna commented 3 weeks ago

Thanks! That was very helpful to me. I'm operating under the assumption that this will run after all migrators that have been implemented so far, so I'll refer to that commit you linked to.

P.S. I'll be out until about 9:45pm PT.

eecavanna commented 3 weeks ago

I implemented this migrator. It's in this PR: https://github.com/microbiomedata/berkeley-schema-fy24/pull/196