microbiomedata / nmdc_automation

Prototype automation

Duplicate records with incorrect versioning - NEON #54

Open aclum opened 6 months ago

aclum commented 6 months ago

This is a generic issue for cases where multiple records of a given workflow execution activity are not versioned correctly, specifically where two records have different IDs instead of the ID's version being incremented. We need to:

Example: I can see two directories on CFS and there are two records in mongo. There are two mags_activity_set entries on the file system and in mongo with a query of {'was-informed_by' : 'nmdc:omprc-11-14ermv40'}:

nmdc:wfmag-11-wpcgh271.1
nmdc:wfmag-11-v5475v49.1
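A rough sketch of how duplicates like these could be flagged once the records are exported from mongo. It assumes the ID shape seen above (nmdc:<workflow type>-<number>-<suffix>.<version>) and uses plain dicts with hypothetical keys "id" and "was_informed_by"; the helper name find_duplicates is made up for illustration and is not part of nmdc_automation.

```python
import re
from collections import defaultdict

# Assumed ID shape, inferred from the examples in this issue:
#   nmdc:<workflow type>-<number>-<suffix>.<version>
ID_PATTERN = re.compile(
    r"^nmdc:(?P<wf_type>[a-z]+)-(?P<shoulder>\d+)-(?P<blade>[a-z0-9]+)\.(?P<version>\d+)$"
)


def find_duplicates(records):
    """Group records by (was_informed_by, workflow type) and return the
    groups that contain more than one distinct base ID (ID minus version)."""
    groups = defaultdict(set)
    for rec in records:
        match = ID_PATTERN.match(rec["id"])
        if match is None:
            continue  # skip IDs that do not follow the assumed shape
        base_id = rec["id"].rsplit(".", 1)[0]
        groups[(rec["was_informed_by"], match["wf_type"])].add(base_id)
    return {key: sorted(ids) for key, ids in groups.items() if len(ids) > 1}


# The two records from the example above, as hypothetical exported dicts.
records = [
    {"id": "nmdc:wfmag-11-wpcgh271.1", "was_informed_by": "nmdc:omprc-11-14ermv40"},
    {"id": "nmdc:wfmag-11-v5475v49.1", "was_informed_by": "nmdc:omprc-11-14ermv40"},
]
duplicates = find_duplicates(records)
print(duplicates)
```

A correctly versioned rerun would share one base ID and differ only in the version suffix, so it would not show up in this result.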

aclum commented 6 months ago

related to https://github.com/microbiomedata/issues/issues/547

mbthornton-lbl commented 6 months ago

@Michal-Babins any chance that this might be related? https://github.com/microbiomedata/nmdc_automation/pull/22

Michal-Babins commented 6 months ago

Yes, it very well might be. I would check with Shane.

aclum commented 4 months ago

This is still happening. We should address this before we process anything else or we are just creating a cleanup exercise for ourselves. @Michal-Babins Do you have time to work on this next sprint?

aclum commented 4 months ago

TRiP that ran a few days ago has 3 annotations and 3 MAGs

aclum@perlmutter:login13:/global/cfs/cdirs/m3408/results/nmdc:omprc-11-9mvz7z22> ls -ltr
total 9
drwxrws--- 2 nmdcda m3408 4096 Jan 11 13:16 nmdc:wfrqc-11-t0tvnp52.1
drwxrws--- 2 nmdcda m3408 4096 Feb 7 13:11 nmdc:wfrbt-11-pmdhac23.1
drwxrws--- 2 nmdcda m3408 4096 Feb 7 14:54 nmdc:wfmgas-11-rcs4bt79.1
drwxrws--- 2 nmdcda m3408 4096 Feb 7 22:03 nmdc:wfmgan-11-4sc85678.1
drwxrws--- 2 nmdcda m3408 4096 Feb 7 22:04 nmdc:wfmgan-11-mmt28267.1
drwxrws--- 2 nmdcda m3408 4096 Feb 7 22:04 nmdc:wfmgan-11-hdaenp36.1
drwxrws--- 2 nmdcda m3408 4096 Feb 7 22:13 nmdc:wfmag-11-m8tn3y26.1
drwxrws--- 2 nmdcda m3408 4096 Feb 7 22:13 nmdc:wfmag-11-zcwca422.1
drwxrws--- 2 nmdcda m3408 4096 Feb 7 22:13 nmdc:wfmag-11-9dgz7m72.1
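For spot checks on the file system side, something like the following could tally the result directories per workflow type. This is only a sketch: the directory names are copied from the listing above, and it assumes the workflow type is the text between "nmdc:" and the first "-". The workflow_type helper is hypothetical.

```python
from collections import Counter

# Directory names taken from the ls output for nmdc:omprc-11-9mvz7z22.
dir_names = [
    "nmdc:wfrqc-11-t0tvnp52.1",
    "nmdc:wfrbt-11-pmdhac23.1",
    "nmdc:wfmgas-11-rcs4bt79.1",
    "nmdc:wfmgan-11-4sc85678.1",
    "nmdc:wfmgan-11-mmt28267.1",
    "nmdc:wfmgan-11-hdaenp36.1",
    "nmdc:wfmag-11-m8tn3y26.1",
    "nmdc:wfmag-11-zcwca422.1",
    "nmdc:wfmag-11-9dgz7m72.1",
]


def workflow_type(dir_name):
    """Return the assumed workflow type, e.g. 'wfmgan' for 'nmdc:wfmgan-11-4sc85678.1'."""
    return dir_name.split(":", 1)[1].split("-", 1)[0]


# Count directories per workflow type and keep the types that occur more than once.
counts = Counter(workflow_type(name) for name in dir_names)
repeated = {wf: n for wf, n in counts.items() if n > 1}
print(repeated)
```

Here the annotation (wfmgan) and MAG (wfmag) types each appear three times, matching the 3 annotations and 3 MAGs noted above.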

Michal-Babins commented 4 months ago

I generated records of the duplicates found in annotation and MAGs. @mbthornton-lbl do we want to add these JSON dumps to the re-iding workflow and do it all in one sweep?

mbthornton-lbl commented 4 months ago

Yes.

Michal-Babins commented 4 months ago

I added those here: https://github.com/microbiomedata/nmdc_automation/commit/f86f4f62a6a43f15c180806ea9b3d3debac40c67

mbthornton-lbl commented 4 months ago

@Michal-Babins delete-old-records for the duplicates has been applied to the Napa DB instance

ssarrafan commented 4 months ago

Appears to be active. Moving to next sprint.

ssarrafan commented 3 months ago

@mbthornton-lbl can you please check to see if this got merged? FYI @aclum

aclum commented 3 months ago

It appears we are done with the first two parts of this; Shane's new MAG runs incremented correctly. We still need to do data cleanup, but that can wait until a future sprint. Let's backlog the remaining work for this.