microbiomedata / nmdc_automation

Prototype automation

EMP500 re-id workflow results are not incrementing correctly - different blade + same version #162

Closed aclum closed 4 months ago

aclum commented 4 months ago

If multiple workflow results are run for the same input, they should use the same ID base and increment the number after the dot (i.e. nmdc:wfrqc-11-kczpby64.1, then nmdc:wfrqc-11-kczpby64.2). If you look at workflow records that have {'was_informed_by': 'nmdc:omprc-11-gcfqfy91'} in napa, you'll see 4 reads based analysis, 4 reads QC, and 4 assembly records, all incremented with .1.
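For illustration, a minimal sketch of that increment behavior, assuming we already have the list of IDs present in the collection (the function and variable names are hypothetical, not from the automation code):

def next_versioned_id(base_id: str, existing_ids: list[str]) -> str:
    """Return base_id with the next .N suffix, given the IDs already in the collection."""
    versions = [
        int(i.rsplit(".", 1)[1])
        for i in existing_ids
        if i.startswith(base_id + ".") and i.rsplit(".", 1)[1].isdigit()
    ]
    return f"{base_id}.{max(versions, default=0) + 1}"

# e.g. next_versioned_id("nmdc:wfrqc-11-kczpby64", ["nmdc:wfrqc-11-kczpby64.1"])
# returns "nmdc:wfrqc-11-kczpby64.2"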

We either need to get rid of these extra records or increment them correctly.

Solution discussed today: write an ad hoc function that runs after re-iding and removes duplicate records.

Implementation:

1) The script takes a study ID as input.

2) Using a filter on the study, aggregate metagenome_assembly_set records by was_informed_by and return the values where the count is greater than 1:

db.getCollection(
'metagenome_assembly_set'
).aggregate(
[
{ $group: { _id: '$was_informed_by', count: { $sum: 1 } } },
{ $match: { count: { $gt: 1 } } },
{
$lookup: {
from: 'omics_processing_set',
localField: '_id',
foreignField: 'id',
as: 'omics_processing_set'
}
},
{ $match: { 'omics_processing_set.part_of': '$NMDC_STUDY_ID' } }
],
{ maxTimeMS: 60000, allowDiskUse: true }
);

3) For each of the was_informed_by values, stored as _id in the response cursor, query metagenome_assembly_set for the records that match that value for was_informed_by, sorted by newest mongo record or ended_at_time (the example below sorts by the mongo document date extracted from _id), and keep the newest. Get has_input from that document. Match on has_output to get the read QC record from read_qc_analysis_activity_set. Match on has_input to get the read based analysis record from read_based_taxonomy_analysis_activity_set.

db.getCollection(
'metagenome_assembly_set'
).aggregate(
[
{
$match: {
was_informed_by: '$OMICS_RECORD'
}
},
{ $sort: { _id: -1 } },
{ $limit: 1 },
{
$lookup: {
from: 'read_qc_analysis_activity_set',
localField: 'has_input',
foreignField: 'has_output',
as: 'upstream_filtering'
}
},
{
$lookup: {
from: 'read_based_taxonomy_analysis_activity_set',
localField: 'has_input',
foreignField: 'has_input',
as: 'rbt_record_to_keep'
}
}
],
{ maxTimeMS: 60000, allowDiskUse: true }
);
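For context, a minimal pymongo sketch of how steps 1–3 could be driven from a script, using the collection and field names shown in the shell queries above; the function name, database handle, and study ID below are hypothetical:

from pymongo import MongoClient

def find_records_to_keep(db, study_id):
    """For each duplicated was_informed_by value in a study, record the newest assembly
    plus its upstream read QC and read based analysis records as the ones to keep."""
    dupes = db["metagenome_assembly_set"].aggregate([
        {"$group": {"_id": "$was_informed_by", "count": {"$sum": 1}}},
        {"$match": {"count": {"$gt": 1}}},
        {"$lookup": {"from": "omics_processing_set", "localField": "_id",
                     "foreignField": "id", "as": "omics_processing_set"}},
        {"$match": {"omics_processing_set.part_of": study_id}},
    ], allowDiskUse=True)

    keep = {}
    for dupe in dupes:
        omics_id = dupe["_id"]
        newest = db["metagenome_assembly_set"].aggregate([
            {"$match": {"was_informed_by": omics_id}},
            {"$sort": {"_id": -1}},  # newest mongo document first
            {"$limit": 1},
            {"$lookup": {"from": "read_qc_analysis_activity_set",
                         "localField": "has_input", "foreignField": "has_output",
                         "as": "upstream_filtering"}},
            {"$lookup": {"from": "read_based_taxonomy_analysis_activity_set",
                         "localField": "has_input", "foreignField": "has_input",
                         "as": "rbt_record_to_keep"}},
        ]).next()
        keep[omics_id] = {
            "assembly": newest["id"],
            "filtering": [r["id"] for r in newest["upstream_filtering"]],
            "read_based": [r["id"] for r in newest["rbt_record_to_keep"]],
        }
    return keep

# usage: keep = find_records_to_keep(MongoClient()["nmdc"], "nmdc:sty-11-xxxxxxxx")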

Add the value of id to the metagenome assembly records to keep, upstream_filtering.id to the filtering records to keep, and rbt_record_to_keep.id to the reads based analysis records to keep.

3.1) Using the same was_informed_by value, look for records to delete. Get the assembly records to delete:

db.getCollection('metagenome_assembly_set').find({
was_informed_by: '$OMICS_PROCESSING',
id: {
$not: {
$regex:
'$ID'
}
}
});

Delete the record and its has_output data objects.
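A minimal sketch of that delete step, assuming the produced data objects live in a data_object_set collection keyed by id (the helper name is hypothetical); the same pattern applies to the filtering and read based analysis records below:

def delete_record_and_outputs(db, collection, record):
    """Delete one duplicate workflow record plus the data objects it produced."""
    for data_object_id in record.get("has_output", []):
        db["data_object_set"].delete_one({"id": data_object_id})
    db[collection].delete_one({"id": record["id"]})

# usage with the assembly query above:
# for rec in db["metagenome_assembly_set"].find(
#         {"was_informed_by": omics_id, "id": {"$not": {"$regex": keep_id}}}):
#     delete_record_and_outputs(db, "metagenome_assembly_set", rec)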

Get filtering records to delete

db.getCollection('read_qc_analysis_activity_set').find({
was_informed_by: '$OMICS_PROCESSING',
id: {
$not: {
$regex:
'$UPSTREAM_FILTERING.ID'
}
}
});

Delete the record and its has_output data objects.

Get the read based analysis records to delete

db.getCollection('read_based_taxonomy_analysis_activity_set').find({
was_informed_by: '$OMICS_PROCESSING',
id: {
$not: {
$regex:
'$RBT_RECORD_TO_KEEP.ID'
}
}
});

Delete the record and the has_output data objects.

Validation test data: in a non-re-ID'd prod, for a was_informed_by of gold:Gp0452544 it would keep:

metagenome_assembly_set id nmdc:53741f195cd67218562374f91e88052c
read_based_taxonomy_analysis_activity_set id nmdc:53741f195cd67218562374f91e88052c
read_qc_analysis_activity_set id nmdc:53741f195cd67218562374f91e88052c

and delete:

metagenome_assembly_set id nmdc:23f566e73d1e6095cfe49ff57bb2ad4c
read_based_taxonomy_analysis_activity_set id nmdc:23f566e73d1e6095cfe49ff57bb2ad4c
read_qc_analysis_activity_set id nmdc:23f566e73d1e6095cfe49ff57bb2ad4c
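As a quick post-cleanup sanity check, a small pymongo sketch along these lines could confirm that only the keep records remain for that was_informed_by value (the function name and the assumption of exactly one surviving record per collection are mine, not from the issue):

def check_cleanup(db, omics_id, keep_id):
    for coll in ("metagenome_assembly_set",
                 "read_qc_analysis_activity_set",
                 "read_based_taxonomy_analysis_activity_set"):
        remaining = [r["id"] for r in db[coll].find({"was_informed_by": omics_id}, {"id": 1})]
        assert remaining == [keep_id], f"{coll}: unexpected records {remaining}"

# check_cleanup(db, "gold:Gp0452544", "nmdc:53741f195cd67218562374f91e88052c")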

**Note:** the EMP500 records, I think, will all be from an era where the IDs are shared across the collections for the sequencing workflow results.

mbthornton-lbl commented 4 months ago

Depends on #163 to ensure that the functional agg records for these records are also deleted.

mbthornton-lbl commented 4 months ago

Original solution: Michal created a JSON data file with the duplicate workflow records to be deleted. This data file could then be used as an input to the delete-old-records command

ssarrafan commented 4 months ago

Appears active. Moving to new sprint for review. Please remove from sprint if not active.

aclum commented 4 months ago

This has been tested in dev, awaiting review of the PR before applying on prod.

aclum commented 4 months ago

This is fixed in prod using queries:run delete commands. Documents to delete were identified with nmdc_schema/identify_workflow_duplicates_emp500.py
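For reference, a hedged sketch of the kind of Mongo-style delete command document that could be submitted through queries:run, assuming it accepts the standard MongoDB delete command shape; the id is a placeholder, not one of the documents that was actually removed:

delete_command = {
    "delete": "metagenome_assembly_set",
    "deletes": [
        # one clause per duplicate id identified by identify_workflow_duplicates_emp500.py
        {"q": {"id": "nmdc:EXAMPLE_DUPLICATE_ID"}, "limit": 1},
    ],
}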