Re-ID Studies in Napa, verify schema compliance, and ensure pre-requisites for Prod Re-ID

mbthornton-lbl commented 4 months ago

Test Runs on Napa instance and pre-requisites for Metagenomic workflows. Note: Context for this test run assumes the following updates have been applied to the testDB instance:

Updates to study, biosample, and omics_processing
Updates for Metaproteomic, Metabolomic and NOM workflows and data

https://github.com/microbiomedata/issues/issues/532

Re-ID Studies:

"Stegen": nmdc:sty-11-aygzgv51, formerly gold:Gs0114663
"SPRUCE": nmdc:sty-11-33fbta56, formerly gold:Gs0110138
"EMP": nmdc:sty-11-547rwq94, formerly gold:Gs0154244
"Luquillo": nmdc:sty-11-076c9980, formerly gold:Gs0128850
"CrestedButte": nmdc:sty-11-dcqce727, formerly gold:Gs0135149
"DeepShale": nmdc:sty-11-8fb6t785, formerly gold:Gs0114675
"Populus": nmdc:sty-11-1t150432, formerly gold:Gs0103573

Pre-Requisites - complete ETL on Napa instance and verify:

[x] #1793
[x] #1789
[x] #1792
[x] #1794
[x] #1814
[x] #1835
[x] #1836
[x] #1865

Pre-Requisites - All ETL recipes fully reproducible:

Study, BioSample and Omics:

[x] https://github.com/microbiomedata/nmdc_automation/issues/66
[x] #1833
[x] #1834

Metagenomics:

[x] Ensure metagenomics re_id_tool.py ready for prod run
[x] https://github.com/microbiomedata/nmdc_automation/issues/87
[x] Metabolomics / NOM #1928

mbthornton-lbl commented 4 months ago

SPARQL query for Orphaned DataObjects:

PREFIX nmdc: <https://w3id.org/nmdc/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
# Orphan DataObjects - not object of has_input or has_output
select * where { 
    ?do a nmdc:DataObject .
    minus {
        ?o nmdc:has_input ?do .
    }
    minus {
        ?o nmdc:has_output ?do .
    }
} limit 100
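
The same set difference can be sketched in plain Python for records that are already loaded into memory. This is only a minimal sketch: the `find_orphan_data_objects` helper and the sample IDs are hypothetical, and only the `has_input` / `has_output` slot names come from the query above.

```python
def find_orphan_data_objects(data_object_ids, activities):
    """Return the data object IDs that no activity lists as input or output."""
    referenced = set()
    for activity in activities:
        # Either slot may be absent on a record, so default to an empty list.
        referenced.update(activity.get("has_input", []))
        referenced.update(activity.get("has_output", []))
    # Keep the original ordering of the data object IDs.
    return [do_id for do_id in data_object_ids if do_id not in referenced]


# Hypothetical sample records, for illustration only.
sample_data_object_ids = [
    "nmdc:dobj-11-example1",
    "nmdc:dobj-11-example2",
    "nmdc:dobj-11-example3",
]
sample_activities = [
    {"id": "nmdc:wfmag-11-example9.1",
     "has_input": ["nmdc:dobj-11-example1"],
     "has_output": ["nmdc:dobj-11-example2"]},
]

orphans = find_orphan_data_objects(sample_data_object_ids, sample_activities)
print(orphans)
```

Unlike the query, this sketch has no `limit`, so it reports every orphan it finds.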

aclum commented 4 months ago

I reopened SPRUCE, but this likely pertains to all the studies. We are missing the deletion of some of the binning data object records, e.g. {'description':{$regex:/Gp0208377/}} from the SPRUCE example. The data object types for these records are Metagenome Bins, CheckM Statistics, or null. The null ones, based on this example, could be captured by a case-insensitive search for metabat2 on the description slot.
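
A rough sketch of that selection logic in Python, applied to records already fetched as dicts. The `is_binning_candidate` helper and the sample descriptions in the usage lines are hypothetical; the project ID, the two type names, and the metabat2 search come from the comment above. It assumes the `description` and `data_object_type` slots are plain strings or missing.

```python
import re

GOLD_PROJECT = re.compile(r"Gp0208377")            # SPRUCE example project
METABAT2 = re.compile(r"metabat2", re.IGNORECASE)  # case-insensitive search
BINNING_TYPES = {"Metagenome Bins", "CheckM Statistics"}


def is_binning_candidate(record):
    """True if the record looks like a binning data object for the example project."""
    description = record.get("description") or ""
    if not GOLD_PROJECT.search(description):
        return False
    data_type = record.get("data_object_type")
    if data_type in BINNING_TYPES:
        return True
    # Untyped (null) records only count when the description mentions metabat2.
    return data_type is None and bool(METABAT2.search(description))


# Hypothetical records, for illustration only.
print(is_binning_candidate(
    {"description": "MetaBat2 bins for Gp0208377", "data_object_type": None}))
print(is_binning_candidate(
    {"description": "Assembly contigs for Gp0208377", "data_object_type": None}))
```

To cover all the studies, the project pattern would need to be swapped or dropped; this sketch keeps it fixed to match the example.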

mbthornton-lbl commented 4 months ago

@aclum Are we deleting all Binning data objects, or only those with a non-compliant ID? Should records like this one:

record: nmdc:dobj-11-qm3fbt63 CheckM Statistics CheckM for nmdc:wfmag-11-m0t5hc17.1

be deleted?

No, only non-compliant identifiers.
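
One way to sketch the "non-compliant only" rule in Python. The pattern below is an assumption inferred from the IDs quoted in this thread (type code, shoulder, blade, optional version suffix); it is not the schema's authoritative pattern, which should be taken from nmdc-schema itself. The helper names are hypothetical.

```python
import re

# ASSUMED shape, inferred from IDs in this thread, e.g.
# nmdc:dobj-11-qm3fbt63 and nmdc:wfmag-11-m0t5hc17.1
COMPLIANT_ID = re.compile(r"nmdc:[a-z]+-[0-9]+-[a-z0-9]+(\.[0-9]+)?")


def is_compliant_id(identifier):
    """True if the identifier matches the assumed compliant shape."""
    return COMPLIANT_ID.fullmatch(identifier) is not None


def should_delete(record):
    """Only records whose ID is non-compliant are deletion candidates."""
    return not is_compliant_id(record["id"])


# The record from the question above has a compliant ID, so it is kept.
print(should_delete({"id": "nmdc:dobj-11-qm3fbt63"}))
```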

ssarrafan commented 3 months ago

@mbthornton-lbl will be continuing to work on this in the next sprint per Slack message. Moving over.

microbiomedata / nmdc-schema

Re-ID Studies in Napa, verify schema compliance, and ensure pre-requisites for Prod Re-ID #1807