microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal

Re-ID Studies in Napa, verify schema compliance, and ensure pre-requisites for Prod Re-ID #1807

Closed mbthornton-lbl closed 2 months ago

mbthornton-lbl commented 4 months ago

Test runs on the Napa instance and pre-requisites for Metagenomic workflows. Note: this test run assumes that the following updates have been applied to the testDB instance:

https://github.com/microbiomedata/issues/issues/532

Re-ID Studies:

  1. "Stegen": nmdc:sty-11-aygzgv51, formerly gold:Gs0114663
  2. "SPRUCE": nmdc:sty-11-33fbta56, formerly gold:Gs0110138
  3. "EMP": nmdc:sty-11-547rwq94, formerly gold:Gs0154244
  4. "Luquillo": nmdc:sty-11-076c9980, formerly gold:Gs0128850
  5. "CrestedButte": nmdc:sty-11-dcqce727, formerly gold:Gs0135149
  6. "DeepShale": nmdc:sty-11-8fb6t785, formerly gold:Gs0114675
  7. "Populus": nmdc:sty-11-1t150432, formerly gold:Gs0103573

Pre-Requisites - complete ETL on Napa instance and verify:

Pre-requisites - All ETL recipes fully reproducible:

Study, BioSample and Omics:

Metagenomics:

mbthornton-lbl commented 4 months ago

SPARQL query for Orphaned DataObjects:

PREFIX nmdc: <https://w3id.org/nmdc/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
# Orphan DataObjects - not object of has_input or has_output
select * where { 
    ?do a nmdc:DataObject .
    minus {
        ?o nmdc:has_input ?do .
    }
    minus {
        ?o nmdc:has_output ?do .
    }
} limit 100 
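
For readers who want to cross-check the query result outside the triple store, the same orphan logic can be sketched in plain Python. This is only an illustration: the record shapes (a list of DataObject IDs plus activity dicts carrying `has_input` and `has_output` lists) and the example IDs are assumptions made for the sketch, not records from the Napa instance.

```python
def find_orphan_data_objects(data_object_ids, activities):
    """Return the IDs that no activity lists under has_input or has_output."""
    referenced = set()
    for activity in activities:
        # Either slot may be absent on a given activity record.
        referenced.update(activity.get("has_input", []))
        referenced.update(activity.get("has_output", []))
    return sorted(set(data_object_ids) - referenced)


# Hypothetical IDs, for illustration only.
example_ids = ["nmdc:dobj-11-example1", "nmdc:dobj-11-example2", "nmdc:dobj-11-example3"]
example_activities = [
    {"id": "nmdc:wfmag-11-example.1", "has_input": ["nmdc:dobj-11-example1"]},
    {"id": "nmdc:wfmag-11-example.2", "has_output": ["nmdc:dobj-11-example2"]},
]
print(find_orphan_data_objects(example_ids, example_activities))
```

Like the two `minus` clauses in the query, a DataObject counts as an orphan only when it appears in neither slot of any activity.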

aclum commented 4 months ago

I reopened SPRUCE, but this likely pertains to all the studies. We are failing to delete some of the binning data object records, e.g. those matching {'description':{$regex:/Gp0208377/}} from the SPRUCE example. Their data object type is Metagenome Bins, CheckM Statistics, or null. The null ones, based on this example, could be captured by a case-insensitive search for metabat2 on the description slot.
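
A rough sketch of that selection, assuming the type is stored in a slot named `data_object_type` and that the project ID appears verbatim in `description`. The helper names are hypothetical. The first function builds a MongoDB-style filter document; the second applies the same rule to an in-memory record so the logic can be checked without a database.

```python
import re

# Types named in the comment above; untyped (null) records are handled separately.
BINNING_TYPES = {"Metagenome Bins", "CheckM Statistics"}


def build_binning_filter(project_id):
    """Build a MongoDB-style filter for leftover binning records of one project."""
    return {
        "description": {"$regex": project_id},
        "$or": [
            {"data_object_type": {"$in": sorted(BINNING_TYPES)}},
            # Untyped records: fall back to a case-insensitive metabat2 match.
            {
                "data_object_type": None,
                "description": {"$regex": "metabat2", "$options": "i"},
            },
        ],
    }


def is_leftover_binning_record(record, project_id):
    """Apply the same rule to a single record held in memory."""
    description = record.get("description") or ""
    if project_id not in description:
        return False
    data_object_type = record.get("data_object_type")
    if data_object_type in BINNING_TYPES:
        return True
    return (
        data_object_type is None
        and re.search("metabat2", description, re.IGNORECASE) is not None
    )
```

Whether a matched record should actually be deleted still depends on its identifier, so this only narrows the candidates.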

mbthornton-lbl commented 4 months ago

@aclum Are we deleting all Binning data objects, or only those with a non-compliant ID? Should records like this one:

record: nmdc:dobj-11-qm3fbt63 CheckM Statistics CheckM for nmdc:wfmag-11-m0t5hc17.1

be deleted?

No, only those with non-compliant identifiers.
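
As a hedged illustration of how records might be split on that basis, the sketch below tests identifiers against a pattern inferred only from the IDs quoted in this thread (for example nmdc:dobj-11-qm3fbt63 and nmdc:wfmag-11-m0t5hc17.1). The pattern and the function name are assumptions; the authoritative ID patterns live in the nmdc-schema itself.

```python
import re

# Assumed shape: nmdc:<typecode>-<shoulder>-<blade>, with an optional .<version>.
# Inferred from examples in this thread, not copied from the schema.
ASSUMED_ID_PATTERN = re.compile(r"nmdc:[a-z]+-\d+-[a-z0-9]+(\.\d+)?")


def looks_compliant(identifier):
    """Return True if the identifier matches the assumed new-style shape."""
    return ASSUMED_ID_PATTERN.fullmatch(identifier) is not None


for candidate in ["nmdc:dobj-11-qm3fbt63", "nmdc:wfmag-11-m0t5hc17.1", "gold:Gs0114663"]:
    print(candidate, looks_compliant(candidate))
```

Under this assumed pattern the CheckM Statistics record above would be kept, since its ID has the new-style shape.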

ssarrafan commented 3 months ago

@mbthornton-lbl will be continuing to work on this in the next sprint per Slack message. Moving over.