populate GraphDB from https://api-napa.microbiomedata.org/

study_set, biosample_set, omics_processing_set, subclasses of WorkflowExecutionActivity, data_object_set, functional_annotation_agg. extraction_set records are not needed, since those only existed after napa-style ids.

turbomam commented 11 months ago

@aclum please confirm and see my notes

Yes, dump

biosample_set
data_object_set
omics_processing_set
study_set
all collections for the subclasses of WorkflowExecutionActivity

No, dumping not required

functional_annotation_agg (MAM: I didn't have any plans in general to dump, migrate, validate or convert this because it is so large)
extraction_set (not needed since it only existed after napa-style ids)

turbomam commented 11 months ago

validating against v7.8.0 also fails due to illegal upper-case GOLD: prefixes

aclum commented 11 months ago

I thought I fixed all the gold uppercasing in an ad hoc fashion. Do you have an example record? I can fix.

turbomam commented 11 months ago

Finishing thought about collections to be migrated:

PREFIX nmdc: <https://w3id.org/nmdc/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX nmdc_data: <mongodb://mongo-loadbalancer.nmdc.production.svc.spin.nersc.gov:27017>
select distinct ?collection_name
where {
    graph nmdc:nmdc {
        ?mixin rdfs:domain nmdc:Database .
        ?database_slot rdfs:subPropertyOf ?mixin .
        ?database_slot rdfs:range ?slot_range .
        ?slot_range rdfs:subClassOf nmdc:WorkflowExecutionActivity .
    }
    graph nmdc_data: {
        ?x ?database_slot ?z .
    }
    bind(strafter(str(?database_slot), "https://w3id.org/nmdc/") as ?collection_name)
}
order by ?database_slot

mags_activity_set
metabolomics_analysis_activity_set
metagenome_annotation_activity_set
metagenome_assembly_set
metagenome_sequencing_activity_set
metatranscriptome_activity_set
nom_analysis_activity_set
read_based_taxonomy_analysis_activity_set
read_qc_analysis_activity_set

turbomam commented 11 months ago

> I thought I fixed all the gold uppercasing in an ad hoc fashion. Do you have an example record. I can fix.

regenerating validation log now

turbomam commented 11 months ago

Explicitly excluding these routinely-dumped collections

These are small, so including them wouldn't take much longer or make the output files much larger.

--selected-collections extraction_set \
--selected-collections field_research_site_set \
--selected-collections library_preparation_set \
--selected-collections pooling_set \
--selected-collections processed_sample_set \

turbomam commented 11 months ago

@aclum yes, you did correct all Genomes OnLine Database prefixes to "gold:", in accordance with the more recent schema releases.

But nmdc-schema v7.8.0 (which otherwise matches the frozen Napa MongoDB) expected the following slots to use the following patterns:

gold_biosample_identifiers: '^GOLD:Gb[0-9]+$'
gold_sequencing_project_identifiers: '^GOLD:Gp[0-9]+$'
gold_study_identifiers: '^GOLD:Gs[0-9]+$'

I think I can do a quick and dirty fix for this.
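
A minimal sketch of what such a fix might look like (not the code that was actually run). The three patterns are copied from the v7.8.0 list above; the function names and the sample identifier are hypothetical, and the sketch assumes each of these slots, when present, holds a list of CURIE strings.

```python
import re

# Patterns that nmdc-schema v7.8.0 expects for these slots (from the list above).
V7_8_0_PATTERNS = {
    "gold_biosample_identifiers": re.compile(r"^GOLD:Gb[0-9]+$"),
    "gold_sequencing_project_identifiers": re.compile(r"^GOLD:Gp[0-9]+$"),
    "gold_study_identifiers": re.compile(r"^GOLD:Gs[0-9]+$"),
}


def upcase_gold_prefixes(record):
    """Return a shallow copy of record with 'gold:' rewritten to 'GOLD:'.

    Only the three GOLD identifier slots are touched.
    Assumption: each slot value is a list of CURIE strings.
    """
    fixed = dict(record)
    for slot in V7_8_0_PATTERNS:
        if slot in fixed:
            fixed[slot] = [re.sub(r"^gold:", "GOLD:", value) for value in fixed[slot]]
    return fixed


def invalid_gold_values(record):
    """List (slot, value) pairs that do not match the v7.8.0 patterns."""
    return [
        (slot, value)
        for slot, pattern in V7_8_0_PATTERNS.items()
        for value in record.get(slot, [])
        if not pattern.match(value)
    ]


if __name__ == "__main__":
    # Hypothetical record, for illustration only.
    sample = {"id": "example", "gold_study_identifiers": ["gold:Gs0000001"]}
    print(invalid_gold_values(sample))
    print(invalid_gold_values(upcase_gold_prefixes(sample)))
```

Running the check before and after the rewrite gives a rough confirmation that only the prefix case was blocking validation for these slots.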

turbomam commented 11 months ago

Fixed and validated. Will be available in GraphDB tomorrow.

aclum commented 11 months ago

thanks @turbomam

turbomam commented 11 months ago

I put the Napa squad's MongoDB contents into a separate GraphDB repository (which is like a Postgres database? or schema?) called napa-graph. You will need to select that from the repository pull-down at the top right of most GraphDB pages.

Here are the currently populated named graphs:

https://api-dev.microbiomedata.org (data)
https://w3id.org/nmdc/nmdc (schema)

If that new name for the data graph is disruptive, then we can temporarily change it. The old data named graph was based on the MongoDB connection string (mongodb://mongo-loadbalancer.nmdc.production.svc.spin.nersc.gov:27017). Some systems and tools I like have been complaining about that so I would prefer to use the API's URL instead.
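
To keep saved queries from depending on either name, the data graph IRI can be treated as a parameter. The sketch below rebuilds the collection query from earlier in this thread that way; the function name `workflow_collections_query` and the use of `string.Template` are illustrative choices, not something already in the repository.

```python
from string import Template

# Named graphs listed above.
SCHEMA_GRAPH = "https://w3id.org/nmdc/nmdc"
DATA_GRAPH = "https://api-dev.microbiomedata.org"

# Same logic as the earlier query, with both graph IRIs left as placeholders.
QUERY_TEMPLATE = Template("""\
PREFIX nmdc: <https://w3id.org/nmdc/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select distinct ?collection_name
where {
    graph <$schema_graph> {
        ?mixin rdfs:domain nmdc:Database .
        ?database_slot rdfs:subPropertyOf ?mixin .
        ?database_slot rdfs:range ?slot_range .
        ?slot_range rdfs:subClassOf nmdc:WorkflowExecutionActivity .
    }
    graph <$data_graph> {
        ?x ?database_slot ?z .
    }
    bind(strafter(str(?database_slot), "https://w3id.org/nmdc/") as ?collection_name)
}
order by ?database_slot
""")


def workflow_collections_query(data_graph=DATA_GRAPH, schema_graph=SCHEMA_GRAPH):
    """Return the SPARQL text with the given named graph IRIs filled in."""
    return QUERY_TEMPLATE.substitute(data_graph=data_graph, schema_graph=schema_graph)


if __name__ == "__main__":
    print(workflow_collections_query())
```

Passing the old connection string as `data_graph` reproduces the earlier form of the query, so a temporary rename would be a one-argument change.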

turbomam commented 11 months ago