microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal

populate GraphDB from https://api-napa.microbiomedata.org/ #1302

Closed: turbomam closed this 2 months ago

turbomam commented 11 months ago

@aclum what collections do you want to include?

turbomam commented 11 months ago

Against what version of the schema are those data valid?

@eecavanna did the dump around September 28th

See Slack thread: https://nmdc-group.slack.com/archives/C05RNEJAV38/p1695864234712159

turbomam commented 11 months ago

https://pypi.org/project/nmdc-schema/8.0.0/ from September 21st?

turbomam commented 11 months ago

try https://github.com/microbiomedata/nmdc-schema/blob/v8.0.0/nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml

Raw

https://raw.githubusercontent.com/microbiomedata/nmdc-schema/v8.0.0/nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml

turbomam commented 11 months ago

✗ Additional properties are not allowed ('sample_mass' was unexpected) in $.extraction_set[0]

https://raw.githubusercontent.com/microbiomedata/nmdc-schema/v7.8.0/nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml ???
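
The error above is what a closed JSON Schema reports when a record carries a slot that the chosen schema version does not declare. Here is a minimal standard-library sketch of that check in Python. The allowed-slot set and the sample record are hypothetical and are not copied from nmdc-schema v7.8.0 or v8.0.0.

```python
def unexpected_slots(record, allowed_slots):
    """Return the slot names in `record` that are not in `allowed_slots`."""
    return sorted(set(record) - set(allowed_slots))


# Hypothetical allowed slots for an extraction-like class;
# the real list would come from the schema version being validated against.
ALLOWED = {"id", "type", "has_input", "has_output"}

# Hypothetical record with one slot the older schema does not know about.
record = {
    "id": "nmdc:example-extraction-1",
    "type": "nmdc:Extraction",
    "sample_mass": {"has_numeric_value": 0.25},
}

for name in unexpected_slots(record, ALLOWED):
    print(f"Additional properties are not allowed ({name!r} was unexpected)")
```

If a check like this flags only slots that were introduced later, that suggests the dump matches a newer schema release than the one being tested.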

aclum commented 11 months ago

study_set, biosample_set, omics_processing, subclasses of WorkflowExecutionActivity, data_object_set, functional_annotation_agg. extraction_set records are not needed, since those only existed after napa-style ids.

turbomam commented 11 months ago

@aclum please confirm and see my notes

Yes, dump

No, dumping not required

turbomam commented 11 months ago

validating against v7.8.0 also fails due to illegal upper-case GOLD: prefixes

aclum commented 11 months ago

I thought I fixed all the gold uppercasing in an ad hoc fashion. Do you have an example record? I can fix it.

turbomam commented 11 months ago

Finishing thought about collections to be migrated:

PREFIX nmdc: <https://w3id.org/nmdc/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX nmdc_data: <mongodb://mongo-loadbalancer.nmdc.production.svc.spin.nersc.gov:27017>
select distinct ?collection_name
where {
    graph nmdc:nmdc {
        ?mixin rdfs:domain nmdc:Database .
        ?database_slot rdfs:subPropertyOf ?mixin .
        ?database_slot rdfs:range ?slot_range .
        ?slot_range rdfs:subClassOf nmdc:WorkflowExecutionActivity .
    }
    graph nmdc_data: {
        ?x ?database_slot ?z .
    }
    bind(strafter(str(?database_slot), "https://w3id.org/nmdc/") as ?collection_name)
}
order by ?database_slot
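
For readers less used to SPARQL, the bind(strafter(...)) step in the query above strips the nmdc namespace from each Database slot IRI, leaving a bare collection name. A rough Python equivalent follows. The helper is hypothetical and the IRI is only an illustrative input, not a result taken from the graph.

```python
NMDC_NS = "https://w3id.org/nmdc/"


def strafter(text, marker):
    """Mimic SPARQL STRAFTER: text after the first `marker`, or "" if absent."""
    if marker == "":
        # SPARQL returns the whole string for an empty marker.
        return text
    _head, sep, tail = text.partition(marker)
    return tail if sep else ""


# Illustrative slot IRI only.
print(strafter("https://w3id.org/nmdc/metagenome_annotation_activity_set", NMDC_NS))
```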

turbomam commented 11 months ago

> I thought I fixed all the gold uppercasing in an ad hoc fashion. Do you have an example record? I can fix it.

regenerating validation log now

turbomam commented 11 months ago

Explicitly excluding these routinely-dumped collections

These are small, so including them wouldn't take much longer or make the output files much larger.

--selected-collections extraction_set \
--selected-collections field_research_site_set \
--selected-collections library_preparation_set \
--selected-collections pooling_set \
--selected-collections processed_sample_set \
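
The repeated flag in the fragment above can be generated from a plain list, which makes it easier to add or drop a collection. This is a minimal Python sketch. The thread does not show the command these flags belong to, so only the arguments are built, and the helper name is hypothetical.

```python
def selected_collection_args(collections):
    """Build one --selected-collections flag per collection name."""
    args = []
    for name in collections:
        args.extend(["--selected-collections", name])
    return args


small_collections = [
    "extraction_set",
    "field_research_site_set",
    "library_preparation_set",
    "pooling_set",
    "processed_sample_set",
]

# Print the flags in the same one-per-line, backslash-continued layout.
args = selected_collection_args(small_collections)
pairs = [" ".join(args[i:i + 2]) for i in range(0, len(args), 2)]
print(" \\\n".join(pairs))
```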

turbomam commented 11 months ago

@aclum yes, you did correct all Genomes OnLine Database prefixes to "gold:", in accordance with the more recent schema releases.

But nmdc-schema v7.8.0 (which otherwise matches the frozen Napa MongoDB) expected the following slots to use the following patterns

I think I can do a quick and dirty fix for this.
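
A quick fix of this kind could look roughly like the Python sketch below, which rewrites only the prefix part of a CURIE to whatever spelling the target schema's patterns require. This is an illustration, not the script that was actually run. The function names, the slot name `example_identifiers`, and the identifier values are all hypothetical, and the target spelling is a parameter because the affected slots and patterns are not listed here.

```python
def recase_prefix(curie, target_prefix):
    """Return `curie` with its prefix respelled as `target_prefix` when it
    matches case-insensitively; other values are returned unchanged."""
    prefix, sep, local = curie.partition(":")
    if sep and prefix.lower() == target_prefix.lower():
        return f"{target_prefix}:{local}"
    return curie


def recase_slot(record, slot, target_prefix):
    """Return a copy of `record` with every value in a multivalued `slot` recased."""
    fixed = dict(record)
    if slot in fixed:
        fixed[slot] = [recase_prefix(v, target_prefix) for v in fixed[slot]]
    return fixed


# Hypothetical record and slot name, for illustration only.
doc = {"id": "nmdc:example-1", "example_identifiers": ["gold:Gb0000001", "img:123"]}
print(recase_slot(doc, "example_identifiers", "GOLD"))
```

Working on a copy keeps the original dump untouched, so the same documents can still be validated against a newer schema release.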

turbomam commented 11 months ago

Fixed and validated. Will be available in GraphDB tomorrow.

aclum commented 11 months ago

thanks @turbomam

turbomam commented 11 months ago

I put the Napa squad's MongoDB contents into a separate GraphDB repository (which is like a Postgres database? or schema?) called napa-graph. You will need to select that from the repository pull-down at the top right of most GraphDB pages.

Here are the currently populated named graphs:

If that new name for the data graph is disruptive, then we can temporarily change it. The old data named graph was based on the MongoDB connection string (mongodb://mongo-loadbalancer.nmdc.production.svc.spin.nersc.gov:27017). Some systems and tools I like have been complaining about that, so I would prefer to use the API's URL instead.

turbomam commented 11 months ago

See also

aclum commented 2 months ago

re-iding is done, we can close this.