turbomam closed this issue 2 months ago
Against what version of the schema are those data valid?
@eecavanna did the dump around September 28th
See Slack thread: https://nmdc-group.slack.com/archives/C05RNEJAV38/p1695864234712159
https://pypi.org/project/nmdc-schema/8.0.0/ from September 21st?
✗ Additional properties are not allowed ('sample_mass' was unexpected) in $.extraction_set[0]
study_set, biosample_set, omics_processing, subclasses of WorkflowExecutionActivity, data_object_set, functional_annotation_agg. extraction_set records are not needed, since those only existed after Napa-style IDs.
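As a rough illustration of restricting a dump to the collections listed above, here is a minimal Python sketch. It assumes the dump has already been loaded as a dict that maps each collection name to a list of records; that layout, the function name `select_collections`, and the sample records are all hypothetical and are not taken from the actual dump tooling.

```python
def select_collections(dump, wanted):
    """Return a new dict holding only the wanted collections.

    Assumption: `dump` maps collection name -> list of records.
    Collections named in `wanted` but absent from `dump` are skipped.
    """
    wanted = set(wanted)
    return {name: records for name, records in dump.items() if name in wanted}


# Made-up example: extraction_set is dropped because it was not requested.
example_dump = {
    "study_set": [{"id": "example-study"}],
    "extraction_set": [{"id": "example-extraction"}],
}
kept = select_collections(example_dump, ["study_set", "biosample_set"])
```

Leaving extraction_set out this way would also sidestep validation errors that only occur in that collection.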
@aclum please confirm and see my notes
validating against v7.8.0 also fails due to illegal upper-case GOLD: prefixes
I thought I fixed all the gold uppercasing in an ad hoc fashion. Do you have an example record? I can fix it.
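PREFIX nmdc: <https://w3id.org/nmdc/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX nmdc_data: <mongodb://mongo-loadbalancer.nmdc.production.svc.spin.nersc.gov:27017>
# List the Database slots whose range is a subclass of
# WorkflowExecutionActivity and that are actually used in the data graph.
select distinct ?collection_name
where {
    # Schema graph: find Database slots that range over workflow activity classes.
    graph nmdc:nmdc {
        ?mixin rdfs:domain nmdc:Database .
        ?database_slot rdfs:subPropertyOf ?mixin .
        ?database_slot rdfs:range ?slot_range .
        ?slot_range rdfs:subClassOf nmdc:WorkflowExecutionActivity .
    }
    # Data graph: keep only slots that appear as a predicate at least once.
    graph nmdc_data: {
        ?x ?database_slot ?z .
    }
    # Strip the nmdc namespace to leave the bare collection name.
    bind(strafter(str(?database_slot), "https://w3id.org/nmdc/") as ?collection_name)
}
order by ?database_slot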
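For readers less familiar with SPARQL, the `strafter` step in the query can be mirrored in a few lines of Python. This is only a sketch of that one string operation: the function name `collection_name` and the `example_set` slot are hypothetical, and the namespace is the `nmdc:` prefix from the query.

```python
NMDC_NAMESPACE = "https://w3id.org/nmdc/"


def collection_name(slot_iri, namespace=NMDC_NAMESPACE):
    """Mimic SPARQL strafter(): text after the first occurrence of namespace.

    Like strafter(), return an empty string when the namespace is absent.
    """
    _, found, rest = slot_iri.partition(namespace)
    return rest if found else ""


# "example_set" is a made-up slot name used purely for illustration.
name = collection_name("https://w3id.org/nmdc/example_set")
```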
regenerating validation log now
These are small, so including them wouldn't take much longer or make the output files much larger.
--selected-collections extraction_set \
--selected-collections field_research_site_set \
--selected-collections library_preparation_set \
--selected-collections pooling_set \
--selected-collections processed_sample_set \
@aclum yes, you did correct all Genomes OnLine Database prefixes to "gold:", in accordance with the more recent schema releases.
But nmdc-schema v7.8.0 (which otherwise matches the frozen Napa MongoDB) expected the following slots to use the following patterns:
gold_biosample_identifiers: '^GOLD:Gb[0-9]+$'
gold_sequencing_project_identifiers: '^GOLD:Gp[0-9]+$'
gold_study_identifiers: '^GOLD:Gs[0-9]+$'
I think I can do a quick and dirty fix for this.
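To make the mismatch concrete, here is a minimal Python sketch that checks values against the three v7.8.0 patterns quoted above and rewrites a lower-case `gold:` prefix to `GOLD:`. It is not the fix that was actually applied. The helper names, the assumption that each slot holds a list of strings, and the sample identifier are all hypothetical.

```python
import re

# Patterns as quoted in the thread for nmdc-schema v7.8.0.
V7_8_0_PATTERNS = {
    "gold_biosample_identifiers": re.compile(r"^GOLD:Gb[0-9]+$"),
    "gold_sequencing_project_identifiers": re.compile(r"^GOLD:Gp[0-9]+$"),
    "gold_study_identifiers": re.compile(r"^GOLD:Gs[0-9]+$"),
}


def uppercase_gold_prefix(value):
    """Rewrite a leading 'gold:' prefix (any case) as 'GOLD:'."""
    if value.lower().startswith("gold:"):
        return "GOLD:" + value[len("gold:"):]
    return value


def fix_record(record):
    """Return a copy of the record with the three GOLD slots upper-cased.

    Assumption: each of these slots, when present, holds a list of strings.
    """
    fixed = dict(record)
    for slot in V7_8_0_PATTERNS:
        if slot in fixed:
            fixed[slot] = [uppercase_gold_prefix(v) for v in fixed[slot]]
    return fixed


def invalid_values(record):
    """List (slot, value) pairs that do not match the v7.8.0 pattern."""
    return [
        (slot, value)
        for slot, pattern in V7_8_0_PATTERNS.items()
        for value in record.get(slot, [])
        if not pattern.match(value)
    ]


# Made-up record with a lower-case prefix, as used by newer schema releases.
example = {"gold_biosample_identifiers": ["gold:Gb0000001"]}
before = invalid_values(example)
after = invalid_values(fix_record(example))
```

Running the check before and after the rewrite shows whether any values would still fail validation against v7.8.0.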
Fixed and validated. Will be available in GraphDB tomorrow.
thanks @turbomam
I put the Napa squad's MongoDB contents into a separate GraphDB repository (which is like a Postgres database? or schema?) called napa-graph. You will need to select it from the repository pull-down at the top right of most GraphDB pages.
Here are the currently populated named graphs:
If that new name for the data graph is disruptive, then we can temporarily change it. The old data named graph was based on the MongoDB connection string (mongodb://mongo-loadbalancer.nmdc.production.svc.spin.nersc.gov:27017). Some systems and tools I like have been complaining about that, so I would prefer to use the API's URL instead.
Re-IDing is done; we can close this.
@aclum what collections do you want to include?