Open turbomam opened 1 year ago
@turbomam I missed this issue! But looks very important. I'll try to get to it this sprint. @ssarrafan
I'll add to the sprint and assign to you @mslarae13
Are some MIxS terms (whose descriptions just got lost in the technical pipeline) and others nmdc-native terms for which we just never wrote descriptions?
I probably created some of them, like CollectingBiosamplesFromSite, field_research_site_set and collecting_biosamples_from_site_set.
doi has been removed from nmdc-schema v7.8.0.
maybe mostly nmdc-native, like https://w3id.org/nmdc/soil_annual_season_temp and https://microbiomedata.github.io/nmdc-schema/proport_woa_temperature/
I guess I should avoid scope creep... but I will say that the LinkML schema has some good hints for basic annotations that most nmdc-schema elements should have. Taking a look at https://linkml.io/linkml-model/docs/CommonMetadata/ might help the brainstorming process.
Class 'CollectingBiosamplesFromSite' does not have recommended slot 'description' (recommended)
Slot 'emsl_project_identifier' does not have recommended slot 'description' (recommended)
...
As of today (i.e. in nmdc-schema v7.8.0), all occurrences of emsl_project_identifier in the repo are commented out. That is shown in these search results.
I recommend re-running lint-linkml to generate a current list.
@turbomam can you help with this?
I see a lot of checkboxes unchecked so will move this to the next sprint. If it should go to the backlog let me know @mslarae13
'habitat' is a GOLD slot. This is filled out for some samples that came in from GOLD. The 'habitat' slot is about the environment. GOLD doesn't appear to provide a definition? @aclum do you know otherwise? Does GOLD/JGI have a term dictionary or something somewhere I can't see?
If not, I recommend http://purl.obolibrary.org/obo/ENVO_01000739 An environmental system which can sustain and allow the growth of an ecological population.
Thoughts/feedback @turbomam @cmungall
'location' is a GOLD slot. This is filled out for some of the samples that came from GOLD.
The problem is that sometimes it matches geo_loc_name, sometimes it describes the location of the sample (Populus endosphere), and sometimes it combines the two (groundwater-surface water interaction zone in Washington, USA).
This slot is all over the place. I don't think we use it on the Data Portal. Do we need it? The lack of clear descriptions / use from GOLD causes people to put in a variety of information, which isn't useful.
See https://github.com/microbiomedata/nmdc-schema/issues/1049
'community' is from GOLD and consistently identifies what kind of genomic community was sequenced. So, microbial community. However, "microbial community" vs "microbial communities" is variable.
Do we fix this? Push back to GOLD? Get more specific? How useful is this for NMDC, where it should always be microbial? Or do we need this for viral?
GOLD description: the community sampled (e.g. microbial/bacterial, viral, archaeal, other)
I have regenerated the lint report. Anyone who has Python 3.9+ and poetry installed can do that by running:
poetry install (or poetry update, if necessary)
make lint
The output is a text file, local/lint.log. It reports violations for a couple of patterns in most schema elements. This issue is about missing descriptions, so we can ignore the others for now.
Here's a link for downloading the whole schema linting report. The "does not have recommended slot 'description' (recommended)" violations appear at the top.
We have the option of changing the rules that we want to enforce.
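For anyone who wants only the missing-description violations out of local/lint.log, here is a minimal Python sketch. It assumes each violation is worded like the two sample lines quoted earlier in this thread; the real log layout may differ. The "Enum" and "Type" prefixes in the pattern are my guess at other element kinds, and missing_descriptions is a hypothetical helper name, not part of lint-linkml.

```python
import re
from collections import defaultdict

# Pattern modeled on the sample violations quoted in this thread.
# ASSUMPTION: the "Enum" and "Type" prefixes are guesses; only "Class" and
# "Slot" were actually seen in the quoted output.
MISSING_DESCRIPTION = re.compile(
    r"(Class|Slot|Enum|Type) '([^']+)' does not have recommended slot 'description'"
)


def missing_descriptions(log_text):
    """Group the names of elements lacking a description by element kind."""
    found = defaultdict(list)
    for kind, name in MISSING_DESCRIPTION.findall(log_text):
        found[kind].append(name)
    return dict(found)


sample = (
    "Class 'CollectingBiosamplesFromSite' does not have recommended slot "
    "'description' (recommended)\n"
    "Slot 'emsl_project_identifier' does not have recommended slot "
    "'description' (recommended)\n"
)
report = missing_descriptions(sample)
```

To run it on a real report, pass the file contents instead of the sample string, e.g. missing_descriptions(open("local/lint.log").read()).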
@mslarae13 if you open an example GOLD biosample, there is help info on some of the fields. Some terms are also defined via a link in their help menu.
from https://gold.jgi.doe.gov/help Habitat | Natural environment of an organism or biosample; the place that is natural for the life and growth of an organism or a general description of the place where a biosample was collected from. E.g. Wetland, Human skin etc.
Reddy said he would email a table of what all the tool tips in GOLD are which we can use to populate the rest of the fields.
@turbomam I think Reddy gave you all the GOLD tables. Can you tell from them which of their fields require a controlled vocabulary for the values?
From tooltip in GOLD Location - sample location, e.g. black sea, etolikon lagoon, healthy adults, etc. Community - the community sampled, e.g. microbial/bacterial, viral, archaeal or other.
Thanks @aclum ! The 'e.g.'s would go in the examples slot, not the description, for these terms as implemented as nmdc-schema slots.
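As a rough illustration of that split, the Python sketch below separates a GOLD tooltip into a description and a list of example values. split_tooltip is a hypothetical helper, and the comma-based parsing is an assumption that only suits tooltips shaped like the two quoted above; anything more irregular would need hand curation.

```python
def split_tooltip(tooltip):
    """Split 'text, e.g. a, b, etc.' into a description and example values.

    ASSUMPTION: examples follow the first 'e.g.' and are comma-separated.
    """
    head, sep, tail = tooltip.partition("e.g.")
    description = head.strip().rstrip(",").strip()
    examples = []
    if sep:
        for part in tail.split(","):
            value = part.strip().rstrip(".").strip()
            # Drop the trailing "etc." rather than treating it as an example.
            if value and value != "etc":
                examples.append({"value": value})
    return {"description": description, "examples": examples}


location = split_tooltip(
    "sample location, e.g. black sea, etolikon lagoon, healthy adults, etc."
)
```

The returned dictionary mirrors how a LinkML slot carries a description plus an examples list whose entries each have a value.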
I think the list got longer!!!
Updated List
Yeah, longer. You/we can filter, prioritize, whatever.
Slot 'funding_sources' is not in the new list? Is this surprising?
Discussed 09/18 : got a description. no update needed
Following up on @mslarae13's insights about location:
We inherit this from GOLD. Is it worth including in our data objects (Biosamples)? This could be an opportunity for slimming down the Biosample class.
If we retain these values, what added value for NMDC does that enable?
We should be linking to the GOLD records... users could peruse those links to see these loosely managed GOLD values.
The point of this issue is that location doesn't have any description in the nmdc-schema. One has to visit several sections of this page to find all GOLD descriptions: https://gold.jgi.doe.gov/help. Those may even have different semantics from the web site tool tips.
PREFIX nmdc: <https://w3id.org/nmdc/>
select distinct ?c ?o where {
?s a ?c ;
nmdc:location ?o .
}
I have done another search for slots that don't have a description and have never been used in MongoDB. I'll attach the results as a CSV because the list is kind of long to include inline, and GitHub doesn't accept TSV attachments ?!
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX linkml: <https://w3id.org/linkml/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
select
distinct ?slot ?l
where {
graph <https://w3id.org/nmdc/nmdc>
{
?slot a linkml:SlotDefinition .
optional {
?slot rdfs:label ?l
} minus {
?slot skos:definition ?d
}
}
minus {
graph <mongodb://mongo-loadbalancer.nmdc.production.svc.spin.nersc.gov:27017>
{
?s ?slot ?o
}
}
}
order by ?slot
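The logic of the query above (slots with no definition, minus slots that appear as predicates in the MongoDB-derived graph) can be sketched in plain Python as below. The slot names and the dictionary shape are hypothetical placeholders, not the actual schema state, and undescribed_unused_slots is an illustrative helper only.

```python
def undescribed_unused_slots(slots, used_predicates):
    """Return sorted slot names that lack a description and are never used.

    slots: mapping of slot name -> metadata dict (or None); hypothetical shape
    used_predicates: set of slot names observed in the data
    """
    return sorted(
        name
        for name, meta in slots.items()
        if not (meta or {}).get("description") and name not in used_predicates
    )


# Hypothetical toy data, for illustration only.
slots = {
    "slot_a": {},                              # no description, but used
    "slot_b": {"description": "has one"},      # described and used
    "slot_c": None,                            # no description, never used
    "slot_d": {"description": "unused"},       # described, never used
}
used = {"slot_a", "slot_b"}
result = undescribed_unused_slots(slots, used)
```

Only slot_c survives both filters, which corresponds to the two minus clauses in the SPARQL.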
Thanks @turbomam ! That's a lot!
This isn't just biosample? What's nmdc:smiles ?!
I just looked,
ChemicalEntity:
smiles:
  description: >-
    A string encoding of a molecular graph, no chiral or isotopic information.
    There are usually a large number of valid SMILES which represent a given structure.
    For example, CCO, OCC and C(O)C all specify the structure of ethanol.
  multivalued: true
  range: string
It does have a description. Not sure why it's on the list. But it's also in core.yaml. .... which I don't think we use? As in GSC core? Which is going away?
It looks like slot_usages aren't making it into the OWL version of the schema, which is what I have been loading into the RDF triplestore database.
I don't know why we have slots whose descriptions come from a slot_usage rather than the global definition of the slot. I'm going to refactor all of that.
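To find the slots affected by this before refactoring, a sketch like the following reports where a slot's description actually lives. It works on a schema already loaded into plain dictionaries (for example from the YAML source); description_sources is a hypothetical helper, and the toy schema below only mimics the smiles case as I understand it from this thread.

```python
def description_sources(schema, slot_name):
    """List the places where slot_name has a description.

    Checks the global definition under 'slots' and every class's 'slot_usage'.
    """
    sources = []
    global_def = (schema.get("slots") or {}).get(slot_name) or {}
    if global_def.get("description"):
        sources.append("slots")
    for class_name, cls in (schema.get("classes") or {}).items():
        usage = ((cls or {}).get("slot_usage") or {}).get(slot_name) or {}
        if usage.get("description"):
            sources.append(f"classes.{class_name}.slot_usage")
    return sources


# ASSUMPTION: toy layout inferred from the discussion, not copied from the repo.
toy_schema = {
    "slots": {"smiles": {"range": "string", "multivalued": True}},
    "classes": {
        "ChemicalEntity": {
            "slot_usage": {
                "smiles": {"description": "A string encoding of a molecular graph"}
            }
        }
    },
}
sources = description_sources(toy_schema, "smiles")
```

A slot whose result does not include "slots" is one whose description would be lost in any export that ignores slot_usage.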
@cmungall has also encouraged me to familiarize myself with the RDF version of the schema, which could also be loaded into the triplestore. I tried earlier today but got an error.
This will continue into next sprint. Progress is being made and we've come up with some good paths for completion.
Still working on this. Non-squad activity that is just taking more time to complete.
Not updated in the last 2 weeks. Removing from sprint and adding backlog label.
closes #661
We can use lint-linkml to find schema elements that are lacking basic textual annotations.
@mslarae13 could you please (when your schedule allows)