microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal

Some schema elements lack a `description` slot #685

Open turbomam opened 1 year ago

turbomam commented 1 year ago

closes

We can use lint-linkml to find schema elements that are lacking basic textual annotations.

@mslarae13 could you please (when your schedule allows)

Class 'CollectingBiosamplesFromSite' does not have recommended slot 'description'  (recommended)
Slot 'emsl_project_identifier' does not have recommended slot 'description'  (recommended)
Slot 'biosample_categories' does not have recommended slot 'description'  (recommended)
Slot 'relevant_protocols' does not have recommended slot 'description'  (recommended)
Slot 'funding_sources' does not have recommended slot 'description'  (recommended)
Slot 'applied_role' does not have recommended slot 'description'  (recommended)
Slot 'applied_roles' does not have recommended slot 'description'  (recommended)
Slot 'applies_to_person' does not have recommended slot 'description'  (recommended)
Slot 'field_research_site_set' does not have recommended slot 'description'  (recommended)
Slot 'collecting_biosamples_from_site_set' does not have recommended slot 'description'  (recommended)
Slot 'doi' does not have recommended slot 'description'  (recommended)
Slot 'habitat' does not have recommended slot 'description'  (recommended)
Slot 'location' does not have recommended slot 'description'  (recommended)
Slot 'community' does not have recommended slot 'description'  (recommended)
Slot 'ncbi_taxonomy_name' does not have recommended slot 'description'  (recommended)
Slot 'ncbi_project_name' does not have recommended slot 'description'  (recommended)
Slot 'sample_collection_site' does not have recommended slot 'description'  (recommended)
Slot 'sample_collection_year' does not have recommended slot 'description'  (recommended)
Slot 'sample_collection_month' does not have recommended slot 'description'  (recommended)
Slot 'sample_collection_day' does not have recommended slot 'description'  (recommended)
Slot 'sample_collection_hour' does not have recommended slot 'description'  (recommended)
Slot 'sample_collection_minute' does not have recommended slot 'description'  (recommended)
Slot 'soluble_iron_micromol' does not have recommended slot 'description'  (recommended)
Slot 'host_name' does not have recommended slot 'description'  (recommended)
Slot 'subsurface_depth' does not have recommended slot 'description'  (recommended)
Slot 'proport_woa_temperature' does not have recommended slot 'description'  (recommended)
Slot 'biogas_temperature' does not have recommended slot 'description'  (recommended)
Slot 'soil_annual_season_temp' does not have recommended slot 'description'  (recommended)
Slot 'biogas_retention_time' does not have recommended slot 'description'  (recommended)
Slot 'completion_date' does not have recommended slot 'description'  (recommended)
Enum 'file type enum' does not have recommended slot 'description'  (recommended)
Enum 'credit enum' does not have recommended slot 'description'  (recommended)
Enum 'processing_institution_enum' does not have recommended slot 'description'  (recommended)
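
A list like the one above can be tallied by element kind before deciding who writes which descriptions. The following is only a sketch: it assumes the lint messages keep exactly the wording shown above, and the function name `group_missing_descriptions` is made up for illustration.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   Slot 'habitat' does not have recommended slot 'description'  (recommended)
# The pattern assumes the message wording shown in this issue.
MISSING_DESCRIPTION = re.compile(
    r"^(?P<kind>Class|Slot|Enum) '(?P<name>[^']+)' "
    r"does not have recommended slot 'description'"
)


def group_missing_descriptions(lines):
    """Group element names by kind for every missing-description message."""
    grouped = defaultdict(list)
    for line in lines:
        match = MISSING_DESCRIPTION.match(line.strip())
        if match:
            grouped[match.group("kind")].append(match.group("name"))
    return dict(grouped)


sample = [
    "Class 'CollectingBiosamplesFromSite' does not have recommended slot 'description'  (recommended)",
    "Slot 'habitat' does not have recommended slot 'description'  (recommended)",
    "Slot 'location' does not have recommended slot 'description'  (recommended)",
    "Enum 'credit enum' does not have recommended slot 'description'  (recommended)",
]
report = group_missing_descriptions(sample)
for kind, names in report.items():
    print(kind, len(names), names)
```

Lines that report other kinds of violations do not match the pattern and are skipped.
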
mslarae13 commented 10 months ago

@turbomam I missed this issue! But looks very important. I'll try to get to it this sprint. @ssarrafan

ssarrafan commented 10 months ago

> @turbomam I missed this issue! But looks very important. I'll try to get to it this sprint. @ssarrafan

I'll add to the sprint and assign to you @mslarae13

turbomam commented 10 months ago

Are some of these MIxS terms (whose descriptions just got lost in the technical pipeline) and others nmdc-native terms for which we just never wrote descriptions?

turbomam commented 10 months ago

I probably created some of them like CollectingBiosamplesFromSite, field_research_site_set and collecting_biosamples_from_site_set

doi has been removed from nmdc-schema v7.8.0

turbomam commented 10 months ago

maybe mostly nmdc-native, like https://w3id.org/nmdc/soil_annual_season_temp and https://microbiomedata.github.io/nmdc-schema/proport_woa_temperature/

turbomam commented 10 months ago

I guess I should avoid scope creep... but I will say that the LinkML schema has some good hints for basic annotations that most nmdc-schema elements should have. Taking a look at https://linkml.io/linkml-model/docs/CommonMetadata/ might help the brainstorming process.

eecavanna commented 10 months ago
> Class 'CollectingBiosamplesFromSite' does not have recommended slot 'description'  (recommended)
> Slot 'emsl_project_identifier' does not have recommended slot 'description'  (recommended)
> ...

As of today (i.e. in nmdc-schema v7.8.0), all occurrences of emsl_project_identifier in the repo are commented out. That is shown in these search results.

I recommend re-running lint-linkml to generate a current list.

mslarae13 commented 10 months ago

> I recommend re-running lint-linkml to generate a current list.

@turbomam can you help with this?

mslarae13 commented 10 months ago
ssarrafan commented 10 months ago

I see a lot of checkboxes unchecked so will move this to the next sprint. If it should go to the backlog let me know @mslarae13

mslarae13 commented 10 months ago

'habitat' is a GOLD slot. It is filled out for some samples that came in from GOLD. The 'habitat' slot is about the environment. GOLD doesn't appear to provide a definition? @aclum do you know otherwise? Does GOLD/JGI have a term dictionary or something somewhere I can't see?

If not, I recommend the definition at http://purl.obolibrary.org/obo/ENVO_01000739: "An environmental system which can sustain and allow the growth of an ecological population."

Thoughts/feedback @turbomam @cmungall

mslarae13 commented 10 months ago

'location' is a GOLD slot. It is filled out for some of the samples that came from GOLD.

The problem is that sometimes it matches geo_loc_name, sometimes it describes the location of the sample (Populus endosphere), and sometimes it combines the two (groundwater-surface water interaction zone in Washington, USA).

This slot is all over the place. I don't think we use it on the Data Portal. Do we need it? The lack of clear descriptions / use from GOLD causes people to put in a variety of information, which isn't useful.

See https://github.com/microbiomedata/nmdc-schema/issues/1049

mslarae13 commented 10 months ago

'community' is from GOLD and consistently identifies what kind of genomic community was sequenced. So, microbial community. However, whether the value is "microbial community" or "microbial communities" varies.

Do we fix this? Push back to GOLD? Get more specific? How useful is this for NMDC, where it should always be microbial? Or do we need this for viral?

Also, the GOLD description is: the community sampled (eg: microbial/bacterial, viral, archaeal, other)

turbomam commented 10 months ago

I have regenerated the lint report. Anyone who has Python 3.9+ and poetry installed can do that by

The output is a text file, local/lint.log. It reports violations for a couple of patterns in most schema elements. This issue is about missing descriptions, so we can ignore the others for now.

Here's a link for downloading the whole schema linting report. The "does not have recommended slot 'description' (recommended)" violations appear at the top.

We have the option of changing the rules that we want to enforce.

aclum commented 10 months ago

@mslarae13 if you open an example GOLD biosample, there is help info on some of the fields. Some terms are also defined as a link in their help menu.

From https://gold.jgi.doe.gov/help:
Habitat | Natural environment of an organism or biosample; the place that is natural for the life and growth of an organism or a general description of the place where a biosample was collected from. E.g. Wetland, Human skin etc.

Reddy said he would email a table of all the tooltips in GOLD, which we can use to populate the rest of the fields.

@turbomam I think Reddy gave you all the GOLD tables. Can you tell from them which of their fields require a controlled vocabulary for the values?

aclum commented 10 months ago

From the tooltips in GOLD:
Location - sample location, e.g. black sea, etolikon lagoon, healthy adults, etc.
Community - the community sampled, e.g. microbial/bacterial, viral, archaeal or other.

turbomam commented 10 months ago

Thanks @aclum! For these terms, as implemented as nmdc-schema slots, the 'e.g.'s would go in the examples slot, not the description.
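
As a rough illustration of that split, a GOLD tooltip can be separated into a description and a list of examples. This is only a sketch: `split_tooltip` is a hypothetical helper, it assumes the tooltip marks examples with a literal "e.g." and separates them with commas, and the resulting dictionary only imitates the LinkML `description` and `examples` (each with a `value`) slots.

```python
def split_tooltip(tooltip):
    """Split 'definition, e.g. a, b, etc.' into (description, examples)."""
    description, marker, tail = tooltip.partition("e.g.")
    description = description.strip().rstrip(",").strip()
    if not marker:
        # No "e.g." found: the whole tooltip is the description.
        return description, []
    examples = []
    for item in tail.split(","):
        item = item.strip().rstrip(".").strip()
        # Drop empty fragments and a trailing "etc."
        if item and item.lower() != "etc":
            examples.append(item)
    return description, examples


description, examples = split_tooltip(
    "sample location, e.g. black sea, etolikon lagoon, healthy adults, etc."
)
# Shape the result like the annotations on a LinkML slot definition.
slot_annotations = {
    "description": description,
    "examples": [{"value": example} for example in examples],
}
print(slot_annotations)
```

A phrase such as "archaeal or other" stays a single example, because only commas are treated as separators.
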

mslarae13 commented 10 months ago

> Here's a link for downloading the whole schema linting report. The "does not have recommended slot 'description' (recommended)" violations appear at the top.

I think the list got longer!!!

mslarae13 commented 10 months ago

Updated List

turbomam commented 10 months ago

Yeah, longer. You/we can filter, prioritize, whatever.

mslarae13 commented 10 months ago

Slot 'funding_sources' is not in the new list? Is this surprising?

Discussed 09/18: it has a description, so no update is needed.

turbomam commented 10 months ago

Following up on @mslarae13's insights about location:

We inherit this from GOLD. Is it worth including in our data objects? (Biosamples) This could be an opportunity for slimming down the Biosample class

If we retain these values, what added value for NMDC does that enable?

We should be linking to the GOLD records... users could peruse those links to see these loosely managed GOLD values.

The point of this issue is that location doesn't have any description in the nmdc-schema. One has to visit several sections of this page to find all GOLD descriptions: https://gold.jgi.doe.gov/help. That page may even have different semantics from the web site tooltips.

PREFIX nmdc: <https://w3id.org/nmdc/>
select distinct ?c ?o where { 
    ?s a ?c ;
    nmdc:location ?o .
}
turbomam commented 10 months ago

I have done another search for slots that don't have a description and have never been used in MongoDB. I'll attach the results as a CSV because the list is kind of long to include inline, and GitHub doesn't accept TSV attachments ?!

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX linkml: <https://w3id.org/linkml/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
select
distinct ?slot ?l
where {
    graph <https://w3id.org/nmdc/nmdc>
    {
        ?slot a linkml:SlotDefinition .
        optional {
            ?slot rdfs:label ?l
        }
        minus {
            ?slot skos:definition ?d
        }
    }
    minus {
        graph <mongodb://mongo-loadbalancer.nmdc.production.svc.spin.nersc.gov:27017>
        {
            ?s ?slot ?o
        }
    }
}
order by ?slot

undesc-ununsed-slots.csv
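
The same "no description and never used" logic can be sketched outside the triplestore as a filter plus a set difference. This is not the query above, only an approximation of its intent: the function name, the in-memory `schema_slots` mapping, and the `used_slots` set are all illustrative assumptions, and the sample data is made up.

```python
def undescribed_unused_slots(schema_slots, used_slots):
    """Return sorted names of slots that lack a description and never appear in the data."""
    return sorted(
        name
        for name, definition in schema_slots.items()
        if not (definition or {}).get("description") and name not in used_slots
    )


# Illustrative data only: slot name -> slot definition, as if read from schema YAML.
schema_slots = {
    "habitat": {},                              # no description, but used in data
    "location": {"description": ""},            # empty description, never used
    "smiles": None,                             # bare slot with no body, never used
    "doi": {"description": "placeholder text"}, # described
}
used_slots = {"habitat", "doi"}  # predicates observed in the data (made up here)

candidates = undescribed_unused_slots(schema_slots, used_slots)
print(candidates)
```
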

mslarae13 commented 10 months ago

Thanks @turbomam ! That's a lot!

This isn't just biosample? What's nmdc:smiles ?!

mslarae13 commented 10 months ago

I just looked,

ChemicalEntity:
  smiles:
    description: >-
      A string encoding of a molecular graph, no chiral or isotopic information. There are usually a large number of valid SMILES which represent a given structure. For example, CCO, OCC and C(O)C all specify the structure of ethanol.
    multivalued: true
    range: string

It does have a description. Not sure why it's on the list. But it's also in core.yaml... which I don't think we use? As in GSC core? Which is going away?

turbomam commented 10 months ago

It looks like slot_usages aren't making it into the OWL version of the schema, which is what I have been loading into the RDF triplestore database.

I don't know why we have slots whose descriptions come from a slot_usage rather than the global definition of the slot. I'm going to refactor all of that.
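
One way to find the slots affected by this is to compare each class's slot_usage against the global slot definitions. The sketch below assumes the schema has already been loaded into a plain dictionary with top-level `slots` and `classes` keys, as in LinkML YAML; the function name and the sample schema content are hypothetical.

```python
def descriptions_only_in_slot_usage(schema):
    """Find (class, slot) pairs described in slot_usage but not in the global slot definition."""
    global_slots = schema.get("slots") or {}
    findings = []
    for class_name, class_def in (schema.get("classes") or {}).items():
        slot_usage = (class_def or {}).get("slot_usage") or {}
        for slot_name, usage in slot_usage.items():
            has_local = bool((usage or {}).get("description"))
            has_global = bool((global_slots.get(slot_name) or {}).get("description"))
            if has_local and not has_global:
                findings.append((class_name, slot_name))
    return sorted(findings)


# Made-up schema fragment for illustration.
schema = {
    "slots": {
        "smiles": {"range": "string"},
        "name": {"description": "A human-readable name."},
    },
    "classes": {
        "ChemicalEntity": {
            "slot_usage": {
                "smiles": {
                    "description": "A string encoding of a molecular graph.",
                    "multivalued": True,
                },
                "name": {"description": "The name of the chemical."},
            }
        }
    },
}
findings = descriptions_only_in_slot_usage(schema)
print(findings)
```

Each reported pair is a candidate for moving the description up to the global slot definition.
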

@cmungall has also encouraged me to familiarize myself with the RDF version of the schema, which could also be loaded into the triplestore. I tried earlier today but got an error.

mslarae13 commented 10 months ago

This will continue into next sprint. Progress is being made and we've come up with some good paths for completion.

mslarae13 commented 9 months ago

Still working on this. Non-squad activity that is just taking more time to complete.

ssarrafan commented 9 months ago

Not updated in the last 2 weeks. Removing from sprint and adding backlog label.