microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal

conflicting gold paths in biosample and study #10

dwinston closed 2 years ago

dwinston commented 3 years ago

Currently, both biosample and study include slots for the five GOLD path fields (ecosystem, ecosystem_category, ecosystem_type, ecosystem_subtype, specific_ecosystem).

This is an issue because it's not clear if all biosamples must have the same values as the study, if the biosample values "override" the "default" study values, etc.

It would be better to (a) have them exclusively either in study or in biosample, or (b) to name them differently so as to clarify, e.g. prefix the slots in study with "default", "assumed", etc.

@cmungall @dehays

dehays commented 3 years ago

I think they should be removed from study. @wdduncan - I am assuming that GOLD associates them with both, and that is why we have them on both NMDC study and biosample. With our current (very small) set of studies I believe the values probably correspond, but as soon as a study has multiple sampling events or sites this may no longer be true. I am not seeing the value of keeping them on study. (Studies have samples collected from environments.)

dehays commented 3 years ago

@cmungall @wdduncan If you agree, can we make this change to the study slots?

wdduncan commented 3 years ago

I agree, but perhaps we should check with Reddy and see if the GOLD paths at the study level are important?

wdduncan commented 3 years ago

Adding @tbkreddy to the conversation.

TBKReddy commented 3 years ago

Ecosystem classification paths on Study and Biosample don't present any conflict as such.

> This is an issue because it's not clear if all biosamples must have the same values as the study,

No. Studies with a broad scope may have the least common denominator, but individual Biosamples will have more specific classifications assigned to them.

> if the biosample values "override" the "default" study values, etc.

No, they don't. If there is a big study with samples from a variety of environments, the study classification may stop at a level that broadly represents all samples.

For example, this study https://gold.jgi.doe.gov/study?id=Gs0047590 shows host associated -> microbial, the top two levels. If you look at the Biosamples under this study, they list individual, more specific classification terms.
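The relationship described above (a study path that stops early, with biosample paths extending it) can be sketched as a prefix check. This is only an illustration, not part of nmdc-schema or GOLD: the helper names `gold_path` and `is_refinement` are hypothetical, records are assumed to be plain dicts keyed by the five slot names, and the `ecosystem_type` value is a made-up placeholder.

```python
# Hypothetical sketch: a biosample "refines" a study if the study's GOLD
# path is a prefix of the biosample's GOLD path.
GOLD_LEVELS = (
    "ecosystem",
    "ecosystem_category",
    "ecosystem_type",
    "ecosystem_subtype",
    "specific_ecosystem",
)


def gold_path(record):
    """Return the GOLD path as a list, stopping at the first missing level."""
    path = []
    for level in GOLD_LEVELS:
        value = record.get(level)
        if not value:
            break
        path.append(value)
    return path


def is_refinement(study, biosample):
    """True if the biosample path starts with the whole study path."""
    study_path = gold_path(study)
    sample_path = gold_path(biosample)
    return sample_path[: len(study_path)] == study_path


# Study classified only at the top two levels, as in the example above.
study = {"ecosystem": "Host-associated", "ecosystem_category": "Microbial"}
# The third-level value is an illustrative placeholder, not a real GOLD term.
biosample = {
    "ecosystem": "Host-associated",
    "ecosystem_category": "Microbial",
    "ecosystem_type": "placeholder-type",
}
print(is_refinement(study, biosample))
```

Under this reading, a biosample whose first levels differ from the study's would be flagged as inconsistent rather than treated as an override.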

> It would be better to (a) have them exclusively either in study or in biosample, or (b) to name them differently so as to clarify, e.g. prefix the slots in study with "default", "assumed", etc.

You need to have these specifically on each Biosample to have them annotated. You can consider not having these at all on the study and instead rely on the values from the underlying Biosamples to come up with a common denominator or a set of classification terms, and use those either to display on the study or to provide a search mechanism. If you prefer, you can definitely go without having these defined on the Study.
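One way to read the "common denominator" idea is as the longest shared prefix of the biosample paths. The sketch below assumes that interpretation; `common_gold_path` and `gold_path` are hypothetical helpers, records are assumed to be plain dicts, and the third-level values are placeholders rather than real GOLD terms.

```python
# Hypothetical sketch: derive a study-level GOLD path as the longest prefix
# shared by all of the study's biosamples.
GOLD_LEVELS = (
    "ecosystem",
    "ecosystem_category",
    "ecosystem_type",
    "ecosystem_subtype",
    "specific_ecosystem",
)


def gold_path(record):
    """Return the GOLD path as a list, stopping at the first missing level."""
    path = []
    for level in GOLD_LEVELS:
        value = record.get(level)
        if not value:
            break
        path.append(value)
    return path


def common_gold_path(biosamples):
    """Longest GOLD path prefix on which every biosample agrees."""
    paths = [gold_path(sample) for sample in biosamples]
    if not paths:
        return []
    common = []
    # zip stops at the shortest path, so the result never exceeds it.
    for values in zip(*paths):
        if len(set(values)) != 1:
            break
        common.append(values[0])
    return common


samples = [
    {"ecosystem": "Host-associated", "ecosystem_category": "Microbial",
     "ecosystem_type": "placeholder-type-A"},
    {"ecosystem": "Host-associated", "ecosystem_category": "Microbial",
     "ecosystem_type": "placeholder-type-B"},
]
print(common_gold_path(samples))
```

The alternative mentioned above, a set of classification terms rather than a single path, would instead collect the distinct paths and leave the choice of display to the consumer.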

cmungall commented 3 years ago

I can't speak to GOLD, but I see no problem with capturing environment at the level of both study and sample, with the values being consistent but different.

The sample environments should be consistent with the study environments (formally: the ENVO terms should be subsumed in the isa-partof graph).
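A minimal sketch of that subsumption check, assuming the is_a/part_of edges are available as a child-to-parents mapping. The graph below is a toy with made-up labels, not real ENVO identifiers, and `ancestors` and `is_subsumed` are hypothetical helper names.

```python
# Hypothetical sketch: a sample term is consistent with a study term if the
# study term is reachable from it by following is_a / part_of edges upward.
def ancestors(term, parents):
    """Return the term plus everything reachable via parent edges."""
    seen = {term}
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_subsumed(sample_term, study_term, parents):
    """True if study_term equals or is an ancestor of sample_term."""
    return study_term in ancestors(sample_term, parents)


# Toy child -> parents mapping; labels are placeholders, not ENVO terms.
PARENTS = {
    "forest_soil": ["soil"],
    "soil": ["environmental_material"],
}

print(is_subsumed("forest_soil", "soil", PARENTS))
print(is_subsumed("soil", "forest_soil", PARENTS))
```

In practice the parent mapping would come from the ontology itself; how it is loaded is outside the scope of this sketch.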

There is a lot of value in capturing the study environment before samples are collected. We should also capture which MIxS variables are being studied. See:

https://docs.google.com/document/d/1fKv4XJ1HTm3XoS_E2NnKGCiZOpJwZnZwRcxUnNNYfiQ/edit

wdduncan commented 3 years ago

We have decided to keep them.

turbomam commented 2 years ago

@dwinston I share your concern about unconstrained relationships between the biosample and study GOLD path elements, but will defer to the guidance from @TBKReddy and @cmungall