microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal
additional optional attributes to be added to study #51

Closed dehays closed 3 years ago

dehays commented 3 years ago

part of decomposition of #41

To be added to the study entity as optional attributes. These are not currently available from GOLD studies and therefore cannot presently be populated by the GOLD -> NMDC ETL.

wdduncan commented 3 years ago

@dehays every study object has a name attribute. Is proposal name different than the study's name?

wdduncan commented 3 years ago

@dehays study already has a doi attribute:

study:
    is_a: named thing
    in_subset: 
      - sample subset
    aliases: ['proposal', 'research proposal', 'research study', 'investigation']
    description: >-
      A study summarizes the overall goal of a research initiative and outlines the key objective of its underlying projects.  
    slots:
      - ecosystem
      - ecosystem_category
      - ecosystem_type
      - ecosystem_subtype
      - specific_ecosystem
      - principal investigator name
      - doi

But, the range of doi is an attribute value; e.g.:

"doi":  { "has_raw_value": "10.25585/1487764" }

Do we need doi to be a list (i.e., multi-valued), or will a study only have one doi?
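
To make the two options concrete, here is a minimal Python sketch of how a consumer could read doi whether it stays single-valued or becomes a list. The helper name doi_values is hypothetical and not part of nmdc-schema. The record shape follows the has_raw_value example above.

```python
from typing import Any, Dict, List


def doi_values(study: Dict[str, Any]) -> List[str]:
    """Return the raw DOI strings from a study record.

    Accepts either a single attribute value ({"has_raw_value": ...})
    or a list of them, so callers do not depend on which shape is chosen.
    """
    doi = study.get("doi")
    if doi is None:
        return []  # doi is an optional slot
    items = doi if isinstance(doi, list) else [doi]
    return [item["has_raw_value"] for item in items]


# Single-valued shape, as in the example above.
single = {"doi": {"has_raw_value": "10.25585/1487764"}}

# Hypothetical multi-valued shape, if doi were made a list.
multi = {"doi": [{"has_raw_value": "10.25585/1487764"}]}

print(doi_values(single))
print(doi_values(multi))
```

Both calls return the same list, so code written this way would keep working if the slot later changes from single-valued to multi-valued.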

dehays commented 3 years ago

@wdduncan On GOLD study name vs proposal name - I want to touch base with Emiley and Jeff about what we do here. What I see is that the UI is using the proposal name: Stegen Study - notice the name displayed, which is not something you are getting from GOLD. (The name in the study entity is set to the GOLD study name, for example: "Groundwater microbial communities from the Columbia River, Washington, USA".) I think we are going to need to add another name - not sure proposal name should be the key, perhaps display_name. But since it is not coming from GOLD, it would need to be optional for the ETL to produce valid output.

I think a study would have a single DOI and as you point out, there is a slot for that. Publication DOIs would need to be a list.

wdduncan commented 3 years ago

Thanks @dehays
I'm a bit partial to naming the attribute display name. It seems to fit the purpose you describe.

What do you think @cmungall ?

cmungall commented 3 years ago

GOLD:

https://gold.jgi.doe.gov/study?id=Gs0114663

MIxS:

MIxS standardizes fields for samples not studies, but it doesn't follow normal form and does have repeated investigation variables such as project_name. project_name is underspecified, and if we look at existing values in INSDC they are all over the place. It ranges from "16S" to proper titles.

NCBI BioProject:

The study is broken into multiple projects, one per sample, with identical metadata:

https://www.ncbi.nlm.nih.gov//bioproject/PRJNA367315 ... https://www.ncbi.nlm.nih.gov//bioproject/PRJNA367318

Groundwater microbial communities from the Columbia River, Washington, USA - GW-RW S3_40_50 metagenome

Coupling Microbial Communities to Carbon and Contaminant Biogeochemistry in the Groundwater-Surface Water Interaction Zone

Relevance: Environmental

(Remediation and Carbon cycle seem to have been dropped)

NMDC:

https://data.microbiomedata.org/details/study/gold:Gs0114663

Coupling Microbial Communities to Carbon and Contaminant Biogeochemistry in the Groundwater-Surface Water Interaction Zone

Description: A metagenomic study to couple microbial communities to carbon and contaminant biogeochemistry in the groundwater-surface water interaction zone

Scientific objective: To understand and predict the effects of variable groundwater-surface water mixing on microbial communities and, in turn, biogeochemical rates under the Subsurface Biogeochemical Research-Science Focus Area (SBR-SFA).

Not sure where objective is coming from, it's not in the schema or the json? Hardcoded in the UI?

My proposal:

TBD: should we have separate fields for doi, ark, etc, or just a generic citation field that takes a CURIE or PURL? My pref is the latter
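
As a rough illustration of the generic-field option, the sketch below resolves a value that is either a PURL or a CURIE to a URL. The function name resolve_citation and the PREFIXES map are hypothetical. A real implementation would presumably take its expansions from the schema's prefix declarations, and only the doi prefix is filled in here.

```python
import re

# Hypothetical prefix map (assumption): only doi is listed; other
# identifier schemes such as ark would need their own entries.
PREFIXES = {
    "doi": "https://doi.org/",
}

# A CURIE is prefix:local_id, with no whitespace in the local part.
CURIE_RE = re.compile(r"^([A-Za-z][\w.-]*):(\S+)$")


def resolve_citation(value: str) -> str:
    """Return a resolvable URL for a citation given as a PURL or a CURIE."""
    # PURLs (plain http/https URLs) pass through unchanged.
    if value.startswith(("http://", "https://")):
        return value
    match = CURIE_RE.match(value)
    if match:
        prefix, local_id = match.group(1).lower(), match.group(2)
        if prefix in PREFIXES:
            return PREFIXES[prefix] + local_id
    raise ValueError(f"not a PURL or a CURIE with a known prefix: {value!r}")


print(resolve_citation("doi:10.25585/1487764"))
```

With one field, the identifier type is carried by the prefix, so supporting a new scheme means adding a prefix rather than adding a slot.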

wdduncan commented 3 years ago

I agree with @cmungall that title addresses the need to distinguish between what we get out of GOLD and what the study is called from the perspective of a funding agency.

A related issue is the long names we get out of GOLD for biosamples. I'm not so sure that title is the appropriate slot for shortening the GOLD biosample names, although we can use it for such a purpose. It might be better to have a display name slot, or perhaps we make use of alternative description.

We can also make use of an other_names slot, although I prefer to call it alternative names (this seems more consistent with alternative identifier and alternative description).

If we follow the suggestion of having a generic citation field to cover doi, ark, etc., what do we call it? citations? Also, do we need to create a citation object so that we can specify whether it is a doi, etc.? We can do this, but I am unsure whether the benefit is worth the effort.

wdduncan commented 3 years ago

@dehays I've added the following slots to nmdc:study:

I think/hope this should cover Kitware's needs for displaying the name of a study.

dehays commented 3 years ago

That would seem to cover 'proposal name' from the list I'd started with. Also checked off DOI, as studies already have a slot for that.

Not your responsibility - but after populating those title and alternative title fields - will need to have Kitware understand which to display.
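
One possible display rule, sketched in Python under the assumption that a study record is a plain dict with optional title and name keys. The function display_label and the fallback order are hypothetical and are not something Kitware has agreed to.

```python
from typing import Any, Dict


def display_label(study: Dict[str, Any]) -> str:
    """Pick the text to show for a study.

    Prefers the optional title, then falls back to the GOLD-derived
    name, then to the id, so ETL output without a title still displays.
    """
    for key in ("title", "name", "id"):
        value = study.get(key)
        if value:
            return value
    return ""


# Record as produced by the GOLD ETL alone: only the GOLD study name.
from_gold = {
    "id": "gold:Gs0114663",
    "name": "Groundwater microbial communities from the Columbia River, Washington, USA",
}

# Same record after the optional title has been populated.
with_title = dict(
    from_gold,
    title="Coupling Microbial Communities to Carbon and Contaminant "
    "Biogeochemistry in the Groundwater-Surface Water Interaction Zone",
)

print(display_label(from_gold))
print(display_label(with_title))
```

Because title stays optional, the fallback keeps records that come straight from GOLD displayable.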

wdduncan commented 3 years ago

Update after metadata call: There is a need to associate studies with websites. Should I add a websites slot, or make use of the existing url slot in core.yaml:

url:
    is_a: attribute
    range: string

update: I'll make a websites slot (for now).

wdduncan commented 3 years ago

The publication DOIs slot is simply named publications.