legumeinfo / datastore-specifications

Specifications for directory naming, file naming, file contents in the LIS datastore

RFO: "gene_functions" collection #40

Open StevenCannon-USDA opened 1 year ago

StevenCannon-USDA commented 1 year ago

I propose formats and methods for collecting and storing information about genes experimentally associated with phenotypes. See the description in the README and examples of the three file types in this datastore-specifications directory.

You can also see a few more examples, and the two associated scripts, in this repository (which will go away once the RFO is settled).

A few comments about my objectives and philosophy behind the specification:

sammyjava commented 1 year ago

The references block contains one or more blocks of citations, each containing three key-value pairs: "citation", "doi", and "pmid". Of these, either the pmid or doi is required (some publications lack a pmid, but all should have a doi). The citation should be in one of the following forms (depending on whether there are one, two, or three-or-more authors):

Let's make DOI required, since it is in the other READMEs and I use DOI to fill out the Publication object. PMID must be optional, of course. There are some older papers that don't have DOIs, and I say let's not cite them.

This is because folks forget to put the DOI in. If it's optional, then it doesn't fail validation.

StevenCannon-USDA commented 1 year ago

The journal I come across frequently that lacks PMID is Crop Science. But I'm fine with requiring DOI and making PMID optional.
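The rule agreed on here (DOI required, PMID optional) could be checked along the following lines. This is only a minimal sketch, not the datastore's actual validator: the function name `check_references` and the message wording are hypothetical, treating "citation" as required is an assumption, and the references block is assumed to have already been parsed from YAML into plain Python dicts. The second sample entry has its DOI deliberately left out to show the failure case.

```python
def check_references(references):
    """Return a list of problems found in a parsed 'references' block.

    Assumes `references` is a list of dicts with the keys
    'citation', 'doi', and (optionally) 'pmid'.
    """
    problems = []
    for i, ref in enumerate(references):
        # Assumption: citation and doi are both required; pmid is optional.
        for key in ("citation", "doi"):
            value = ref.get(key)
            if value is None or not str(value).strip():
                problems.append(f"reference {i}: missing required key '{key}'")
        # If a pmid is present, expect it to be numeric.
        pmid = ref.get("pmid")
        if pmid is not None and not str(pmid).isdigit():
            problems.append(f"reference {i}: pmid should be numeric, got {pmid!r}")
    return problems


refs = [
    {"citation": "Chen, Liu, et al., 2015",
     "doi": "10.3389/fpls.2015.00575", "pmid": 26284091},
    # DOI deliberately omitted to show what a forgotten DOI looks like.
    {"citation": "Herrbach, Chirinos, et al., 2017", "pmid": 28073951},
]
for problem in check_references(refs):
    print(problem)
```

With DOI required, the forgotten DOI in the second entry is reported instead of passing silently, which is the point of the change.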

StevenCannon-USDA commented 1 year ago

I'd like to add an optional key, "phenotype_description", to hold a free-text brief description of the phenotype described by the gene_function record. Examples:

phenotype_description: fragrant seeds
phenotype_description: Red-brown seed coat color
phenotype_description: Early flowering
phenotype_description: photoperiod insensitivity to short day conditions

sammyjava commented 1 year ago

So those are in addition to, but not linked to in any way, the ontology terms. I'd argue that any specific "phenotype description" should be associated with an ontology term, such as:

  - entity_name: flowering time
    entity: TO:0002616
    phenotype_description: Early flowering
  - entity_name: days to maturity
    entity: TO:0000469
    phenotype_description: Days from planting to 10 inch seedling height
  - entity_name: seed coat color
    entity: TO:0000190
    phenotype_description: Red-brown seed coat color

Otherwise, they're just orphaned text attributes that don't link to anything higher up.

(And, reminder, the spec needs to be updated to put relations with the entities that they refer to. Order doesn't have meaning in YAML.)

StevenCannon-USDA commented 1 year ago

I have in mind a single "phenotype_description" key-value pair per record, to hold the human-readable gestalt description. These may sometimes be fairly complex, whereas the ontology terms are "pointillistic" and often difficult to select appropriately. The phenotype_description would, indeed, be orphaned relative to the atomic ontology terms. Here are some examples from some work-in-progress:

phenotype_description: Small and nonfunctional nodules arrested in growth when both normally spliced and alternatively spliced variants are repressed. When only the alternatively spliced form is repressed, the nodules are small but still fix nitrogen successfully.
traits:
  - entity_name: root nodule morphology trait
    entity: TO:0000898
  - entity_name: root nodule
    entity: PO:0003023
references:
  - citation: Chen, Liu, et al., 2015
    doi: 10.3389/fpls.2015.00575
    pmid: 26284091
  - citation: Oellrich, Walls et al., 2015
    doi: 10.1186/s13007-015-0053-y
    pmid: 25774204

phenotype_description: Doesn't make nodules; infection thread aborts
traits:
  - entity_name: root nodule number
    entity: TO:0000900
  - entity_name: root system
    entity: PO:0025025
  - entity_name: root nodule
    entity: PO:0003023
references:
  - citation: Herrbach, Chirinos, et al., 2017
    doi: 10.1093/jxb/erw474
    pmid: 28073951
  - citation: Oellrich, Walls et al., 2015
    doi: 10.1186/s13007-015-0053-y
    pmid: 25774204

sammyjava commented 1 year ago

Ahh, OK, so a single YAML has a single phenotype_description which is therefore associated with all the listed traits. Gotcha. Kinda like a description or summary.

StevenCannon-USDA commented 1 year ago

@sammyjava - right. So maybe "phenotype_summary" conveys the idea better.

sammyjava commented 1 year ago

Well, sometimes we have a summary "Doesn't make nodules; infection thread aborts" and a longer description that describes the measurement, e.g. "Nodule formation was inspected using a confocal microscope; if fewer than 10 nodules are present on a full root strand then the phenotype is defined as Doesn't make nodules." (I'm sure I got that wrong, but you get the idea.)

Something to consider since you're adding in bespoke trait attributes.

StevenCannon-USDA commented 1 year ago

Brevity is a virtue.

StevenCannon-USDA commented 1 year ago

Sorry: for continuity with other READMEs, let's make it "phenotype_synopsis" rather than "...description" or "...summary". I'll make it so.
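To make the agreed shape concrete: one "phenotype_synopsis" per record at the top level, applying to all of the listed traits, with each trait carrying only its ontology term. The sketch below checks that shape on a record already parsed into a Python dict. The function name `check_synopsis` and its messages are hypothetical, and requiring both "entity_name" and "entity" on every trait is an assumption based on the examples in this thread.

```python
def check_synopsis(record):
    """Return a list of problems with synopsis placement in one parsed record."""
    problems = []
    # The synopsis is optional, but if present it is a single text value.
    synopsis = record.get("phenotype_synopsis")
    if synopsis is not None and not isinstance(synopsis, str):
        problems.append("phenotype_synopsis should be a single text value")
    for i, trait in enumerate(record.get("traits", [])):
        # The synopsis describes the whole record, not an individual trait.
        if "phenotype_synopsis" in trait:
            problems.append(f"trait {i}: phenotype_synopsis belongs at the top level")
        # Assumption: every trait names its term and gives the term ID.
        for key in ("entity_name", "entity"):
            if key not in trait:
                problems.append(f"trait {i}: missing '{key}'")
    return problems


record = {
    "phenotype_synopsis": "Doesn't make nodules; infection thread aborts",
    "traits": [
        {"entity_name": "root nodule number", "entity": "TO:0000900"},
        {"entity_name": "root system", "entity": "PO:0025025"},
        {"entity_name": "root nodule", "entity": "PO:0003023"},
    ],
}
print(check_synopsis(record))
```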

adf-ncgr commented 1 year ago

Would it make sense to associate the phenotype in this sense with the reference that described it? Just thinking that the specifics of the phenotype in this sense will depend on the type of mutation of the gene (induced knockout/overexpression/natural variation) in which deviation from wild-type is observed. In any case, presumably such a description is derived from a specific reference, but if it would be a synthesis across several that we don't plan to tie to specific alleles, then top-level as you have suggested is appropriate. Just something to consider.

StevenCannon-USDA commented 1 year ago

would it make sense to associate the phenotype in this sense with the reference that described it

It would - but at the cost of more "method and protocol". We would end up doing it wrong or inconsistently. Overall, my preference is to try to keep things simple where possible.

Somewhat relatedly: one of my take-aways from the pain of this paper ... Oellrich et al., 2015(url) ... is that ontologies are cumbersome and difficult to apply well, difficult to compose into meaningful "sentences," etc. So, I'll encourage focusing on the entities (anatomy or trait terms) and discourage use of relation and quality terms. I am revising the README now, and will write a protocols document.

sammyjava commented 1 year ago

Yeah, FWIW we only have regular terms associated with stuff in the mines, not quality or relation terms. The ontologies themselves have their hierarchy, of course, but I just find a term that goes with a trait and if it's up- or down- or whatever I don't add that. Every term is standalone, they are not linked.
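The "entities only" preference could be applied mechanically by sorting term IDs on their ontology prefix. This is a rough sketch under stated assumptions: the allowed prefixes are taken to be TO (trait terms) and PO (anatomy terms) because those are the only ones used in the examples in this thread, the seven-digit ID pattern is inferred from those same examples, and the name `classify_terms` is hypothetical. The PATO-prefixed ID in the sample list is there only to show a term being flagged.

```python
import re

# Assumption: entity terms in this collection come from TO and PO only.
ENTITY_PREFIXES = {"TO", "PO"}
# Assumption: IDs look like PREFIX:1234567, as in the examples in this thread.
TERM_PATTERN = re.compile(r"^([A-Za-z]+):(\d{7})$")


def classify_terms(term_ids):
    """Split term IDs into (kept entity terms, flagged everything else)."""
    kept, flagged = [], []
    for term_id in term_ids:
        match = TERM_PATTERN.match(term_id)
        if match and match.group(1) in ENTITY_PREFIXES:
            kept.append(term_id)
        else:
            # Malformed IDs and non-entity ontologies both end up here.
            flagged.append(term_id)
    return kept, flagged


kept, flagged = classify_terms(["TO:0000898", "PO:0003023", "PATO:0000587"])
print("kept:", kept)
print("flagged:", flagged)
```

Each kept term stands alone; nothing here tries to link terms to one another or to compose them into statements.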