legumeinfo / datastore-specifications

Specifications for directory naming, file naming, file contents in the LIS datastore

Expression data may map reads from Strain B to the genome annotation of Strain A #26

Closed sammyjava closed 1 year ago

sammyjava commented 1 year ago

This is rearing its head w.r.t. the Cook Lab expression data that @adf-ncgr has generated using Nextflow. We have expression from seven "genotypes" (CDC_Consul, Genesis836, Hybrid, ICC14778, ICC8058, Rupali, Yorker) mapped to CDCFrontier.gnm3.ann1.

Currently those data would be housed in CDCFrontier.gnm3.ann1.expr.KEY4, even though none of the reads are from CDCFrontier.

So we've got the old "one thing that is mapped to another thing" problem, which we solved for synteny with the _x_ notation.

We only have one README per collection, which presumes a single publication or contributor, so there is no way we can combine a variety of expression sets from various strains into one collection named only by the genome annotation the reads were mapped to, unless we rely on KEY4. That's another, related, issue with our current naming.

SO, I think we need to change how expression collections are named: indicating the strain that generated the reads on top of the genome annotation those reads are mapped to, with the option to use "mixed" in the case of a variety of strains:

or perhaps just

Not a gigantic tweak, but a tweak that would be applied to existing expression collections.

If we're concerned about the Doe Lab and the Jones Lab both generating expression sets for the same strain and genome annotation, we could consider appending authors to the collection name. But I don't think we'll hit that very much, and we can always use KEY4 to separate collections containing similar data.

In all cases we'd make the "genotype" column in the samples file required, and use it to create the Strain to which samples are related. The README genotype would refer to the genome annotation as usual.

I'd also like to simplify the samples file, which is currently inherited from the file that Sudhansu made up pre-Datastore and in which I ignore many columns that are not sample-specific. I'll codify all that in a spec branch first for review, of course. That's not part of this issue, but heads up.

Feel free to chime in @cann0010 or just leave it to @adf-ncgr and me to arm-wrestle through.

Since we only have six loadable expression collections in the DS, this is a good time to improve our storage of expression data.

sammyjava commented 1 year ago

@adf-ncgr points out that this is a good candidate for the type of naming we do under /diversity/, namely, in this case,

Strain.gnm.ann.expr.Author1_Author2_YEAR

which handles everything. I'm cool with that. It doesn't give any indication of WHAT was mapped to Strain.gnm.ann, but that's what the README is for.

We should probably use README.genotype to list the sample genotypes. (They're also in the samples file, but that way the README-at-a-glance on the data host would show them.)

sammyjava commented 1 year ago

Just to flesh that out, the current expression collections would then be named:

ICPL87119.gnm1.ann1.expr.Pazhamala_Purohit_2017
ICC4958.gnm2.ann1.expr.Singh_Garg_2013
Wm82.gnm2.ann1.expr.Libault_Farmer_2010
A17_HM341.gnm4.ann2.expr.Benedito_Wang_2009
G19833.gnm1.ann1.expr.ORourke_Iniguez_2014
IT97K-499-35.gnm1.ann1.expr.Yao_Jiang_2016
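
As a rough illustration of how names in this form could be taken apart, here is a minimal Python sketch. The regular expression and the function name `parse_expr_name` are assumptions made for this example; they are not taken from the Datastore specification or any LIS tooling.

```python
import re

# Hypothetical pattern for names such as
# G19833.gnm1.ann1.expr.ORourke_Iniguez_2014 (an assumption, not the spec).
NAME_RE = re.compile(
    r"^(?P<strain>.+?)\.gnm(?P<gnm>\d+)\.ann(?P<ann>\d+)\.expr\."
    r"(?P<author1>[^._]+)_(?P<author2>[^._]+)_(?P<year>\d{4})$"
)


def parse_expr_name(name):
    """Split a collection name into strain, versions, authors and year."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not an Author1_Author2_YEAR expression name: {name!r}")
    parts = match.groupdict()
    # Convert the numeric fields so they can be compared or sorted.
    for key in ("gnm", "ann", "year"):
        parts[key] = int(parts[key])
    return parts


if __name__ == "__main__":
    print(parse_expr_name("A17_HM341.gnm4.ann2.expr.Benedito_Wang_2009"))
```

A name that still ends in a four-character key (for example `CDCFrontier.gnm3.ann1.expr.KEY4`) does not match this pattern and raises `ValueError`, which could be one way to spot collections that have not been renamed yet.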

adf-ncgr commented 1 year ago

Works for me. I'll note that I hadn't remembered "vanWyk_DuPlessis_2014", but it looks like this is just a couple of BAM files, so it is a little different from our normal expr dataset. Not sure what the intent was, but perhaps someone will be able to vouch for its non-conformity.

sammyjava commented 1 year ago

Right, that's not a loadable expression set. I had forgotten the phavu set, which makes six loadables.

StevenCannon-USDA commented 1 year ago

Looks OK to me too. Since this will impact @sdash-github, he should weigh in too.

sdash-github commented 1 year ago

GenomeStrain.gnm.ann.expr.SampleStrain.Key ("indicating the strain that generated the reads on top of the genome annotation those reads are mapped to") vs. GenomeStrain.gnm.ann.expr.Author1_Author2_YEAR.Key

I am going through the expression README files for the five species with two aspects in mind: (a) conformance to our DS conventions and (b) immediate clarity for the visitor/user (which is an issue now).

Question for Sam: should we avoid adding attributes to the README for the expression collection, or is it okay to add something like 'genotype_sample' and 'genotype_referenceGenome'?

Your answer will help me gear my comment on this issue.

sammyjava commented 1 year ago

In principle, we can add things to the READMEs, but it has to be done via the specifications and then I have to add them to the README loader for the mines. Any addition should be strongly motivated.

And note: the README should only contain info that pertains to the entire collection. We can list eight genotypes under the existing README.genotype list (it is a list pertaining to the entire collection). But we also need to include the genotype column in the samples file to indicate the genotype (AKA Strain) of each sample.

Likewise, the samples file columns should not contain collection-wide attributes like BioProject and SRA study, which pertain to the entire collection and should be stored in the README, not the samples file.

Right now the README for expression collections has four expression-related additional attributes:

I'm not convinced we need more than that. The referenced genome is given by the name of the collection (e.g. Wm82.gnm2.ann1) as usual, and the sample genotypes should be listed in the samples file, and should also be listed under README.genotype.

I'm going to make a proposed file set today for the new Cicer data from Andrew (which has seven "genotypes", all differing from the genome they were mapped to), pretty similar to what we already have, as a concrete point of discussion. We can then decide what is missing and what needs to be moved around.

sdash-github commented 1 year ago

Understood: the README should contain collection attributes, not attributes of each sample.

I looked at #27 (the simplified samples file proposal) and at this issue with one fundamental piece of information about an expr dataset in mind: "what samples have been mapped to what reference genome" should be immediately apparent.

Suggestion: Some way (any way) we should make it immediately apparent from the README without having to open another file for a basic piece of information without violating any DS requirements. This will be a user friendly feature.

We can:

Or any other way of getting this info to the user without them having to dig deeper. I thought about the 'description' attribute of the README, but it has the potential to get buried inside a large amount of text. Still, that is definitely better than opening a zipped file to get this information.

sammyjava commented 1 year ago

> Suggestion: Some way (any way) we should make it immediately apparent from the README without having to open another file for a basic piece of information without violating any DS requirements. This will be a user friendly feature.

I think you missed my proposal that we use the normal genotype field in the README to list the genotypes, which is appropriate since the collection pertains to those genotypes:

---
identifier: CDCFrontier.gnm3.ann1.expr.mixed.Cook_Moenga_2018
scientific_name: Cicer arietinum
taxid: 3827
scientific_name_abbrev: cicar
genotype:
  - CDC_Consul
  - Genesis836
  - Hybrid
  - ICC14778
  - ICC8058
  - Rupali
  - Yorker
synopsis: "Chickpea gene expression from seven genotypes mapped to the CDCFrontier.gnm3.ann1 genes."
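
One way to keep the README genotype list and the per-sample genotype column in step is a small consistency check. The sketch below is only an illustration under assumptions: the function name `check_genotypes` is hypothetical, and it takes the README list and the sample rows as plain Python objects rather than reading the real Datastore files.

```python
def check_genotypes(readme_genotypes, sample_rows):
    """Return a list of problems; an empty list means the two sources agree.

    readme_genotypes: genotype names listed in the README.
    sample_rows: one dict per sample, each expected to carry a "genotype" key
    (a hypothetical in-memory form of the samples file).
    """
    problems = []
    sample_genotypes = set()
    for number, row in enumerate(sample_rows, start=1):
        genotype = (row.get("genotype") or "").strip()
        if not genotype:
            # The genotype column is required for every sample.
            problems.append(f"sample row {number}: missing genotype")
        else:
            sample_genotypes.add(genotype)
    listed = set(readme_genotypes)
    for genotype in sorted(sample_genotypes - listed):
        problems.append(f"{genotype!r} is in the samples file but not in the README")
    for genotype in sorted(listed - sample_genotypes):
        problems.append(f"{genotype!r} is in the README but not in any sample")
    return problems


if __name__ == "__main__":
    readme = ["CDC_Consul", "Genesis836", "Rupali"]
    samples = [{"genotype": "CDC_Consul"}, {"genotype": "Rupali"}, {"genotype": ""}]
    for problem in check_genotypes(readme, samples):
        print(problem)
```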

sdash-github commented 1 year ago

The dataset at https://data.legumeinfo.org/Cicer/arietinum/expression/CDCFrontier.gnm1.ann1.expr.C14F/ lists two genotypes in README. genotype:

2nd example.

In the dataset https://data.legumeinfo.org/Phaseolus/vulgaris/expression/G19833.gnm1.ann1.expr.4ZDQ/ the README lists genotype:

Here the genotype attribute, valued G19833, is even more ambiguous because the reads come from the Mesoamerican bean type Negro Jamapa and not from the Andean type G19833.

I am suggesting that, while you formalize the specifications, these sorts of ambiguities be removed right from the start.

adf-ncgr commented 1 year ago

If I am following, I think the proposal is to make the use of genotype in expression READMEs consistent with its use in diversity READMEs, where it describes the samples analyzed and not the reference (unless one of the samples analyzed happens to be the same as the reference genotype, which probably happens more commonly in expression than in diversity). So @sdash-github is correct that those READMEs are incorrect according to what I think is @sammyjava's proposal. We should fix them and try to make sure the specification is clear (though curators will probably always make mistakes).

sammyjava commented 1 year ago

Correct. I didn't catch that the phavu expression is from a different genotype than G19833, to be honest. So that README under my proposal (including name change) should be:

---
identifier: G19833.gnm1.ann1.expr.ORourke_Iniguez_2014
provenance: "LIS expression dataset phavu1 (Bean expression atlas Negro jamapa)."
scientific_name: Phaseolus vulgaris
taxid: 3885
bioproject: PRJNA210619
scientific_name_abbrev: phavu
genotype:
  - Negro jamapa
synopsis: "LIS expression dataset phavu1 (An RNA-Seq based gene expression atlas of the common bean cv. Negro jamapa)."
description: "LIS expression dataset phavu1 (An RNA-Seq based gene expression atlas of the common bean cv. Negro jamapa)."
expression_unit: TPM
local_file_creation_date: "2017-01-01"
publication_doi: 10.1186/1471-2164-15-866
publication_title: "O'Rourke JA, Iniguez LP, Fu F, Bucciarelli B, Miller SS, Jackson SA, McClean PE, Li J, Dai X, Zhao PX, Hernandez G, Vance CP. An RNA-Seq based gene expression atlas of the common bean. BMC Genomics. 2014 Oct 6;15:866."
license: open

sammyjava commented 1 year ago

As for the ICC4958 and CDCFrontier (the former of which is currently the only loadable expression collection of the two), same deal, both READMEs would have ICC4958 as the genotype entry.

sdash-github commented 1 year ago

I think I am a bit clearer now than when I started.

As articulated by @adf-ncgr, it would be useful in the long term to have a place that states this point: for expression collections in our DS, 'genotype' refers to the genotype(s) of the samples rather than the reference, as is the case in the diversity collections.

sammyjava commented 1 year ago

Yes, the place where we do that is the Datastore specification for expression which I'm proposing to update: https://github.com/legumeinfo/datastore-specifications/tree/main/Genus/species/expression

sammyjava commented 1 year ago

This was implemented with the DS spec update, which changed collection names to, e.g., G19833.gnm1.ann1.expr.Negro_jamapa.ORourke_Iniguez_2014
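
For illustration only, a name of this shape could be assembled as in the Python sketch below. The function `expr_collection_name` is hypothetical, and two of its rules are assumptions inferred from the examples in this thread rather than from the spec: spaces in a genotype become underscores, and more than one distinct genotype collapses to "mixed".

```python
def expr_collection_name(genome_annotation, sample_genotypes, authors, year):
    """Build GenomeStrain.gnmN.annN.expr.SampleStrain.Author1_Author2_YEAR.

    Assumptions (not confirmed by the spec): a single genotype is used with
    spaces turned into underscores; several genotypes become "mixed".
    """
    distinct = sorted(set(sample_genotypes))
    if not distinct:
        raise ValueError("at least one sample genotype is required")
    if len(distinct) == 1:
        strain_part = distinct[0].replace(" ", "_")
    else:
        strain_part = "mixed"
    # Only the first two author surnames appear in the name.
    author_part = "_".join(authors[:2])
    return f"{genome_annotation}.expr.{strain_part}.{author_part}_{year}"


if __name__ == "__main__":
    print(expr_collection_name(
        "G19833.gnm1.ann1", ["Negro jamapa"], ["ORourke", "Iniguez"], 2014))
    print(expr_collection_name(
        "CDCFrontier.gnm3.ann1", ["Rupali", "Yorker"], ["Cook", "Moenga"], 2018))
```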