Add gene identifier column to pan-gene set .hsh file?

sammyjava commented 1 year ago

A conversation amongst NCGR legumistas concluded that it would be nice to have gene identifiers in the pan-gene set collections. This would be easily enabled with a gene identifier column in the .hsh file.

StevenCannon-USDA commented 1 year ago

@sammyjava Can you give an example? Here is the status quo:

zcat < Glycine.pan4.RK4P.hsh.tsv.gz | head
Glycine.pan4.pan00001   glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G261100.1
Glycine.pan4.pan00001   glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G262900.1
Glycine.pan4.pan00001   glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G261700.1
Glycine.pan4.pan00001   glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G262300.1
Glycine.pan4.pan00001   glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G263500.1
Glycine.pan4.pan00001   glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G264200.1
Glycine.pan4.pan00001   glyma.Hefeng25_IGA1002.gnm1.ann1.SoyHF25_06G281800.1
Glycine.pan4.pan00001   glyma.Huaxia3_IGA1007.gnm1.ann1.SoyHX3_06G278500.1
Glycine.pan4.pan00001   glyma.Huaxia3_IGA1007.gnm1.ann1.SoyHX3_06G279500.1
Glycine.pan4.pan00001   glyma.Jinyuan_IGA1006.gnm1.ann1.SoyJY_06G280300.1

sammyjava commented 1 year ago

Glycine.pan4.pan00001   glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G261100.1  glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G261100
Glycine.pan4.pan00001   glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G262900.1  glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G262900

Simple in this example but we have cases where the transcript identifiers are NOT a number appended to the gene identifier, like

Arachis.pan2.pan00001   arahy.BaileyII.gnm1.ann1.IDmodified-mrna-2217   arahy.BaileyII.gnm1.ann1.IDmodified-gene-2104

sammyjava commented 1 year ago

Note: this isn't needed for mine loading, since I already know which genes go with which transcripts from the GFFs. It's more about ease of use for users, such as the GCV.

StevenCannon-USDA commented 1 year ago

I see. Will need to think about how to do it robustly.

sammyjava commented 1 year ago

Yeah, I think the only way right now is to use the transcript-gene child-parent relation in the GFF. I am updating my mine loader to do that properly in a post-processor (just load transcripts from the pan-gene sets and then find the genes and proteins in post).

adf-ncgr commented 1 year ago

FWIW, when I need to do something similar in the context of gene family assignments, I extract a transcript2gene lookup from the gff (since mRNA records contain both their ID and the Parent attribute, it is pretty straightforward, at least assuming the order of ID and Parent attributes doesn't get switched around...).

sammyjava commented 1 year ago

I'm curious as to what "extract a transcript2gene lookup" means. Does that refer to a program, like gffread? Or something you've hacked?

adf-ncgr commented 1 year ago

Completely the latter. From a comment to one of the scripts that uses such a thing:

#if you need a transcript2gene lookup, something like this should suffice
#zgrep '  mRNA    ' medtr.HM004.gnm1.ann1.2XTB.gene_models_main.gff3.gz | sed 's/.*ID=\([^;]*\).*Parent=\([^;]*\).*/\1\t\2/' > transcript2gene

StevenCannon-USDA commented 1 year ago

Having given this some more thought ...

This would be easily enabled with a gene identifier column in the .hsh file.

My preference is to do it as a separate transcript-gene hash file. Two reasons: first, the pangene "hsh.tsv" file is the end product of a lengthy process, and I don't want to carry the extra information (gene ID) through that process; and second, I think of a hash file as having two fields (definitionally).

Admittedly, neither of these reasons is particularly strong -- but I don't think of any good reason not to store the transcript-gene mapping in a separate file.

Implementation: I propose re-generating all BED files in the Data Store annotations, to have seven columns: molecule, feature-start, feature-end, mRNA-ID, score(0), strand, gene-ID The seventh column (gene-ID) would be nonstandard for BED, but would be an innocuous addition I think. It would be the per-specification BED6, with an extra seventh column holding the gene-ID.

This would have the advantage of making another place to find this mapping easily (it would become one of the standard files in an annotation set); and it would also allow regularizing the BED files in the Data Store. They are currently without standard names:

find . -name "*bed.gz" | perl -pe 's/.+\.(\w+\.bed)\.gz/$1/' | sort | uniq -c
  67 cds.bed
   1 gene_models_lowqual.bed
  54 gene_models_main.bed
   1 transdec_gff3.bed

In the update, all would be gene_models_main.bed

Then, during creation of a pangene or gene family set, the transcript-gene hash will just be derived from columns 4 and 7 in the BED files.

sammyjava commented 1 year ago

No opinion here. Like I said, I use the annotation GFFs to connect transcripts to genes (and proteins, when they have the same identifier), so the extra data in the pan-gene set collections isn't needed for mine loading, but I do get that people like to see gene names pretty much everywhere.

adf-ncgr commented 1 year ago

@StevenCannon-USDA I have no objection to augmenting the bed files with genes and standardizing naming. But when you say

Then, during creation of a pangene or gene family set, the transcript-gene hash will just be derived from columns 4 and 7 in the BED files. does this imply you would use that hash to add a representation of the genes somewhere in the pangene set files too? Because as @sammyjava says "people like to see gene names pretty much everywhere"- and I just know you want to store a reference to an array as values in your hashes... ;)

Seriously, though, on analogy with the gene family assignment files I could at least imagine having some metadata about a gene's inclusion in a pangene set and storing it there (similar to how we store goodness of match to the HMM in the gfa files). Some off-the-cuff examples of such metadata that come to mind (none of them very compelling, but maybe could inspire some other thoughts about it)

whether a given gene was chosen as the exemplar (of course, this also appears in the fastas)
the transcript length (obviously available elsewhere but possibly of use for assessing allelic variation at a very coarse level)
a hash based on the sequence of the transcript and protein sequences (again, a coarse way of representing whether there is some sort of sequence variation among the member sequences)

pretty much just thinking out loud there, and don't feel strongly about any of it.

StevenCannon-USDA commented 1 year ago

does this imply you would use that hash to add a representation of the genes somewhere in the pangene set files too?

Yes. Something like the following (contrived, but with real data).

for file in *bed.gz; do zcat $file | head -1 | cut -f4,7; done
arath.Col0.gnm9.ann11.AT2G44175.1   arath.Col0.gnm9.ann11.AT2G44175
glyma.Wm82.gnm4.ann1.Glyma.20G155500.1  glyma.Wm82.gnm4.ann1.Glyma.20G155500
medtr.A17_HM341.gnm4.ann2.Medtr3g070500.3   medtr.A17_HM341.gnm4.ann2.Medtr3g070500
phavu.G19833.gnm2.ann1.Phvul.005G131400.1   phavu.G19833.gnm2.ann1.Phvul.005G131400
pissa.Cameor.gnm1.ann1.Psat0s798g0080.1 pissa.Cameor.gnm1.ann1.Psat0s798g0080
prupe.Lovell.gnm2.ann1.Prupe.7G026800.1 prupe.Lovell.gnm2.ann1.Prupe.7G026800
sento.Myeongyun.gnm1.ann1.Sto13g434570  sento.Myeongyun.gnm1.Sto13g434570
singl.CAF01.gnm1.ann1.evm_model_Chr03_3668  singl.CAF01.gnm1.ann1.evm_TU_Chr03_3668
vigun.IT97K-499-35.gnm1.ann2.Vigun06g214150.1   vigun.IT97K-499-35.gnm1.ann2.Vigun06g214150
vitvi.PN40024.gnm2.ann1.VIT_200s0271g00070.1    vitvi.PN40024.gnm2.ann1.VIT_200s0271g00070

In this sample, all are trivial except Sindora (singl). As you point out, it needn't be restricted to two columns - though this example is. It could be called e.g. transcript-gene.tsv if it were simply a map/hash; or if it held additional information, it could be called e.g. transcript-info.tsv.

StevenCannon-USDA commented 11 months ago

Progress on this task: I have regenerated the bed files for all annotation collections (121 currently). Now, each such collection has a bed.gz file with seven columns. The fourth has the transcript name and the seventh has the gene name. For example, zcat vigun.ZN016.gnm1.ann2.C7YV.gene_models_main.bed.gz | head -4

vigun.ZN016.gnm1.chr1   13793   14745   vigun.ZN016.gnm1.ann2.VuZN016.01G000100.2   0   -   vigun.ZN016.gnm1.ann2.VuZN016.01G000100
vigun.ZN016.gnm1.chr1   13793   14745   vigun.ZN016.gnm1.ann2.VuZN016.01G000100.3   0   -   vigun.ZN016.gnm1.ann2.VuZN016.01G000100
vigun.ZN016.gnm1.chr1   13793   15571   vigun.ZN016.gnm1.ann2.VuZN016.01G000100.1   0   -   vigun.ZN016.gnm1.ann2.VuZN016.01G000100
vigun.ZN016.gnm1.chr1   16175   19400   vigun.ZN016.gnm1.ann2.VuZN016.01G000200.1   0   +   vigun.ZN016.gnm1.ann2.VuZN016.01G000200

I'll also try to incorporate the transcript name -- gene name correspondence in the pan-gene sets when I recalculate those.

sammyjava commented 11 months ago

Sounds like with the current timing I'll be able to load genes and proteins from the new pan-gene sets in 5.1.0.4. Not that it matters much, since I currently find the genes and proteins with a post-processor, but nice to do it in one swell foop.

legumeinfo / datastore-specifications

Add gene identifier column to pan-gene set .hsh file? #45