Open sammyjava opened 1 year ago
@sammyjava Can you give an example? Here is the status quo:
zcat < Glycine.pan4.RK4P.hsh.tsv.gz | head
Glycine.pan4.pan00001 glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G261100.1
Glycine.pan4.pan00001 glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G262900.1
Glycine.pan4.pan00001 glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G261700.1
Glycine.pan4.pan00001 glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G262300.1
Glycine.pan4.pan00001 glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G263500.1
Glycine.pan4.pan00001 glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G264200.1
Glycine.pan4.pan00001 glyma.Hefeng25_IGA1002.gnm1.ann1.SoyHF25_06G281800.1
Glycine.pan4.pan00001 glyma.Huaxia3_IGA1007.gnm1.ann1.SoyHX3_06G278500.1
Glycine.pan4.pan00001 glyma.Huaxia3_IGA1007.gnm1.ann1.SoyHX3_06G279500.1
Glycine.pan4.pan00001 glyma.Jinyuan_IGA1006.gnm1.ann1.SoyJY_06G280300.1
Glycine.pan4.pan00001 glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G261100.1 glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G261100
Glycine.pan4.pan00001 glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G262900.1 glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.06G262900
Simple in this example but we have cases where the transcript identifiers are NOT a number appended to the gene identifier, like
Arachis.pan2.pan00001 arahy.BaileyII.gnm1.ann1.IDmodified-mrna-2217 arahy.BaileyII.gnm1.ann1.IDmodified-gene-2104
Note: this isn't needed for mine loading, since I already know which genes go with which transcripts from the GFFs. It's more about ease of use for users, such as the GCV.
I see. Will need to think about how to do it robustly.
Yeah, I think the only way right now is to use the transcript-gene child-parent relation in the GFF. I am updating my mine loader to do that properly in a post-processor (just load transcripts from the pan-gene sets and then find the genes and proteins in post).
FWIW, when I need to do something similar in the context of gene family assignments, I extract a transcript2gene lookup from the gff (since mRNA records contain both their ID and the Parent attribute, it is pretty straightforward, at least assuming the order of ID and Parent attributes doesn't get switched around...).
I'm curious as to what "extract a transcript2gene lookup" means. Does that refer to a program, like gffread? Or something you've hacked?
Completely the latter. From a comment to one of the scripts that uses such a thing:
#if you need a transcript2gene lookup, something like this should suffice
#zgrep ' mRNA ' medtr.HM004.gnm1.ann1.2XTB.gene_models_main.gff3.gz | sed 's/.*ID=\([^;]*\).*Parent=\([^;]*\).*/\1\t\2/' > transcript2gene
Having given this some more thought ...
This would be easily enabled with a gene identifier column in the .hsh file.
My preference is to do it as a separate transcript-gene hash file. Two reasons: first, the pangene "hsh.tsv" file is the end product of a lengthy process, and I don't want to carry the extra information (gene ID) through that process; and second, I think of a hash file as having two fields (definitionally).
Admittedly, neither of these reasons is particularly strong -- but I don't think of any good reason not to store the transcript-gene mapping in a separate file.
Implementation: I propose re-generating all BED files in the Data Store annotations, to have seven columns:
molecule, feature-start, feature-end, mRNA-ID, score(0), strand, gene-ID
The seventh column (gene-ID) would be nonstandard for BED, but would be an innocuous addition I think. It would be the per-specification BED6, with an extra seventh column holding the gene-ID.
This would have the advantage of making another place to find this mapping easily (it would become one of the standard files in an annotation set); and it would also allow regularizing the BED files in the Data Store. They are currently without standard names:
find . -name "*bed.gz" | perl -pe 's/.+\.(\w+\.bed)\.gz/$1/' | sort | uniq -c
67 cds.bed
1 gene_models_lowqual.bed
54 gene_models_main.bed
1 transdec_gff3.bed
In the update, all would be gene_models_main.bed
Then, during creation of a pangene or gene family set, the transcript-gene hash will just be derived from columns 4 and 7 in the BED files.
No opinion here. Like I said, I use the annotation GFFs to connect transcripts to genes (and proteins, when they have the same identifier), so the extra data in the pan-gene set collections isn't needed for mine loading, but I do get that people like to see gene names pretty much everywhere.
@StevenCannon-USDA I have no objection to augmenting the bed files with genes and standardizing naming. But when you say
Then, during creation of a pangene or gene family set, the transcript-gene hash will just be derived from columns 4 and 7 in the BED files. does this imply you would use that hash to add a representation of the genes somewhere in the pangene set files too? Because as @sammyjava says "people like to see gene names pretty much everywhere"- and I just know you want to store a reference to an array as values in your hashes... ;)
Seriously, though, on analogy with the gene family assignment files I could at least imagine having some metadata about a gene's inclusion in a pangene set and storing it there (similar to how we store goodness of match to the HMM in the gfa files). Some off-the-cuff examples of such metadata that come to mind (none of them very compelling, but maybe could inspire some other thoughts about it)
pretty much just thinking out loud there, and don't feel strongly about any of it.
does this imply you would use that hash to add a representation of the genes somewhere in the pangene set files too?
Yes. Something like the following (contrived, but with real data).
for file in *bed.gz; do zcat $file | head -1 | cut -f4,7; done
arath.Col0.gnm9.ann11.AT2G44175.1 arath.Col0.gnm9.ann11.AT2G44175
glyma.Wm82.gnm4.ann1.Glyma.20G155500.1 glyma.Wm82.gnm4.ann1.Glyma.20G155500
medtr.A17_HM341.gnm4.ann2.Medtr3g070500.3 medtr.A17_HM341.gnm4.ann2.Medtr3g070500
phavu.G19833.gnm2.ann1.Phvul.005G131400.1 phavu.G19833.gnm2.ann1.Phvul.005G131400
pissa.Cameor.gnm1.ann1.Psat0s798g0080.1 pissa.Cameor.gnm1.ann1.Psat0s798g0080
prupe.Lovell.gnm2.ann1.Prupe.7G026800.1 prupe.Lovell.gnm2.ann1.Prupe.7G026800
sento.Myeongyun.gnm1.ann1.Sto13g434570 sento.Myeongyun.gnm1.Sto13g434570
singl.CAF01.gnm1.ann1.evm_model_Chr03_3668 singl.CAF01.gnm1.ann1.evm_TU_Chr03_3668
vigun.IT97K-499-35.gnm1.ann2.Vigun06g214150.1 vigun.IT97K-499-35.gnm1.ann2.Vigun06g214150
vitvi.PN40024.gnm2.ann1.VIT_200s0271g00070.1 vitvi.PN40024.gnm2.ann1.VIT_200s0271g00070
In this sample, all are trivial except Sindora (singl). As you point out, it needn't be restricted to two columns - though this example is. It could be called e.g. transcript-gene.tsv
if it were simply a map/hash; or if it held additional information, it could be called e.g. transcript-info.tsv
.
Progress on this task: I have regenerated the bed files for all annotation collections (121 currently).
Now, each such collection has a bed.gz file with seven columns. The fourth has the transcript name and the seventh has the gene name. For example, zcat vigun.ZN016.gnm1.ann2.C7YV.gene_models_main.bed.gz | head -4
vigun.ZN016.gnm1.chr1 13793 14745 vigun.ZN016.gnm1.ann2.VuZN016.01G000100.2 0 - vigun.ZN016.gnm1.ann2.VuZN016.01G000100
vigun.ZN016.gnm1.chr1 13793 14745 vigun.ZN016.gnm1.ann2.VuZN016.01G000100.3 0 - vigun.ZN016.gnm1.ann2.VuZN016.01G000100
vigun.ZN016.gnm1.chr1 13793 15571 vigun.ZN016.gnm1.ann2.VuZN016.01G000100.1 0 - vigun.ZN016.gnm1.ann2.VuZN016.01G000100
vigun.ZN016.gnm1.chr1 16175 19400 vigun.ZN016.gnm1.ann2.VuZN016.01G000200.1 0 + vigun.ZN016.gnm1.ann2.VuZN016.01G000200
I'll also try to incorporate the transcript name -- gene name correspondence in the pan-gene sets when I recalculate those.
Sounds like with the current timing I'll be able to load genes and proteins from the new pan-gene sets in 5.1.0.4. Not that it matters much, since I currently find the genes and proteins with a post-processor, but nice to do it in one swell foop.
A conversation amongst NCGR legumistas concluded that it would be nice to have gene identifiers in the pan-gene set collections. This would be easily enabled with a gene identifier column in the .hsh file.