Closed kippakers closed 8 years ago
I'm not sure if scrubbing stop codons is the right approach. MROH5-001 is annotated as a polymorphic pseudogene. Unless we know that a particular individual lost that stop codon we probably shouldn't be using this as a protein coding transcript. Let me check the logic path that got us to use this transcript in the first place. Is the reference here GRCh38 or GRCh37?
Oh I didn't know that indicated stop codon. This was GRCh37.
Wouldn't we expect the protein to get made up to the stop codon?
Some background reading I've been looking at:
The GENCODE pseudogene resource
Polymorphic psuedogene: Locus known to be coding in some individuals but with disabling mutations in the reference genome
Different kinds of pseudogenes: polymorphic pseudogenes
One of the best examples is the human gene for N-acetylaminogalactosyl-transferase on chromosome 9. The normal version of this enzyme attaches a sugar called GalNAc (N-acetylaminogalactose) to proteins on the surface of red blood cells. These are recognized by antibodies as antigen A giving rise to type A blood. A mutation in the coding region of the gene causes the enzyme to transfer galactose (Gal) instead of GalNac and this is recognized by different antibodies giving rise to blood type B. ... A third allele of the same gene is an inactive pseudogene that doesn't work at all. This is responsible for O-type blood [ABO Blood Types]
Using a new bioinformatic method to analyze ribosome profiling data, we show that 40% of lncRNAs and pseudogene RNAs expressed in human cells are translated. In addition, ~35% of mRNA coding genes are translated upstream of the primary protein-coding region (uORFs) and 4% are translated downstream (dORFs). Translated lncRNAs preferentially localize in the cytoplasm, whereas untranslated lncRNAs preferentially localize in the nucleus. The translation efficiency of cytoplasmic lncRNAs is nearly comparable to that of mRNAs, suggesting that cytoplasmic lncRNAs are engaged by the ribosome and translated. While most peptides generated from lncRNAs may be highly unstable byproducts without function, ~9% of the peptides are conserved in ORFs in mouse transcripts, as are 74% of pseudogene peptides, 24% of uORF peptides and 32% of dORF peptides. Analyses of synonymous and nonsynonymous substitution rates of these conserved peptides show that some are under stabilizing selection, suggesting potential functional importance.
Are Human Translated Pseudogenes Functional?
By definition, pseudogenes are relics of former genes that no longer possess biological functions. Operationally, they are identified based on disruptions of open reading frames (ORFs) or presumed losses of promoter activities. Intriguingly, a recent human proteomic study reported peptides encoded by 107 pseudogenes. These peptides may play currently unrecognized physiological roles. Alternatively, they may have resulted from accidental translations of pseudogene transcripts and possess no function.
Recently, it has emerged that pseudogenes represent a conspicuous part of the human transcriptome and proteome, as thousands of them are transcribed and hundreds are also translated
So, it looks like some fraction of these are translated and I guess particularly for a pseudogene that's "disabled" by a premature stop it should get translated once but then the mRNA will get degraded by NMD.
That invokes a broader question of how you want to deal with frameshifts & premature stop variants and transcripts that are likely to be subjected to NMD.
The trouble with this particular transcript (and probably many other polymorphic pseudogenes) is that it's littered with stop-gain variants and frameshifts:
I think there's very little to be learned from the reference sequence without also phasing some of the polymorphisms.
So, to avoid a blizzard of false positives, I'm moving polymorphic_pseudogene
to the noncoding category: https://github.com/hammerlab/pyensembl/pull/149/
Thanks! that solution works for me.
Though I do worry about s in peptides of other categories. The big problem is that it kills ALL the topiary output, so I think stripping out any 's or peptides with '*' midway through processing makes a lot of sense.
@kippakers OK, also added a little bit of trimming logic to Topiary to not accidentally use stop codons.
Hi Again,
I had to track down why one of my cases wasn't getting any results, here's what I found:
Topiary was outputting no epitopes, without a clear error message. I saved the temporary peptides fasta file topiary made, and ran it separately with netMHC3.4 netMHCpan2.8 and netMHCcons1.1. It ran fine with netMHC, with pan, but cons gave an error for different numbers of peptides from pan and netMHC. So I outputted the offending peptide arrays, and found the difference:
'KLNLPSEPX' peptide was made by 1 but not the other. Tracking down this to it's fasta base, it's: LTQMHYVLKLNLPSEP*. Going back one step further, the mutation is interpreted as:
And the original line in the VCF was:
It looks like this transcript actually has a * in the reference:
Does this reproduce for you guys (allele=A*02:01)? Can we scrub out asterisks in future updates?
Thanks!!