Peptides with * in sequence causing trouble (with netMHCcons)

kippakers commented 8 years ago

Hi Again,

I had to track down why one of my cases wasn't getting any results, here's what I found:

Topiary was outputting no epitopes, without a clear error message. I saved the temporary peptides fasta file topiary made, and ran it separately with netMHC3.4 netMHCpan2.8 and netMHCcons1.1. It ran fine with netMHC, with pan, but cons gave an error for different numbers of peptides from pan and netMHC. So I outputted the offending peptide arrays, and found the difference:

'KLNLPSEPX' peptide was made by 1 but not the other. Tracking down this to it's fasta base, it's: LTQMHYVLKLNLPSEP*. Going back one step further, the mutation is interpreted as:

Substitution(variant=chr8 g.142458077C>T, transcript_name=MROH5-001, transcript_id=ENST00000430863, effect_description=p.E917K) 8 LTQMHYVLKLNLPSEP* 17

And the original line in the VCF was:

chr8    142458077   _88UQnS8UQnS011 C   T   .   .   .   AF:DP   0.112:143

It looks like this transcript actually has a * in the reference:


       MDRQCSERPYSCTPTGRVSSAVSQNSRISPPVSTSMKDSSCMKVHQDSARRDRWSHPTTI       
       LLHKSQSSQATLMLQEHRMFMGEAYSAATGFKMLQDMNSADPFHLKYIIKKIKNMAHGSP       
       KLVMETIHDYFIDNPEISSRHKFRLFQTLEMVIGASDVLEETWEKTFTRLALENMTKATE       
       LEDIYQDAASNMLVAICRHSWRVVAQHLETELLTGVFPHRSLLYVMGVLSSSEELFSQED       
       KACWEEQLIQMAIKSVPFLSTDVWSKELLWTLTTPSWTQQEQSPEKAFLFTYYGLILQAE       
       KNGATVRRHLQALLETSHQWPKQREGMALTLGLAATRHLDDVWAVLDQFGRSRPIRWSLP       
       SSSPKNSEDLRWKWASSTILLAYGQVAAKARAHILPWVDNIVSRMVFYFHYSSWDETLKQ       
       SFLTATLMLMGAVSRSEGAHSYEFFQTSELLQCLMVLMEKEPQDTLCTRSRQQAMHIASS       
       LCKLRPPIDLERKSQLLSTCFRSVFALPLLDALEKHTCLFLEPPNIQLWPVARERAGWTH       
       QGWGPRAVLHCSEHLQSLYSRTMEALDFMLQSLIMQNPTADELHFLLSHLYIWLASEKAH       
       ERQRAVHSCMILLKFLNHNGYLDPKEDFKRIGQLVGILGMLCQDPDRATQRCSLEGASHL       
       YQLLMCHKTGEALQAESQAPKELSQAHSDGAPLWNSRDQKATPLGPQEMAKNHIFQLCSF       
       QVIKDIMQQLTLAELSDLIWTAIDGLGSTSPFRVQAASEMLLTAVQEHGAKLEIVSSMAQ       
       AIRLRLCSVHIPQAKEKTLHAITLLARSHTCELVATFLNISIPLDSHTFQLWRALGAGQP       
       TSHLVLTTLLACLQERPLPTGASDSSPCPKEKTYLRLLAAMNMLHELQFAREFKQAVQEG       
       YPKLFLALLTQMHYVLELNLPSEP*PKQQAQEAAVPSPQSCSTSLEALKSLLSTTGHWHD       
       FAHLELQGSWELFTTIHTYPKGVGLLARAMVQNHCRQIPAVLRQLLPSLQSPQERERKVA       
       ILILTKFLYSPVLLEVLPKQAALTVLAQGLHDPSPEVRVLSLQGLSNILFHPDKGSLLQG       
       QLRPLLDGFFQSSDQVIVCIMGTVSDTLHRLGAQGTGSQSLGVAISTRSFFNDERDGIRA       
       AAMALFGDLVAAMADRELSGLRTQVHQSMVPLLLHLKDQCPAVATQAKFTFYRCAVLLRW       
       RLLHTLFCTLAWERGLSARHFLWTCLMTRSQEEFSIHLSQALSYLHSHSCHIKTWVTLFI       
       GHTICYHPQAVFQMLNAVDTNLLFRTFEHLRSDPEPSIREFATSQLSFLQKVSARPKQ

Does this reproduce for you guys (allele=A*02:01)? Can we scrub out asterisks in future updates?

Thanks!!

iskandr commented 8 years ago

I'm not sure if scrubbing stop codons is the right approach. MROH5-001 is annotated as a polymorphic pseudogene. Unless we know that a particular individual lost that stop codon we probably shouldn't be using this as a protein coding transcript. Let me check the logic path that got us to use this transcript in the first place. Is the reference here GRCh38 or GRCh37?

kippakers commented 8 years ago

Oh I didn't know that indicated stop codon. This was GRCh37.

Wouldn't we expect the protein to get made up to the stop codon?

iskandr commented 8 years ago

Some background reading I've been looking at:

The GENCODE pseudogene resource

Polymorphic psuedogene: Locus known to be coding in some individuals but with disabling mutations in the reference genome

Different kinds of pseudogenes: polymorphic pseudogenes

One of the best examples is the human gene for N-acetylaminogalactosyl-transferase on chromosome 9. The normal version of this enzyme attaches a sugar called GalNAc (N-acetylaminogalactose) to proteins on the surface of red blood cells. These are recognized by antibodies as antigen A giving rise to type A blood. A mutation in the coding region of the gene causes the enzyme to transfer galactose (Gal) instead of GalNac and this is recognized by different antibodies giving rise to blood type B. ... A third allele of the same gene is an inactive pseudogene that doesn't work at all. This is responsible for O-type blood [ABO Blood Types]

Many lncRNAs, 5’UTRs, and pseudogenes are translated and some are likely to express functional proteins

Using a new bioinformatic method to analyze ribosome profiling data, we show that 40% of lncRNAs and pseudogene RNAs expressed in human cells are translated. In addition, ~35% of mRNA coding genes are translated upstream of the primary protein-coding region (uORFs) and 4% are translated downstream (dORFs). Translated lncRNAs preferentially localize in the cytoplasm, whereas untranslated lncRNAs preferentially localize in the nucleus. The translation efficiency of cytoplasmic lncRNAs is nearly comparable to that of mRNAs, suggesting that cytoplasmic lncRNAs are engaged by the ribosome and translated. While most peptides generated from lncRNAs may be highly unstable byproducts without function, ~9% of the peptides are conserved in ORFs in mouse transcripts, as are 74% of pseudogene peptides, 24% of uORF peptides and 32% of dORF peptides. Analyses of synonymous and nonsynonymous substitution rates of these conserved peptides show that some are under stabilizing selection, suggesting potential functional importance.

Are Human Translated Pseudogenes Functional?

By definition, pseudogenes are relics of former genes that no longer possess biological functions. Operationally, they are identified based on disruptions of open reading frames (ORFs) or presumed losses of promoter activities. Intriguingly, a recent human proteomic study reported peptides encoded by 107 pseudogenes. These peptides may play currently unrecognized physiological roles. Alternatively, they may have resulted from accidental translations of pseudogene transcripts and possess no function.

Pseudogenes in Human Cancer

Recently, it has emerged that pseudogenes represent a conspicuous part of the human transcriptome and proteome, as thousands of them are transcribed and hundreds are also translated

So, it looks like some fraction of these are translated and I guess particularly for a pseudogene that's "disabled" by a premature stop it should get translated once but then the mRNA will get degraded by NMD.

That invokes a broader question of how you want to deal with frameshifts & premature stop variants and transcripts that are likely to be subjected to NMD.

iskandr commented 8 years ago

The trouble with this particular transcript (and probably many other polymorphic pseudogenes) is that it's littered with stop-gain variants and frameshifts:

http://grch37.ensembl.org/Homo_sapiens/Transcript/Sequence_cDNA?db=core;g=ENSG00000226807;r=8:142443932-142517330;t=ENST00000430863

I think there's very little to be learned from the reference sequence without also phasing some of the polymorphisms.

iskandr commented 8 years ago

So, to avoid a blizzard of false positives, I'm moving polymorphic_pseudogene to the noncoding category: https://github.com/hammerlab/pyensembl/pull/149/

kippakers commented 8 years ago

Thanks! that solution works for me.

kippakers commented 8 years ago

Though I do worry about s in peptides of other categories. The big problem is that it kills ALL the topiary output, so I think stripping out any 's or peptides with '*' midway through processing makes a lot of sense.

iskandr commented 8 years ago

@kippakers OK, also added a little bit of trimming logic to Topiary to not accidentally use stop codons.

openvax / topiary

Peptides with * in sequence causing trouble (with netMHCcons) #53