pombase / allele_qc

Quality control for PomBase alleles
MIT License

Fix mitochondria translation #22

Closed manulera closed 1 year ago

manulera commented 1 year ago

https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG3

manulera commented 1 year ago

Hi @ValWood,

One thing that was missing was correct handling of mitochondrial translation. As part of the checks, I translate the CDS sequences from the contig files and compare the result with the protein sequences in the fasta files.

For mitochondrial genes, if I translate the DNA with the "normal" (standard) genetic code, the only discrepancies are at UGA codons, which should be tryptophan (W) instead of a stop codon (see below, where | marks a mismatch between the translation from the contig and the sequence in the fasta file).

SPMIT.03
contig
MLKQLSISAGNLLNKGTSETLRNEITTKKVSIHLPKHLKPANDSQFGHYLAGLIDGDGHFSSKQQLIIAFHSLDIQLAYYIKKQIGYGIVRKIKDKNAILFIIANSKGIERVITLINNKFRTTSKYNQIINNIFAHPRFKEFSKTITLGLNSNNNLNNHWLAGFSDADASFQIKILNRDKKIEVRLNYQIDQKKEYLLSLIKDNLGGNIGYRKSQDTYYYGSTSFGSAKKVINYFDNYHLLSSKYISYLK*RKAYLIIQENKHLTESGLSQIKKPHPYRKNIN*
                                                                                                                                                                                                                                                          |                                 
MLKQLSISAGNLLNKGTSETLRNEITTKKVSIHLPKHLKPANDSQFGHYLAGLIDGDGHFSSKQQLIIAFHSLDIQLAYYIKKQIGYGIVRKIKDKNAILFIIANSKGIERVITLINNKFRTTSKYNQIINNIFAHPRFKEFSKTITLGLNSNNNLNNHWLAGFSDADASFQIKILNRDKKIEVRLNYQIDQKKEYLLSLIKDNLGGNIGYRKSQDTYYYGSTSFGSAKKVINYFDNYHLLSSKYISYLKWRKAYLIIQENKHLTESGLSQIKKPHPYRKNIN*
fasta

SPMIT.06
contig
LRRCGIYVYPHRERDILCVKI*TIHLGSWGNPMPNRACVQKVLPVTKQISSDGSVQIDTVRAVLPEFQFPSHPQIGDCLS*IETFFSRSLVGFYDQGYTPGEESCTNSTIKGMSGKPTSINSNIYTTTGPAKVSNDYAVRDPGVAVDHFDQYGPLKEGRSLNSAKISTQ*SGSATLKSSNRSIFNIGLGYINTFLGVSNVRGFSTGSGRSKNVLNKLDDLSKRSKNYPNLVIDRNLYKDFLLNRDMFLIAYNKLKSNPGMMTPGLKPDTLDGMSIDVIDKIIQSLKSEEFNFTPGRRILIDKASGGKRPLTIGSPRDKLVQEILRIVLEAIYEPLFNTASHGFRPGRSCHSALRSIFTNFKGCTWWIEGDIKACFDSIPHDKLIALLSSKIKDQRFIQLIRKALNAGYLTENRYKYDIVGTPQGSIVSPILANIYLHQLDEFIENLKSEFDYKGPIARKRTSESRHLHYLMAKAKRENADSKTIRKIAIEMRNVPNKIHGIQSNKLMYVRYADDWIVAVNGSYTQTKEILAKITCFCSSIGLTVSPTKTKITNSYTDKILFLGTNISHSKNVTFSRHFGILQRNSGFILLSAPMDRIAKKLRETGLMLNHKGRSVIRWLPLDVRQIIGLANSIIRGYDNYYSFVHNRGRFATYVYFIIKDCVLRTLAHKLSLGTRMKVIKKFGPDLSIYDYNSRDENNKPKLITQLFKPSWKVNVWGFKSDKVKLNIRTLYASHLSMANLENLQCAACQSTYKVEMHHVRQMKNLKPIKGTLDYLMAKANRKQIPLCRSCHMKLHANKLTLNEDKKV*
                     |                                                          |                                                                                        |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
LRRCGIYVYPHRERDILCVKIWTIHLGSWGNPMPNRACVQKVLPVTKQISSDGSVQIDTVRAVLPEFQFPSHPQIGDCLSWIETFFSRSLVGFYDQGYTPGEESCTNSTIKGMSGKPTSINSNIYTTTGPAKVSNDYAVRDPGVAVDHFDQYGPLKEGRSLNSAKISTQWSGSATLKSSNRSIFNIGLGYINTFLGVSNVRGFSTGSGRSKNVLNKLDDLSKRSKNYPNLVIDRNLYKDFLLNRDMFLIAYNKLKSNPGMMTPGLKPDTLDGMSIDVIDKIIQSLKSEEFNFTPGRRILIDKASGGKRPLTIGSPRDKLVQEILRIVLEAIYEPLFNTASHGFRPGRSCHSALRSIFTNFKGCTWWIEGDIKACFDSIPHDKLIALLSSKIKDQRFIQLIRKALNAGYLTENRYKYDIVGTPQGSIVSPILANIYLHQLDEFIENLKSEFDYKGPIARKRTSESRHLHYLMAKAKRENADSKTIRKIAIEMRNVPNKIHGIQSNKLMYVRYADDWIVAVNGSYTQTKEILAKITCFCSSIGLTVSPTKTKITNSYTDKILFLGTNISHSKNVTFSRHFGILQRNSGFILLSAPMDRIAKKLRETGLMLNHKGRSVIRWLPLDVRQIIGLANSIIRGYDNYYSFVHNRGRFATYVYFIIKDCVLRTLAHKLSLGTRMKVIKKFGPDLSIYDYNSRDENNKPKLITQLFKPSWKVNVWGFKSDKVKLNIRTLYASHLSMANLENLQCAACQSTYKVEMHHVRQMKNLKPIKGTLDYLMAKANRKQIPLCRSCHMKLHANKLTLNEDKKV*
fasta

SPMIT.08
contig
MQKNNLKNLITTIVTNAFFNQKANFSIPLKGVIGEKRPSILIGNININFKSDSLIEVSFPYYPLLNKNYPNPSIISNIIQKALSNHLLYSSKNYSFIVNIRALPISTPYGSSLIFSKYIAIIIGSNPKIASTLWIDPKRFINLPKLQSDSIFKILGLNVPKGWKGIHISLNLIK*NSLSSRGRITNIIKGSVPLTNNSNGYDESSLAIYSKMGTIQIKVRLSYSSNL*
                                                                                                                                                                              |                                                     
MQKNNLKNLITTIVTNAFFNQKANFSIPLKGVIGEKRPSILIGNININFKSDSLIEVSFPYYPLLNKNYPNPSIISNIIQKALSNHLLYSSKNYSFIVNIRALPISTPYGSSLIFSKYIAIIIGSNPKIASTLWIDPKRFINLPKLQSDSIFKILGLNVPKGWKGIHISLNLIKWNSLSSRGRITNIIKGSVPLTNNSNGYDESSLAIYSKMGTIQIKVRLSYSSNL*
fasta
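
The UGA discrepancy shown in these three entries can be reproduced on a toy sequence. This is only a minimal sketch, not the code used in allele_qc: the helper names (`build_table`, `translate`) and the toy CDS are made up for illustration, and the amino-acid string is the standard code written in the NCBI codon order (bases T, C, A, G).

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, amino acids listed in NCBI codon order (TTT, TTC, TTA, TTG, TCT, ...).
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def build_table(aas, overrides=None):
    """Map each codon to an amino acid, optionally overriding some codons."""
    codons = ["".join(c) for c in product(BASES, repeat=3)]
    table = dict(zip(codons, aas))
    table.update(overrides or {})
    return table


def translate(dna, table):
    """Translate in frame 0, ignoring any trailing partial codon."""
    dna = dna.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3
    return "".join(table[dna[i:i + 3]] for i in range(0, usable, 3))


standard = build_table(STANDARD_AAS)
# Hypothetical "UGA is tryptophan" variant of the standard code.
uga_is_trp = build_table(STANDARD_AAS, {"TGA": "W"})

toy_cds = "ATGTTATGAAGAAAATAA"  # made-up sequence with an internal TGA
print(translate(toy_cds, standard))    # ML*RK*
print(translate(toy_cds, uga_is_trp))  # MLWRK*
```

With the standard table the internal TGA shows up as `*`, which is the same pattern as the single mismatches in SPMIT.03, SPMIT.06 and SPMIT.08 above.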

However, it seems that UGA is the only different codon we are using for the translation of mitochondrial genes. If I use the yeast mitochondrial genetic code from the link you shared (https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG3), I get that stop codon right, but other residues also change, and it does not seem we account for that in PomBase (see below). Are these different codons also used in pombe, or are they only in cerevisiae?

SPMIT.01
contig
MNSWWTYVNRWMFSTNAKDIAMTYLLFGLVSGMIGSVFSFMIRMETSAPGSQFTSGNGQLYNVAISAHGMTMIFFFIIPALFGAFGNYLVPTMMGAPDVAYPRVNNFTFWLTPPATMTLLISALTEEGPGGGWTVYPPTSSITSHSGPAIDTAILSLQLTGISSTLGSVNLMATMINMRAPGLSLYQMPLFAWAMMITSITLTTTLPVLAGGLFMLFSDRNLNTSFYAPEGGGDPVTYQHLFWFFGHPEVYILIMPAFGVVSHIIPSLAHKPIFGKEGMTWAMLSIALLGLMVWSHHLFTVGLDVDTRAYFSAATMVIAIPTGIKIFSWLATLTGGAMQWSRVPMLYAIGFLILFTIGGLTGVMLSNSVLDIAFHDTYFVVAHFHYVTSMGALFGTCGAYYYWSPKMFGLMYNETTASIQFWILFMGVNIVFGPQHFLGLNGMPRRMPDYPDAFVGWNFVSSIGSVISITSLFTFMYVMYDQFTSNRVVKTNPYLIPSYFDDNVIFVNEKLGVAQSMEWLTHSPVHEHAFNTLPTKSI*
           |         ||         |       |    |       |               ||                    | |                 |   | |                    |            |                   |                      |     | ||                                |                                          |                                                         |                         |                       |       |                   |         |                    |                      |   |                                          |   |                  
MNSWWTYVNRWIFSTNAKDIAILYLLFGLVSGIIGSVFSFIIRMELSAPGSQFLSGNGQLYNVAISAHGILMIFFFIIPALFGAFGNYLVPLMIGAPDVAYPRVNNFTFWLLPPALMLLLISALTEEGPGGGWTVYPPLSSITSHSGPAIDLAILSLQLTGISSTLGSVNLIATMINMRAPGLSLYQMPLFAWAIMITSILLLLTLPVLAGGLFMLFSDRNLNTSFYAPEGGGDPVLYQHLFWFFGHPEVYILIMPAFGVVSHIIPSLAHKPIFGKEGMLWAMLSIALLGLMVWSHHLFTVGLDVDTRAYFSAATMVIAIPTGIKIFSWLATLTGGAIQWSRVPMLYAIGFLILFTIGGLTGVILSNSVLDIAFHDTYFVVAHFHYVLSMGALFGLCGAYYYWSPKMFGLMYNETLASIQFWILFIGVNIVFGPQHFLGLNGMPRRIPDYPDAFVGWNFVSSIGSVISILSLFLFMYVMYDQFTSNRVVKTNPYLIPSYFDDNVIFVNEKLGVAQSIEWLLHSPVHEHAFNTLPTKSI*
fasta

SPMIT.02
contig
LKPQTKLSENSSRCEKVTYSEVITQLIYFLTSKKITNLGKIRTVKSIRDSFTSQLENILCFFLVYRTTYSFGVCLMKRFLFNKFFNRHPFTRVKSCFSSSSPSKFSFTQWLVGFTDGDGCFSISKQKMKNGKNKWSTTFKLTQNTYNYRILYFIKRNLGIGSTYKESSTNTVMYRLRRREHTKKIMDIFDQFPTLTKKYWDYYLFKKAFLILEDANTNSFEKNSKTEEIRMEKKSLKQYSPVNLEKYLTKSWLIGFIEAEGSFYLTQKSPVRMIHGFEITQNYEQPTTAQISEFTFNSQISPKMKSKKNSLITNYSLSTSSKERMLFTSSYFENCFKGVKSLEFKIWSRSLRKNYNFEQTLRARDLIRKLKNKYSRGSQHPKDK*
                 |                        |        |                                                                           |        |       |                 |         |        |   |       |                      |        |    |                                  |      |             ||      |        |                       |                               |                         
LKPQTKLSENSSRCEKVLYSEVITQLIYFLTSKKITNLGKIRLVKSIRDSFLSQLENILCFFLVYRTTYSFGVCLMKRFLFNKFFNRHPFTRVKSCFSSSSPSKFSFTQWLVGFTDGDGCFSISKQKIKNGKNKWSLTFKLTQNLYNYRILYFIKRNLGIGSLYKESSTNTVIYRLRRREHLKKIIDIFDQFPLLTKKYWDYYLFKKAFLILEDANLNSFEKNSKLEEIRIEKKSLKQYSPVNLEKYLTKSWLIGFIEAEGSFYLLQKSPVRIIHGFEITQNYEQPLLAQISEFLFNSQISPKIKSKKNSLITNYSLSTSSKERMLFLSSYFENCFKGVKSLEFKIWSRSLRKNYNFEQLLRARDLIRKLKNKYSRGSQHPKDK*
fasta

SPMIT.03
contig
MLKQLSMSAGNTLNKGTSETLRNEITTKKVSIHTPKHLKPANDSQFGHYLAGLIDGDGHFSSKQQLIIAFHSLDMQLAYYIKKQMGYGIVRKIKDKNAITFMMANSKGIERVMTLINNKFRTTSKYNQIINNMFAHPRFKEFSKTITLGLNSNNNLNNHWTAGFSDADASFQIKILNRDKKMEVRLNYQMDQKKEYTLSLIKDNLGGNMGYRKSQDTYYYGSTSFGSAKKVINYFDNYHLTSSKYISYLKWRKAYTIIQENKHLTESGTSQMKKPHPYRKNIN*
      |    |                     |                                        |         |              | ||         |                   |                           |                    |       |      |           |                               |              |            |  |            
MLKQLSISAGNLLNKGTSETLRNEITTKKVSIHLPKHLKPANDSQFGHYLAGLIDGDGHFSSKQQLIIAFHSLDIQLAYYIKKQIGYGIVRKIKDKNAILFIIANSKGIERVITLINNKFRTTSKYNQIINNIFAHPRFKEFSKTITLGLNSNNNLNNHWLAGFSDADASFQIKILNRDKKIEVRLNYQIDQKKEYLLSLIKDNLGGNIGYRKSQDTYYYGSTSFGSAKKVINYFDNYHLLSSKYISYLKWRKAYLIIQENKHLTESGLSQIKKPHPYRKNIN*
fasta

SPMIT.04
contig
MNTSTKFQGHPYHIVSASPWPFFLSVVLFFNCLAATLYLHGYKHSSVFFGISFLGLLATMYLWFRDMSTEANIHGAHTKAVTKGLKMGFMLFTISETFLFASIFWAFFHSSLSPTFELGAVWPPVGMADKTMDPLEVPTLNTVILLTSGASLTYAHYSLIARNRENALKGLYMTIALSFLFLGGQAYEYWNAPFTISDSVYGASFYFATGTHGIHIIVGTITTTVATYNIYTYHLTNTHHNGFECGIYYWHFCDVVWLFTYLTIYIWGS*
  |                                                                                   |     |                                 |    |      |                                                                       |          |||                                   |          
MNLSTKFQGHPYHIVSASPWPFFLSVVLFFNCLAATLYLHGYKHSSVFFGISFLGLLATMYLWFRDMSTEANIHGAHTKAVTKGLKIGFMLFLISETFLFASIFWAFFHSSLSPTFELGAVWPPVGIADKTIDPLEVPLLNTVILLTSGASLTYAHYSLIARNRENALKGLYMTIALSFLFLGGQAYEYWNAPFTISDSVYGASFYFATGLHGIHIIVGTILLLVATYNIYTYHLTNTHHNGFECGIYYWHFCDVVWLFLYLTIYIWGS*
fasta

SPMIT.05
contig
MKILKSNPFLALANNYMIDAPEPSNISYFWNFGSTLACVLVIQIVTGMTLACFYIPNMDLAFTSVERIVRDVNYGFLLRAFHANGASFFFIFLYLHMGRGLYYGSYKYPRTMTWNIGVIIFLLTIITAFLGYCLPANQMSFWGATVITNLLSAVPFIGDDLVHTLWGGFSVSNPTLNRFFSTHYLMPFVIAALSVMHLIATHTNGSSNPLGVTANMDRIPMNPYYTMKDLMTMFIFLIGMNYMAFYNPYGFMEPDCALPADPTKTPMSIVPEWYLLPFYAILRAMPNFQLGVMAMLLSILVLTLLPLLDFSAIRGNSFNPFGKFFFWTFVADFVITAWIGGSHPENVFITIGAIATIFYFSYFFILMPVYTMLGNTLIDLNLSSIKR*
                                  |            ||             |                                 |                                                                  |                 |                  |                        ||   | |      |                      |                     |       |         |                                |                              |    |                
MKILKSNPFLALANNYMIDAPEPSNISYFWNFGSLLACVLVIQIVTGILLACFYIPNMDLAFLSVERIVRDVNYGFLLRAFHANGASFFFIFLYLHIGRGLYYGSYKYPRTMTWNIGVIIFLLTIITAFLGYCLPANQMSFWGATVITNLLSAVPFIGDDLVHLLWGGFSVSNPTLNRFFSLHYLMPFVIAALSVMHLIALHTNGSSNPLGVTANMDRIPMNPYYLIKDLITIFIFLIGINYMAFYNPYGFMEPDCALPADPLKTPMSIVPEWYLLPFYAILRAIPNFQLGVIAMLLSILVLLLLPLLDFSAIRGNSFNPFGKFFFWTFVADFVILAWIGGSHPENVFITIGAIATIFYFSYFFILIPVYTILGNTLIDLNLSSIKR*
fasta

SPMIT.06
contig
LRRCGMYVYPHRERDILCVKMWTMHLGSWGNPMPNRACVQKVTPVTKQMSSDGSVQMDTVRAVLPEFQFPSHPQMGDCTSWMETFFSRSLVGFYDQGYTPGEESCTNSTIKGMSGKPTSINSNMYTTTGPAKVSNDYAVRDPGVAVDHFDQYGPLKEGRSLNSAKISTQWSGSATLKSSNRSMFNIGLGYINTFLGVSNVRGFSTGSGRSKNVTNKTDDLSKRSKNYPNTVIDRNTYKDFTLNRDMFTIAYNKLKSNPGMMTPGTKPDTLDGMSMDVIDKIIQSTKSEEFNFTPGRRILIDKASGGKRPLTIGSPRDKLVQEITRMVTEAIYEPLFNTASHGFRPGRSCHSATRSIFTNFKGCTWWIEGDIKACFDSIPHDKLIATLSSKIKDQRFIQLIRKALNAGYTTENRYKYDIVGTPQGSMVSPILANIYTHQLDEFIENTKSEFDYKGPMARKRTSESRHTHYTMAKAKRENADSKTIRKMAIEMRNVPNKMHGIQSNKTMYVRYADDWMVAVNGSYTQTKEILAKITCFCSSIGTTVSPTKTKMTNSYTDKMTFTGTNISHSKNVTFSRHFGMTQRNSGFILTSAPMDRIAKKTRETGTMTNHKGRSVIRWLPTDVRQIIGLANSIIRGYDNYYSFVHNRGRFATYVYFMIKDCVTRTLAHKLSLGTRMKVIKKFGPDTSIYDYNSRDENNKPKLITQLFKPSWKVNVWGFKSDKVKLNIRTTYASHLSMANLENTQCAACQSTYKVEMHHVRQMKNTKPIKGTLDYLMAKANRKQIPLCRSCHMKTHANKLTTNEDKKV*
     |              |  |                  |     |       |                 |   |  |                                         |                                                          |                              |  |            |     |    |      |                |         |         |                                      | | |                        |                                |                      |                |         |         |         |          |  |                |          |       |         |                         |        |       || |                 ||        |          |    | |            |                                   |     |                      |                                           |            |                     |                            |      |       
LRRCGIYVYPHRERDILCVKIWTIHLGSWGNPMPNRACVQKVLPVTKQISSDGSVQIDTVRAVLPEFQFPSHPQIGDCLSWIETFFSRSLVGFYDQGYTPGEESCTNSTIKGMSGKPTSINSNIYTTTGPAKVSNDYAVRDPGVAVDHFDQYGPLKEGRSLNSAKISTQWSGSATLKSSNRSIFNIGLGYINTFLGVSNVRGFSTGSGRSKNVLNKLDDLSKRSKNYPNLVIDRNLYKDFLLNRDMFLIAYNKLKSNPGMMTPGLKPDTLDGMSIDVIDKIIQSLKSEEFNFTPGRRILIDKASGGKRPLTIGSPRDKLVQEILRIVLEAIYEPLFNTASHGFRPGRSCHSALRSIFTNFKGCTWWIEGDIKACFDSIPHDKLIALLSSKIKDQRFIQLIRKALNAGYLTENRYKYDIVGTPQGSIVSPILANIYLHQLDEFIENLKSEFDYKGPIARKRTSESRHLHYLMAKAKRENADSKTIRKIAIEMRNVPNKIHGIQSNKLMYVRYADDWIVAVNGSYTQTKEILAKITCFCSSIGLTVSPTKTKITNSYTDKILFLGTNISHSKNVTFSRHFGILQRNSGFILLSAPMDRIAKKLRETGLMLNHKGRSVIRWLPLDVRQIIGLANSIIRGYDNYYSFVHNRGRFATYVYFIIKDCVLRTLAHKLSLGTRMKVIKKFGPDLSIYDYNSRDENNKPKLITQLFKPSWKVNVWGFKSDKVKLNIRTLYASHLSMANLENLQCAACQSTYKVEMHHVRQMKNLKPIKGTLDYLMAKANRKQIPLCRSCHMKLHANKLTLNEDKKV*
fasta

SPMIT.07
contig
MFITSPLEQFELNNYFGFYTFNYHFDFSNFGFYLGLSALIAISLAMMNLTPYGSGAKMVPQKFGMAMEAIYFTMLNLVENQIHSSKTVSGQSYFPFIWSLFVLILFSNFLGLIPYGYATTAQLIFTLGLSISILIGATILGLQQHKAKFFGLFLPSGTPTPLIPTLVLIEFVSYIARGLSLGIRLGANIMAGHLTMSILGGLIFTFMGLNTITFIIGFLPITVLVAISLLEFGIAFIQAYVFAILTCGFINDSLNTH*
                   |                         ||          |      |                                                                                                   |                        |                    |                                            |  
MFITSPLEQFELNNYFGFYLFNYHFDFSNFGFYLGLSALIAISLAIINLTPYGSGAKIVPQKFGIAMEAIYFTMLNLVENQIHSSKTVSGQSYFPFIWSLFVLILFSNFLGLIPYGYATTAQLIFTLGLSISILIGATILGLQQHKAKFFGLFLPSGTPTPLIPLLVLIEFVSYIARGLSLGIRLGANIIAGHLTMSILGGLIFTFMGLNLITFIIGFLPITVLVAISLLEFGIAFIQAYVFAILTCGFINDSLNLH*
fasta

SPMIT.08
contig
MQKNNLKNLITTIVTNAFFNQKANFSMPTKGVIGEKRPSILMGNINMNFKSDSLIEVSFPYYPLTNKNYPNPSIMSNIMQKATSNHTLYSSKNYSFIVNIRATPISTPYGSSLIFSKYMAMMMGSNPKMASTLWMDPKRFINLPKLQSDSMFKILGLNVPKGWKGIHISLNLIKWNSTSSRGRMTNMIKGSVPLTNNSNGYDESSLAIYSKMGTIQIKVRLSYSSNT*
                          | |            |    |                 |         |   |   |   |               |               | |||     |     |               |                          |     |  |                                       | 
MQKNNLKNLITTIVTNAFFNQKANFSIPLKGVIGEKRPSILIGNININFKSDSLIEVSFPYYPLLNKNYPNPSIISNIIQKALSNHLLYSSKNYSFIVNIRALPISTPYGSSLIFSKYIAIIIGSNPKIASTLWIDPKRFINLPKLQSDSIFKILGLNVPKGWKGIHISLNLIKWNSLSSRGRITNIIKGSVPLTNNSNGYDESSLAIYSKMGTIQIKVRLSYSSNL*
fasta

SPMIT.09
contig
MPQLVPFYFINILSFGFLIFTVLLYISSVYVLPRYNELFISRSIMSSL*
                                            |    
MPQLVPFYFINILSFGFLIFTVLLYISSVYVLPRYNELFISRSIISSL*
fasta

SPMIT.10
contig
MMQAAKYIGAGLATIGVSGAGVGIGLIFSNLISGTSRNPSVRPHLFSMAITGFALTEATGLFCLMLAFLIIYAA*
 |                                                |                        
MIQAAKYIGAGLATIGVSGAGVGIGLIFSNLISGTSRNPSVRPHLFSMAILGFALTEATGLFCLMLAFLIIYAA*
fasta

SPMIT.11
contig
MLFFNSMLNDAPSSWATYFQDGASPSYLGMTHLNDYTMFYTTFIFIGVIYAICKAVMEYNYNSHPIAAKYTTHGSIVEFIWTLIPALILILVALPSFKTLYLLDEVQKPSMTVKAIGRQWFWSYELNDFVTNENEPVSFDSYMVPEEDLEEGSTRQLEVDNRLVTPMDTRMRLILTSGDVIHSWAVPSLGMKCDCIPGRLNQVSLSIDREGLFYGQCSETCGVTHSSMPIVVQGVSTEDFLAWLEENS*
      |         |            |      |   |               |                                         |                                                      |          | |   |                   |                            |   |            |            
MLFFNSILNDAPSSWALYFQDGASPSYLGITHLNDYLMFYLTFIFIGVIYAICKAVIEYNYNSHPIAAKYTTHGSIVEFIWTLIPALILILVALPSFKLLYLLDEVQKPSMTVKAIGRQWFWSYELNDFVTNENEPVSFDSYMVPEEDLEEGSLRQLEVDNRLVLPIDTRIRLILTSGDVIHSWAVPSLGIKCDCIPGRLNQVSLSIDREGLFYGQCSELCGVLHSSMPIVVQGVSLEDFLAWLEENS*
fasta
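
For reference, a comparison block like the ones above (name, contig translation, marker line, fasta sequence) can be generated with a few lines of Python. This is a hedged sketch rather than the actual allele_qc code; the function names are hypothetical, and the example sequences are the SPMIT.09 pair shown above.

```python
def mismatch_markers(seq_a, seq_b):
    """Return a line with '|' at every position where the two sequences differ."""
    length = max(len(seq_a), len(seq_b))
    # Pad the shorter sequence so that length differences are also flagged.
    a = seq_a.ljust(length)
    b = seq_b.ljust(length)
    return "".join("|" if x != y else " " for x, y in zip(a, b))


def format_comparison(name, contig_translation, fasta_translation):
    """Build the text block used in this thread for one gene."""
    return "\n".join([
        name,
        "contig",
        contig_translation,
        mismatch_markers(contig_translation, fasta_translation),
        fasta_translation,
        "fasta",
    ])


contig = "MPQLVPFYFINILSFGFLIFTVLLYISSVYVLPRYNELFISRSIMSSL*"
fasta = "MPQLVPFYFINILSFGFLIFTVLLYISSVYVLPRYNELFISRSIISSL*"
markers = mismatch_markers(contig, fasta)
print(format_comparison("SPMIT.09", contig, fasta))
print(markers.count("|"), "mismatch(es)")
```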

manulera commented 1 year ago

Also, I noticed that in both cases SPMIT.06 starts with L, but the same entry in UniProt starts with M: https://www.uniprot.org/uniprotkb/P05511/entry#sequences

ValWood commented 1 year ago

Oh, I don't know about this. I always thought that we used the yeast translation table. I have been Googling, but this doesn't help: https://en.wikipedia.org/wiki/List_of_genetic_codes

@kimrutherford Or @Lilindu may know.

ValWood commented 1 year ago

Chatbot says

Schizosaccharomyces pombe, also known as fission yeast, uses the standard mitochondrial genetic code (NCBI code number 2). This code is identical to the standard nuclear genetic code (NCBI code number 1), with the exception of the codons UAG, UGA, and UAA, which are normally stop codons in the nuclear code, but are instead assigned to the amino acids leucine, tryptophan, and glutamine in the mitochondrial code.

ValWood commented 1 year ago

I apologize for not providing a citation for my previous response. The information I provided about the mitochondrial genetic code used by Schizosaccharomyces pombe is widely accepted in the scientific literature, and can be found in many sources. One example is the NCBI website, which states that the mitochondrial genetic code used by S. pombe is the standard mitochondrial code (NCBI code number 2).

https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=t#SG2

This is a secondary resource, primary research articles or textbooks would be more reliable sources.

ValWood commented 1 year ago

should be in here: https://pubmed.ncbi.nlm.nih.gov/2065367/

ValWood commented 1 year ago

It seems to be a bit more complicated:

It uses the “mitochondrial” genetic code, i.e., it contains a TGA codon, whereas all other protein-encoding genes, and all but one intronic open reading frame, use the “standard” genetic code (UGG for tryptophan).

manulera commented 1 year ago

Maybe this is the right reference?

https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=info&id=4896

ValWood commented 1 year ago

SPMIT.06 starts with L,

Yep this is the intronic ORF. Kim will need to do some manipulation to fix this in Chado, I think because it seems to be a hybrid translation table!

manulera commented 1 year ago

Maybe this is the right reference?

This seems to be one that only accounts for UGA being W. If I use that, I don't get any discrepancies with the PomBase translations.

kimrutherford commented 1 year ago

Kim will need to do some manipulation to fix this in Chado, I think because it seems to be a hybrid translation table!

Is this what you need?: https://www.ebi.ac.uk/ena/WebFeat/qualifiers/transl_except.html

kimrutherford commented 1 year ago

Maybe this is the right reference? https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=info&id=4896

I double checked the loading code. We are using translation table 4 like that reference says we should.

kimrutherford commented 1 year ago

SPMIT.06 starts with L,

Hi Val. Do you know why it is "M" in UniProt? Perhaps it's a manual fix at their end?

lilindu commented 1 year ago

https://pubmed.ncbi.nlm.nih.gov/31364709/ Protein-coding genes in the assembled mitogenomes were annotated using MFannot based on genetic code 4 (the only difference between genetic code 4 and the standard code is UGA being a tryptophan codon, not a stop codon)

https://pubmed.ncbi.nlm.nih.gov/12527786/ TGA codons are found in rps3 of S.pombe (one of three trp codons is TGA, 33%) and in the intronic ORFs encoded in the mtDNAs of all three species (12 of 71 codons in intronic ORFs are TGA, 17%). If UGA was read as a stop codon during translation, this would result in truncated protein products, possibly compromising mitochondrial function. In fact, a truncated rps3 transcript has been shown to cause a respiratory‐deficient phenotype in S.pombe. Based on protein sequence alignments (not shown), ATA codes for isoleucine in S.pombe and S.japonicus var. japonicus (following the universal translation code) as opposed to methionine, as it does in animal and S.cerevisiae mitochondria.
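
To make the difference between the candidate tables concrete, here is a small sketch that lists the codons on which two tables disagree. The amino-acid strings are my transcription of NCBI tables 1, 3 and 4 (codon order with bases T, C, A, G) and should be checked against the NCBI page; `diff_tables` is a hypothetical helper name.

```python
from itertools import product

BASES = "TCAG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]

# Transcribed from the NCBI genetic codes page; treat as an assumption to verify.
TABLES = {
    1: "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",  # standard
    3: "FFLLSSSSYY**CCWWTTTTPPPPHHQQRRRRIIMMTTTTNNKKSSRRVVVVAAAADDEEGGGG",  # yeast mitochondrial
    4: "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",  # mold/protozoan mitochondrial
}


def diff_tables(a, b):
    """Return {codon: (aa_in_a, aa_in_b)} for codons translated differently."""
    return {
        codon: (x, y)
        for codon, x, y in zip(CODONS, TABLES[a], TABLES[b])
        if x != y
    }


print(diff_tables(1, 4))  # {'TGA': ('*', 'W')}
print(diff_tables(4, 3))  # CTT, CTC, CTA, CTG: L -> T; ATA: I -> M
```

The second diff matches the extra mismatches seen earlier in the thread when table 3 was used: L in the fasta files became T, and I became M.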

lilindu commented 1 year ago

SPMIT.06 is encoded by a mitochondrial intron. Intron-encoded proteins (IEPs) usually are translated as a chimeric protein fused at the N-terminus with the protein encoded by upstream exon(s). Thus, the translation start codon of SPMIT.06 should be the start codon of cob1 / SPMIT.05

https://pubmed.ncbi.nlm.nih.gov/31364709/ In S. pombe, all previously analyzed mitochondrial intron IEPs are thought to be translated as fusions with upstream exons, as the coding sequences of IEPs are always in-frame with 5′ exons (Schäfer 2003). We observed an exception to this rule in several cox1-I1b′ sequences. In three MT types (MT52, MT53, and MT66), the LAGLIDADG-domain-coding sequences in cox1-I1b′ are out-of-frame with 5′ exons due to a one-nucleotide insertion about 70 bp upstream of the LAGLIDADG domains (supplementary fig. S3). This observation raises the possibility that S. pombe mitochondrial intron IEPs may not always be translated as in-frame extensions of the preceding exons.

https://pubmed.ncbi.nlm.nih.gov/21481283/ many proteins encoded within introns in organellar genomes are initially translated as fusions with upstream exon sequences, requiring subsequent proteolytic processing to provide an active protein with an amino terminus in domain IV [40, 41, 66]. Little is known about the molecular machinery required for this process,

https://pubmed.ncbi.nlm.nih.gov/34440770/ Many intron sequences that encode proteins are fused to the upstream exons of their host genes. This has been referred to as ‘core creep’ where the intron ORF over time has incorporated upstream intronic sequences to fuse in-frame to the upstream exon [204]. This fusion would allow the intron encoded protein to be more efficiently expressed, as it gains regulatory sequences of the host gene that optimize translation [138, 204].

lilindu commented 1 year ago

In our annotation of SPMIT.06 (cob-I1), the range of the CDS is written as "<10859..13282" to indicate that the annotated CDS is the C-terminal part of the fusion protein.

https://www.ncbi.nlm.nih.gov/nuccore/MK618072

     intron          10859..13384
                     /gene="cob"
                     /note="group II intron"
                     /number=1
     gene            <10859..13282
                     /gene="cob-I1"
     CDS             <10859..13282
                     /gene="cob-I1"
                     /note="intron-encoded protein"
                     /codon_start=1
                     /transl_table=4
                     /product="reverse transcriptase domain-containing protein"
                     /protein_id="QDP17088.1"
                     /translation="LRRCGIYVYPHRERDILCVKIWTIHLGSWGNPMPNRACVQKVLP
                     VTKQISSDGSVQIDTVRAVLPEFQFPSHPQIGDCLSWIETFFSRSLVGFYDQGYTPGE
                     ESCTNSTIKGMSGKPTSINSNIYTTTGPAKVSNDYAVRDPGVAVDHFDQYGPLKEGRS
                     LNSAKISTQWSGSATLKSSNRSIFNIGLGYINTFLGVSNVRGFSTGSGRSKNVLNKLD
                     DLSKRSKNYPNLVIDRNLYKDFLLNRDMFLIAYNKLKSNPGMMTPGLKPDTLDGMSID
                     VIDKIIQSLKSEEFNFTPGRRILIDKASGGKRPLTIGSPRDKLVQEILRIVLEAIYEP
                     LFNTASHGFRPGRSCHSALRSIFTNFKGCTWWIEGDIKACFDSIPHDKLIALLSSKIK
                     DQRFIQLIRKALNAGYLTENRYKYDIVGTPQGSIVSPILANIYLHQLDEFIENLKSEF
                     DYKGPIARKRTSESRHLHYLMAKAKRENADSKTIRKIAIEMRNVPNKIHGIQSNKLMY
                     VRYADDWIVAVNGSYTQTKEILAKITCFCSSIGLTVSPTKTKITNSYTDKILFLGTNI
                     SHSKNVTFSRHFGILQRNSGFILLSAPMDRIAKKLRETGLMLNHKGRSVIRWLPLDVR
                     QIIGLANSIIRGYDNYYSFVHNRGRFATYVYFIIKDCVLRTLAHKLSLGTRMKVIKKF
                     GPDLSIYDYNSRDENNKPKLITQLFKPSWKVNVWGFKSDKVKLNIRTLYASHLSMANL
                     ENLQCAACQSTYKVEMHHVRQMKNLKPIKGTLDYLMAKANRKQIPLCRSCHMKLHANK
                     LTLNEDKKV"
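
For a QC script, the "<" in a location such as "<10859..13282" could be used to skip the "protein must start with M" check for CDSs annotated as partial at the start. The sketch below is only an illustration under assumptions: it handles simple `start..end` locations only (no `join(...)` or `complement(...)`), and the function names and the start-codon rule are hypothetical, not taken from allele_qc or Chado.

```python
import re

# Simple INSDC-style location: optional '<' before the start, optional '>' before the end.
LOCATION_RE = re.compile(r"^(<?)(\d+)\.\.(>?)(\d+)$")


def parse_simple_location(location):
    """Parse 'start..end' with optional partial markers; other forms are rejected."""
    match = LOCATION_RE.match(location.strip())
    if match is None:
        raise ValueError(f"Unsupported location: {location!r}")
    return {
        "start": int(match.group(2)),
        "end": int(match.group(4)),
        "start_partial": match.group(1) == "<",
        "end_partial": match.group(3) == ">",
    }


def start_codon_issue(location, protein):
    """Return a message if a complete-start CDS does not begin with M, else None."""
    parsed = parse_simple_location(location)
    if parsed["start_partial"]:
        # Annotated as the C-terminal part of a longer product: no M expected.
        return None
    if not protein.startswith("M"):
        return f"protein starts with {protein[:1]!r}, expected 'M'"
    return None


cob_i1 = parse_simple_location("<10859..13282")
print(cob_i1)
print((cob_i1["end"] - cob_i1["start"] + 1) % 3)  # 0, the range is a whole number of codons
print(start_codon_issue("<10859..13282", "LRRCGIYVYP"))  # None
print(start_codon_issue("10859..13282", "LRRCGIYVYP"))   # flagged
```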

ValWood commented 1 year ago

Of course!

ValWood commented 1 year ago

Actually I'm still a bit confused. I thought SPMIT.06 encodes a nuclease which directs its own excision?

ValWood commented 1 year ago

OK - SPMIT.06 is annotated incorrectly, it's a reverse transcriptase

I'm still trying to figure out the order of events!

lilindu commented 1 year ago

For early papers on mitochondrial introns, please see the following paragraph. https://pubmed.ncbi.nlm.nih.gov/31364709/ In S. pombe, there are seven previously known mitochondrial introns, which are called cox1-I1a, cox1-I1b, cox1-I2a, cox1-I2b, cox1-I3, cob-I1, and cox2-I1 (Schäfer 2003) (fig. 1B). Three of them, cox1-I1b (Schäfer et al. 1991), cox1-I2b (Lang 1984), and cob-I1 (Lang et al. 1985), are present in the Leupold strain. The other four introns, cox1-I1a (Schäfer and Wolf 1999), cox1-I2a (Trinkl and Wolf 1986), cox1-I3 (Trinkl and Wolf 1986), and cox2-I1 (Schäfer et al. 1998), are absent in the Leupold strain.

lilindu commented 1 year ago

The ancestral forms of mitochondrial introns are ribozymes without any protein-coding sequences and can catalyze their own splicing.

IEPs are another type of selfish element that often invades mitochondrial introns.

Some IEPs are homing endonucleases that can cut intron-free host genes to promote the mobilization of IEP-containing introns.

https://pubmed.ncbi.nlm.nih.gov/21481283/ In the case of group I and II introns, the host-parasite relationship is enriched by the fact that the introns themselves have been invaded by smaller parasitic elements - genes that encode mobility-promoting activities that enable the DNA element to move within and between genomes [10]. Thus, at least two levels of parasitism exist for mobile introns: the intron in the host gene it interrupts, and the invading gene in the intron. Collectively, the intron and its encoded mobility protein (often termed an intron-encoded protein, IEP) collaborate to form a composite mobile element that utilizes host DNA replication, recombination and repair pathways to spread [11], while the ribozyme activity ensures that it does not disrupt the function of genes into which it is inserted.

ValWood commented 1 year ago

So the intronic RNA is a self-splicing nuclease (to protect the host gene), and the encoded protein, which is translated tandemly with cox1, is a reverse transcriptase.

That's wild.

lilindu commented 1 year ago

I need to make a correction. For group I introns with homing endonucleases as IEPs, the model that IEP-free self-splicing ribozymes arose first and were later invaded by homing endonucleases is widely accepted. But for group II introns with reverse transcriptases (RTs) as IEPs, the evolutionary origin is not yet settled. There is an alternative model proposing that group II introns evolved from a pre-existing RT-containing retroelement that could not self-splice.

https://journals.asm.org/doi/full/10.1128/microbiolspec.MDNA3-0050-2014 Although it is often conjectured that group II introns and other ribozymes arose in a primordial “RNA world,” wherein catalysis was performed by RNA, the relationship of group II introns to the RNA world remains uncertain (193). The order of events for evolving mobile group II introns is also obscure, and appears to have occurred independently in at least two ways: through acquisition of an RT that imparted RNA-based mobility needed for intron dispersal in bacteria, and through acquisition of a LAGLIDADG DNA endonuclease that likely conferred the ability to mobilize the intron via DNA-based recombination in both bacteria and fungi. In both cases, the DNA encoding the catalytic RNA splicing apparatus is assumed to have preexisted an invasion event by different genes coding for mobilizing activities. But for RT-driven mobility, a coevolutionary scenario is also possible, where self-splicing ability developed from a retroelement under selective pressure to minimize transposon damage to the genome (194).

ValWood commented 1 year ago

This is super interesting. I will try to improve the basic annotation and product labels on these shortly. I have a ticket so I won't forget: https://github.com/pombase/curation/issues/3440