Closed dudcha closed 1 year ago
Hi Olga,
Good catch. You likely did
zcat proteinAlignments.fa.gz | g PROT
What are you trying to do? If you want to extract the alignments of LEPROT, you can grep for that. Or, more specifically, grep for the ENST... transcript.
Michael
We just noticed CODON entries in prot.fasta and traced them to this. -Olga
Not tested, but perhaps it boils down to lines 242 and 243 in merge_cesar_output.py? Change if "PROT" -> if "| PROT |"?
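A minimal sketch of why that change would help, assuming the merge step decides which records go into prot.fasta with a plain substring test on the FASTA header. The two function names are hypothetical and this is not the actual code from merge_cesar_output.py; the headers are copied from the prot.fasta output in this thread.

```python
# Headers copied from the prot.fasta output shown in this thread.
headers = [
    ">ENST00000518192.LEPROTL1.431 | PROT | REFERENCE",
    ">ENST00000518192.LEPROTL1.431 | PROT | QUERY",
    ">ENST00000518192.LEPROTL1.431 | CODON | REFERENCE",
    ">ENST00000518192.LEPROTL1.431 | CODON | QUERY",
]


def is_prot_loose(header):
    # Hypothetical stand-in for the suspected check: plain substring match.
    # "LEPROTL1" contains "PROT", so CODON headers of this gene match too.
    return "PROT" in header


def is_prot_strict(header):
    # Suggested check: match only the delimited record-type field.
    return "| PROT |" in header


loose = [h for h in headers if is_prot_loose(h)]
strict = [h for h in headers if is_prot_strict(h)]

print(len(loose), len(strict))  # prints: 4 2
```

With the loose check all four headers are treated as protein records; with the delimited check only the two PROT records are.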
Alright, this is something @kirilenkobm should please have a look at.
Bogdan: This affects new TOGA runs, not the download data we provide. In our recent runs, prot.fasta looks like
g codon -i prot.fasta -C 5
>ENST00000518192.LEPROTL1.431 | PROT | REFERENCE
MKWLAVSEESRRDLDGHTRFAGVCIRER-------ALISLSFGGAIGLMFLMLGCALPIYNKYWPLFVLFFYILSPIPYCIARRLVDDTDAMSNACKELAIFLTTGIVVSAFGLPIVFARAHLIEWGACALVLTGNTVIFATILGFFLVFGSNDDFSWQQWX
>ENST00000518192.LEPROTL1.431 | PROT | QUERY
----------R-----HARSAGVCLRARDGGRPAGALISLSFGGAIGLMFLMLGCALPIYNQYWPLFVLFFYILSPIPYCIARRLVDDTDAMSNACKELAIFLTTGIVVSAFGLPVVFARAHLIEWGACALVLTGNTVIFATILGFFLVFGSNDDFSWQQWX
>ENST00000518192.LEPROTL1.431 | CODON | REFERENCE
ATG AAG TGG TTG GCG GTC AGT GAG GAG TCC CGT CGC GAC TTG GAC GGC CAC ACA CGT TTT GCA GGA GTT TGC ATC CGA GAG AGA --- --- --- --- --- --- --- GCT TTG ATT AGT TTG TCC TTT GGA GGA GCA ATC GGA CTG ATG TTT TTG ATG CTT GGA TGT GCC CTT CCA ATA TAC AAC AAA TAC TGG CCC CTC TTT GTT CTA TTT TTT TAC ATC CTT TCA CCT ATT CCA TAC TGC ATA GCA AGA AGA TTA GTG GAT GAT ACA GAT GCT ATG AGT AAC GCT TGT AAG GAA CTT GCC ATC TTT CTT ACA ACG GGC ATT GTC GTG TCA GCT TTT GGA CTC CCT ATT GTA TTT GCC AGA GCA CAT CTG ATT GAG TGG GGA GCT TGT GCA CTT GTT CTC ACA GGA AAC ACA GTC ATC TTT GCA ACT ATA CTA GGC TTT TTC TTG GTC TTT GGA AGC AAT GAC GAC TTC AGC TGG CAG CAG TGG XXX
>ENST00000518192.LEPROTL1.431 | CODON | QUERY
--- --- --- --- --- --- --- --- --- --- CGT --- --- --- --- --- CAC GCA CGT TCC GCA GGA GTT TGT CTG CGA GCG AGA GAC GGG GGC CGG CCG GCC GGG GCT TTG ATT AGT TTG TCC TTT GGA GGA GCA ATT GGG CTG ATG TTT TTG ATG CTT GGA TGT GCC CTT CCA ATA TAC AAC CAA TAC TGG CCC CTC TTT GTT CTC TTT TTT TAC ATC CTT TCA CCT ATT CCA TAC TGC ATA GCC AGA AGA TTA GTG GAT GAT ACA GAT GCT ATG AGT AAT GCT TGT AAG GAA CTT GCC ATA TTT CTT ACA ACA GGC ATT GTT GTC TCA GCT TTT GGA CTC CCT GTT GTA TTT GCC AGA GCA CAT CTG ATT GAG TGG GGA GCT TGT GCA CTT GTT CTC ACA GGA AAC ACA GTC ATC TTT GCA ACT ATA CTG GGC TTT TTC TTG GTC TTT GGA AGC AAT GAC GAC TTC AGC TGG CAG CAG TGG XXX
>ENST00000518192.LEPROTL1 | 0 | 431 | reference_exon
ATGAAGTGGTTGGCGGTCAGTGAGGAGTCCCGTCGCGACTTGGACGGCCACACACGTT
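To check whether an existing prot.fasta is affected, something along these lines might help. It is only a sketch: it assumes every header has the form ">ID | TYPE | ..." as in the output above, and "non_prot_headers" is a made-up helper, not part of TOGA. The sequences in the sample are shortened for the example.

```python
import io


def non_prot_headers(lines):
    """Return FASTA headers whose second '|'-separated field is not PROT."""
    bad = []
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, nothing to check
        # Split the header into fields and compare the record type exactly,
        # so a gene name such as LEPROTL1 cannot trigger a false match.
        fields = [f.strip() for f in line[1:].split("|")]
        if len(fields) < 2 or fields[1] != "PROT":
            bad.append(line.rstrip("\n"))
    return bad


# Tiny in-memory sample (assumption: same header layout as the real file).
sample = io.StringIO(
    ">ENST00000518192.LEPROTL1.431 | PROT | REFERENCE\n"
    "MKW\n"
    ">ENST00000518192.LEPROTL1.431 | CODON | REFERENCE\n"
    "ATG AAG TGG\n"
)
print(non_prot_headers(sample))
```

For a real file, an open file handle can be passed instead of the sample, since the function only iterates over lines. An empty list would mean no misplaced records were found.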
Thx for catching that bug.
Dear Olga,
Thanks a lot for the good catch. Indeed, "PROT" -> "| PROT |" was the solution...
Hey Bogdan,
Thanks again for the tool. We continue learning about TOGA and exploring the output.
A quick note: there seems to be a small bug in how the output is parsed. The match on the string PROT also fires when the protein name itself contains PROT (e.g. LEPROTL1). As a result, entries like these end up in prot.fasta:
Thanks! Olga