pmelsted / pizzly

Fast fusion detection using kallisto
BSD 2-Clause "Simplified" License
80 stars, 10 forks

Non-printable characters (including ASCII 0) in fusion output FASTA sequences #11

Closed kreil closed 7 years ago

kreil commented 7 years ago

In a recent run, we got about ten instances of non-printable characters in the fusion output FASTA sequences. I rechecked: they cannot come from the sequences in the original FASTA reference file.

Is anyone else seeing this? Any ideas? I paste an example below. The non-printable characters are rendered as "^X", where X is a printable character. "^@", for instance, stands for ASCII character zero.

ENST00000409917_0:788ENST00000370321-10:1043 TCCTCGCAGGACCTCATGAGTAAGCTGTGGCGGCGTGGGAGCACCTCTGGGGCTATGGAGGCCCCTGAGCCGGGAGAAGCCCTGGAGTTGAGCCTGGCGGGTGCCCATGGCCATGGAGTGCACAAGAAAAAACACAAGAAGCACAAGAAGAAACACAAGAAGAAACACCATCAGGAAGAAGACGCCGGGCCCACGCAGCCGTCCCCTGCCAAGCCTCAGCTCAAACTCAAAATCAAGCTTGGGGGACAAGTCCTGGGGACCAAGAGTGTTCCTACCTTCACTGTGATCCCAGAGGGGCCTCGCTCACCCTCTCCCCTTATGGTTGTGGATAATGAAGAGGAACCTATGGAAGGAGTCCCCCTTGAGCAGTACCGTGCCTGGCTGGATGAAGACAGTAATCTCTCTCCCTCTCCACTTCGGGACCTATCAGGAGGGTTAGGGGGTCAGGAGGAAGAGGAGGAACAGAGGTGGCTGGATGCCCTGGAGAAGGGGGAGCTGGATGACAATGGAGACCTCAAGAAGGAGATCAATGAGCGGCTGCTTACTGCTCGACAGAGGAGATGCTGCTGAAGCGCGAGGAGCGGGCGCGGAAGCGGCGGCTCCAGGCGGCGCGGCGGGCAGAAGAGCACAAGAACCAGACTATCGAGCGCCTCACCAAGACTGCGGCGACCAGTGGGCGGGGAGGCCGGGGGGGCGCACGGGGCGAGCGGCGGGGAGGGCGGGCTGCGGCTCCGGCCCCCATGGTGCGCTACTGCAGCGGAGCACAGGGTTCCACCCTTTCCTTCCCA^@^@1^F^@^@ ^@^@^@^@CGCAAGGGCTGTGGCCCTTTTCCCACCCCCTAGCGCCGCTGGGCCTGCAGGTCTCTGTCGAGCAGCGGACGCCGGTCTCTGTTCCGCAGGATGGGGTTTGTTAAAGTTGTTAAGAATAAGGCCTACTTTAAGAGATACCAAGTGAAATTTAGAAGACGACGAGAGGGTAAAACTGATTATTATGCTCGGAAACGCTTGGTGATACAAGATAAAAATAAATACAACACACCCAAATACAGGATGATAGTTCGTGTGACAAACAGAGATATCATTTGTCAGATTGCTTATGCCCGTATAGAGGGGGATATGATAGTCTGCGCAGCGTATGCACACGAACTGCCAAAATATGGTGTGAAGGTTGGCCTGACAAATTATGCTGCAGCATATTGTACTGGCCTGCTGCTGGCCCGCAGGCTTCTCAATAGGTTTGGCATGGACAAGATCTATGAAGGCCAAGTGGAGGTGACTGGTGATGAATACAATGTGGAAAGCATTGATGGTCAGCCAGGTGCCTTCACCTGCTATTTGGATGCAGGCCTTGCCAGAACTACCACTGGCAATAAAGTTTTTGGTGCCCTGAAGGGAGCTGTGGATGGAGGCTTGTCTATCCCTCACAGTACCAAACGATTCCCTGGTTATGATTCTGAAAGCAAGGAATTTAATGCAGAAGTACATCGGAAGCACATCATGGGCCAGAATGTTGCAGATTACATGCGCTACTTAATGGAAGAAGATGAAGATGCTTACAAGAAACAGTTCTCTCAATACATAAAGAACAGCGTAACTCCAGACATGATGGAGGAGATGTATAAGAAAGCTCATGCTGCTATACGAGAGAATCCAGTCTATGAAAAGAAGCCCAAGAAAGAAGTTAAAAAGAAGAGGTGGAACCGTCCCAAAATGTCCCTTGCTCAGAAGAAGGATCGGGTAGCTCAAAAGAAGGCAAGCTTCC
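Records like the one above can be located automatically. The following is a minimal Python sketch, not part of pizzly: it assumes a standard FASTA layout (header lines starting with ">") and treats anything other than A, C, G, T or N as suspicious. The function name `find_bad_records` is hypothetical.

```python
# Characters accepted in a sequence line (assumption: plain ACGTN, either case).
VALID = set("ACGTNacgtn")

def find_bad_records(lines):
    """Return (header, offending_characters) pairs for FASTA records
    whose sequence lines contain characters outside VALID."""
    bad = []
    header = None
    offenders = set()
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new record starts: report the previous one if it was bad.
            if header is not None and offenders:
                bad.append((header, sorted(offenders)))
            header = line[1:]
            offenders = set()
        elif header is not None:
            offenders.update(ch for ch in line if ch not in VALID)
    # Report the last record in the input.
    if header is not None and offenders:
        bad.append((header, sorted(offenders)))
    return bad
```

To run it on a file, open the file with `encoding="latin-1"` so that every byte, including ASCII 0, is decoded to exactly one character, then pass the file object to `find_bad_records` and print each header with `repr()` of the offending characters.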

pmelsted commented 7 years ago

This is a bug. From the name ENST00000409917_0:788_ENST00000370321_-10:1043 you can see that the second junction point is -10, which is 10 bp before the start of the transcript, so it is definitely a bug.
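The negative offset can also be checked mechanically in affected output. This is a minimal Python sketch, assuming that fusion names follow the pattern seen in the name quoted above (transcript, two colon-separated numbers, transcript, two colon-separated numbers, joined by underscores). That pattern is inferred from this single example, not from pizzly's documentation, and `negative_coordinates` is a hypothetical helper.

```python
import re

# Assumed pattern: <transcriptA>_<n>:<n>_<transcriptB>_<n>:<n>,
# where each number may carry a leading minus sign.
NAME_RE = re.compile(r"^(\S+?)_(-?\d+):(-?\d+)_(\S+?)_(-?\d+):(-?\d+)$")

def negative_coordinates(name):
    """Return the negative coordinates found in a fusion name.

    Raises ValueError if the name does not match the assumed pattern.
    """
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognised fusion name: {name!r}")
    # Groups 2, 3, 5 and 6 hold the four numeric fields.
    coords = [int(m.group(i)) for i in (2, 3, 5, 6)]
    return [c for c in coords if c < 0]
```

Any name for which this returns a non-empty list points before the start of a transcript and is worth treating as suspect until the output is regenerated with a fixed version.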

kmhernan commented 7 years ago

@kreil Yes, I am seeing this in my unfiltered FASTA file, with the same kind of negative position offset in the transcript that @pmelsted mentioned.

pmelsted commented 7 years ago

This bug was fixed in the latest version, 0.37.3.