Open ozgegizlenci opened 1 year ago
Louise's solution: the last exons identified by chexons will be checked backward through the exon. Nexons will check whether the identified exon has an unannotated splice acceptor within a flexibility range, using polyA/T stretches for that check. If more than 60% of the sequence consists of this pattern, it will be considered a polyA sequence and the identified exon will be removed. Exons that are not removed will be checked for whether they have an annotated splice acceptor or do not look like a polyA sequence. The remaining exons will be accepted per transcript structure.
[x] New solution is implemented by @LouiseMatheson
[ ] A comparison of splice coordinates between the previous and new results will be performed for example genes.
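The polyA/T filter described above could be sketched roughly as follows. This is only an illustration, not the code in nexons: the function names `looks_like_polya` and `keep_last_exon` are hypothetical, and treating "polyA/T stretch" as "the fraction of A or the fraction of T in the window, whichever is larger" is an assumption. Only the 60% cutoff comes from the description.

```python
def looks_like_polya(window_seq: str, threshold: float = 0.6) -> bool:
    """Return True if more than `threshold` of the window is A, or is T.

    Assumption: a polyA/T stretch is approximated by single-base composition.
    """
    seq = window_seq.upper()
    if not seq:
        return False
    frac_a = seq.count("A") / len(seq)
    frac_t = seq.count("T") / len(seq)
    return max(frac_a, frac_t) > threshold


def keep_last_exon(has_annotated_acceptor: bool, window_seq: str) -> bool:
    """Keep the exon if its acceptor is annotated or the window is not polyA-like."""
    return has_annotated_acceptor or not looks_like_polya(window_seq)
```

With this sketch, an exon with an unannotated acceptor is dropped only when the sequence in the flexibility window is mostly A or mostly T.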
Some of the splice coordinates jump back to the beginning regardless of the strand direction of the gene. We need to check whether the extracted coordinates always increase for + strand genes and always decrease for - strand genes.
Pkm example:
Variant103 of Pkm jumps back just once at the end. We can remove the last coordinate/exon.
Variant103 ENSMUSG00000032294.17 59656649:59665200-59665366:59665863-59665954:59668549-59668680:59668913-59669099:59670467-59670737:59671575-59671729:59671921-59672073:59675568-59675734:59678044-59678225:59678728-59679373:59659611
In Variant120, the splice junction coordinate jumps back and then carries on for another exon. We can remove them altogether:
Variant120 ENSMUSG00000032294.17 59656649:59665200-59665366:59665863-59665954:59668549-59668680:59668913-59669099:59670467-59670737:59671575-59671729:59671921-59672073:59675568-59675734:59678044-59678225:59678728-59679373:59659611-59659575:59659611
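A direction check on the coordinate strings could look something like the sketch below. It assumes the format seen in the examples above (fields separated by `:`, with `-` joining the two coordinates inside a field) and that the strand is supplied by the caller as `"+"` or `"-"`. The function names are hypothetical, and how a truncated transcript should finally be written out is left open.

```python
def first_backward_jump(junctions: str, strand: str):
    """Return the index of the first ':'-separated field that moves against
    the strand direction, or None if every coordinate moves the expected way."""
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    sign = 1 if strand == "+" else -1
    previous = None
    for index, field in enumerate(junctions.split(":")):
        for position in (int(p) for p in field.split("-")):
            # A negative signed step means the coordinate jumped backward.
            if previous is not None and sign * (position - previous) < 0:
                return index
            previous = position
    return None


def trim_backward_jump(junctions: str, strand: str) -> str:
    """Drop the first offending field and everything after it."""
    index = first_backward_jump(junctions, strand)
    if index is None:
        return junctions
    return ":".join(junctions.split(":")[:index])
```

Applied to the two Pkm variants with the strand taken as `+` (the coordinates ascend up to the jump), both strings would be cut at the field starting with 59659611, so Variant103 loses one trailing field and Variant120 loses two.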