natir / fpa

Filter of Pairwise Alignement
MIT License

filtering low-identity regions of alignments #7

Open ekg opened 5 years ago

ekg commented 5 years ago

I'd like to remove parts of alignments that have low identity. The idea would be to take a longer alignment and break it into multiple alignments, removing regions where the identity drops below some threshold over a window of a given length. This would have to work on top of alignments with cigar strings.
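A rough sketch of how such a filter might work, in Python for brevity. Everything here is an assumption for illustration and not fpa code: the function names (`expand_cigar`, `split_low_identity`), the choice of a window centred on each alignment column, and the requirement that the CIGAR uses the extended `=`/`X` operators (a plain `M` carries no identity information, so it would count as a non-match here). Kept runs are trimmed so that each output segment starts and ends on a matching column.

```python
import re
from itertools import groupby

CIGAR_RE = re.compile(r"(\d+)([MIDX=])")
QUERY_OPS = "M=XI"   # operations that consume query bases
TARGET_OPS = "M=XD"  # operations that consume target bases


def expand_cigar(cigar):
    """Expand a CIGAR string into one operation character per alignment column."""
    return "".join(op * int(n) for n, op in CIGAR_RE.findall(cigar))


def compress(ops):
    """Run-length encode per-column operations back into a CIGAR string."""
    return "".join(f"{sum(1 for _ in g)}{op}" for op, g in groupby(ops))


def split_low_identity(cigar, q_start, t_start, window, min_identity):
    """Return (q_start, q_end, t_start, t_end, cigar) for each retained segment."""
    ops = expand_cigar(cigar)
    n = len(ops)
    # Prefix sums: matches seen, and query/target position before each column.
    matches, q_pos, t_pos = [0], [q_start], [t_start]
    for op in ops:
        matches.append(matches[-1] + (op == "="))
        q_pos.append(q_pos[-1] + (op in QUERY_OPS))
        t_pos.append(t_pos[-1] + (op in TARGET_OPS))

    def keep(i):
        # Identity of the window centred on column i, clipped to the alignment.
        start = i - window // 2
        lo, hi = max(0, start), min(n, start + window)
        return (matches[hi] - matches[lo]) / (hi - lo) >= min_identity

    segments = []
    i = 0
    while i < n:
        if not keep(i):
            i += 1
            continue
        j = i
        while j < n and keep(j):
            j += 1
        run = ops[i:j]
        a = run.find("=")
        if a != -1:  # trim so the segment begins and ends with a match
            b = run.rfind("=") + 1
            segments.append((q_pos[i + a], q_pos[i + b],
                             t_pos[i + a], t_pos[i + b],
                             compress(run[a:b])))
        i = j
    return segments
```

For a made-up alignment starting at query position 50 and target position 0 with CIGAR `15=10X15=`, calling `split_low_identity("15=10X15=", 50, 0, window=5, min_identity=0.5)` drops the ten mismatching columns and yields two segments, one on each side. Whether a centred window, a trailing window, or some other smoothing is the right choice is an open design question.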

The goal is to provide a controllable limit on the collapse between diverged regions of sequences in graphs built from PAF-based alignments. Applying this filter should give the graph more large bubbles and make it more "open", with fewer small bubbles.

natir commented 5 years ago

I don't like the idea of using a filter based on the cigar string because it is not always present in all files. But this is not a fundamental problem.

To check my understanding of the filter, for this overlap:

A 100 50 100 + B 100 0 50 10 50 50 255 cg:Z:15I10X15I

fpa must split the overlap and output its two "good" parts, or just filter out the whole overlap?

ekg commented 5 years ago

> I don't like the idea of using a filter based on the cigar string because it is not always present in all files. But this is not a fundamental problem.

I do understand you. I appreciate this is a new direction for fpa, as you haven't been working with these strings before. On my side, I can't really work without the cigar strings.

> To check my understanding of the filter, for this overlap: fpa must split the overlap and output its two "good" parts, or just filter out the whole overlap?

That'd be the idea. No worries if this isn't something trivial for you to do or useful for your work. I can implement the modifier in another context.

natir commented 5 years ago

At the moment my parser ignores the optional fields of the PAF, and it would take time to adapt it and to write a cigar string parser.
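To give a sense of the scope, here is a minimal sketch of reading the optional fields and the cigar string, written in Python for brevity. It is not fpa's parser: `PafRecord`, `parse_paf_line` and `parse_cigar` are hypothetical names, and the sketch assumes tab-separated input with the 12 mandatory PAF columns followed by SAM-style `TAG:TYPE:VALUE` fields, with the cigar stored under the `cg` tag.

```python
import re
from typing import Dict, List, NamedTuple, Tuple, Union

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


class PafRecord(NamedTuple):
    query_name: str
    query_len: int
    query_start: int
    query_end: int
    strand: str
    target_name: str
    target_len: int
    target_start: int
    target_end: int
    n_matches: int
    aln_len: int
    mapq: int
    tags: Dict[str, Union[int, float, str]]


def parse_paf_line(line: str) -> PafRecord:
    """Parse one PAF line, keeping optional TAG:TYPE:VALUE fields in a dict."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 12:
        raise ValueError(f"expected at least 12 columns, got {len(f)}")
    tags: Dict[str, Union[int, float, str]] = {}
    for field in f[12:]:
        tag, typ, value = field.split(":", 2)
        # Convert integer and float tags; keep every other type as a string.
        tags[tag] = int(value) if typ == "i" else float(value) if typ == "f" else value
    return PafRecord(f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
                     f[5], int(f[6]), int(f[7]), int(f[8]),
                     int(f[9]), int(f[10]), int(f[11]), tags)


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs, rejecting junk."""
    pairs = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in pairs) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return pairs


# Made-up record for illustration only.
line = "A\t100\t50\t100\t+\tB\t100\t0\t50\t40\t50\t255\tNM:i:10\tcg:Z:20=10X20="
rec = parse_paf_line(line)
ops = parse_cigar(rec.tags["cg"])
```

Records without a `cg` tag simply have no such key in `tags`, so a filter built on top of this could pass them through unchanged or reject them, depending on what the user wants.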

This feature seems very interesting and important to me, but it requires a lot of code to be written and unfortunately I don't have time to write it yet.

If you want to have this behaviour quickly, you may have to develop it yourself.