Open ekg opened 6 years ago
I think the right way to handle inversions in surjection would be to describe them as sequence replacements (maybe adjacent deletion and insertion operations?). We should still produce an answer in SAM format even when the reads are relatively long and contain inversions.
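To make the "adjacent deletion and insertion" idea concrete, here is a minimal sketch of what such a CIGAR could look like. It is only an illustration of the proposal, not how vg implements surjection; the function names `inversion_as_replacement_cigar` and `cigar_lengths` are hypothetical, and the sketch assumes the inverted segment spans the same number of bases in the read and on the reference path.

```python
import re
from typing import Tuple


def inversion_as_replacement_cigar(left_match: int, inverted_len: int, right_match: int) -> str:
    """Describe an inverted segment as a deletion of the reference bases
    followed by an insertion of the same number of read bases.

    Hypothetical helper: assumes matching flanks on both sides of the inversion.
    """
    if min(left_match, inverted_len, right_match) <= 0:
        raise ValueError("all segment lengths must be positive")
    return f"{left_match}M{inverted_len}D{inverted_len}I{right_match}M"


def cigar_lengths(cigar: str) -> Tuple[int, int]:
    """Return (query_length, reference_length) consumed by a CIGAR string."""
    query = ref = 0
    for count, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(count)
        if op in "MIS=X":  # operations that consume read bases
            query += n
        if op in "MDN=X":  # operations that consume reference bases
            ref += n
    return query, ref


cigar = inversion_as_replacement_cigar(100, 50, 120)
print(cigar, cigar_lengths(cigar))
```

Because `D` consumes only reference bases and `I` consumes only read bases, the record stays a single linear SAM alignment, at the cost of losing the information that the inserted sequence is the reverse complement of the deleted one.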
Would it still break the old surjection algorithm? I think it's still available as a command line option.
Yeah I think so. We need to tackle this issue up front somehow.
On Tue, Jun 12, 2018, 21:49 Jordan Eizenga wrote:
> Would it still break the old surjection algorithm? I think it's still available as a command line option.
Personally, I think it would really help to have some way of checking whether a path was "chunked" in this way besides having to go back to the graph to see if it's an adjacency that's present in the graph every single time. Then we could just break the surjection into subproblems and recombine them.
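As a rough sketch of the "break into subproblems" part, the snippet below splits an alignment path wherever two consecutive steps are not joined by an edge. Note that this is exactly the go-back-to-the-graph check described above, so it only illustrates the splitting, not a cheaper way of detecting chunk boundaries. Everything here is hypothetical: `split_at_missing_edges` is not a vg function, steps are plain hashable node identifiers, and edges are simplified to directed pairs rather than vg's bidirected edges.

```python
from typing import Hashable, List, Sequence, Set, Tuple


def split_at_missing_edges(
    path: Sequence[Hashable],
    edges: Set[Tuple[Hashable, Hashable]],
) -> List[List[Hashable]]:
    """Split a path into maximal runs whose consecutive steps are graph edges.

    Each returned run could then be surjected as its own subproblem.
    """
    segments: List[List[Hashable]] = []
    for step in path:
        # Extend the current run only if the previous step connects to this one.
        if segments and (segments[-1][-1], step) in edges:
            segments[-1].append(step)
        else:
            segments.append([step])
    return segments


# Toy example: there is no edge from node 2 to node 5, so the path splits there.
toy_edges = {(1, 2), (5, 6)}
print(split_at_missing_edges([1, 2, 5, 6], toy_edges))
```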
Is there a good way to represent this kind of alignment in SAM/BAM once we can surject it well?
We were aligning a paired-end sequenced ancient DNA sample, where the reads were merged when they appeared to overlap in order to reduce the sequencing error rate and improve alignment accuracy in the face of DNA damage.
As a result some reads are longer than 256bp, which is our threshold for chunked alignment (given by the `-w` parameter to `vg map`). Sometimes they represent inversions, and when this happens surjection fails with a terrifying error.
If the mapper is being run in surjection mode, maybe it should force non-chunked alignment or invalidate the surjection when the mapped read represents an inversion or other non-SAM-representable alignment.
It could throw up warnings to set the `-w` size long enough for the given data and re-run.
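A minimal sketch of the kind of warning check suggested here, written as a standalone pre-flight script rather than as part of vg. The helper names are hypothetical, reads are assumed to be available as `(name, sequence)` pairs, and the default threshold of 256 is the value mentioned above.

```python
from typing import List, Optional, Sequence, Tuple

Read = Tuple[str, str]  # (read name, sequence)


def reads_over_chunk_threshold(reads: Sequence[Read], threshold: int = 256) -> List[str]:
    """Return the names of reads longer than the chunking threshold."""
    return [name for name, seq in reads if len(seq) > threshold]


def chunking_warning(reads: Sequence[Read], threshold: int = 256) -> Optional[str]:
    """Build a warning message if any read would be aligned in chunks.

    Returns None when every read fits within the threshold.
    """
    long_reads = reads_over_chunk_threshold(reads, threshold)
    if not long_reads:
        return None
    longest = max(len(seq) for _, seq in reads)
    return (
        f"{len(long_reads)} read(s) exceed the chunking threshold of {threshold} bp "
        f"(longest is {longest} bp); consider re-running with -w set to at least {longest}."
    )


example_reads = [("r1", "A" * 100), ("r2", "A" * 300)]
print(chunking_warning(example_reads))
```

Running a check like this before mapping in surjection mode would flag data sets, such as merged paired-end reads, where the threshold needs to be raised.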