rrwick / Unicycler

hybrid assembly pipeline for bacterial genomes
GNU General Public License v3.0
553 stars 131 forks source link

unicycler_align: Output paths [feature request] #130

Open sjackman opened 6 years ago

sjackman commented 6 years ago

Is it possible to use a one of the Python functions of Unicycler to convert the SAM file output by unicycler_align to paths? Paths in GFA1 or GFA2 format would be awesome, but Unicycler formatted-paths would be great too.

rrwick commented 6 years ago

It may be possible, but unicycler_align doesn't align a long read to a graph, just to a single linear reference. Unicycler's finding-a-path-through-the-graph stuff comes later, and is conducted using multiple reads, not one read at a time.

unicycler_align is essentially just a long read aligner that:

  1. Aligns semi-globally - free end gaps in both reference and query.
  2. Has some tweaks to better handle aligning to references which terminate at repeat sequences (typical of short-read assembly contigs).
  3. Is quite slow :smile:

So I'm not sure if the concept of GFA path really applies here. Or am I misunderstanding you?

Ryan

sjackman commented 6 years ago

I thought (but perhaps I imagined this) that the first aligned segment of a read could align to somewhere in the middle of a contig to the end of that contig, then the following middle aligned segments of a read had to align the complete contig (beginning to end), then the final aligned segment of a read had to start at the beginning of a contig and could terminate anywhere in that contig. Looking now at https://github.com/rrwick/Unicycler/blob/master/docs/unicycler-align.md, it looks as though I did entirely make this up in my imagination.

sjackman commented 6 years ago

Want to create this tool? 😉

sjackman commented 6 years ago

On second thought, I wasn't totally off the mark. Semi-global alignment (aka glocal) alignment allows up to two ends to be clipped without penalty. Let's say a long read aligns (123) 1+, 2+, 3+ (456). Since an internal segment of the read aligns to 2+, the read is clipped on both sides, so there's no free clipping left for the contig, so the full length of the contig 2+ must align from beginning to end, without any clipping, which is exactly the behaviour that I want.

Similarly, the alignment to 1+ is clipped on one side of the contig, and one side of the read, which is the two allowable clippings that you get for free.

One difference from what I described above is that it doesn't check for an edge in the assembly graph between 1+ and 2+.