isovic / graphmap

GraphMap - A highly sensitive and accurate mapper for long, error-prone reads http://www.nature.com/ncomms/2016/160415/ncomms11307/full/ncomms11307.html Note: This was the original repository which will no longer be officially maintained. Please use the new official repository here:
https://github.com/lbcb-sci/graphmap2
MIT License
178 stars 44 forks source link

Free end gap alignments possible? #18

Closed rrwick closed 8 years ago

rrwick commented 8 years ago

I'd like to use GraphMap to align PacBio reads to contigs in an end gap free manner, as described here. However, I'm getting cases where my alignment is being soft-clipped before the end of the contig.

Here's my attempt to illustrate with a little example. Imagine the top sequence is my PacBio read and the bottom sequence is my contig:

             CTTTGGGCAAAC
             ||||||
ACGCGCATACATTCTTTGGCC
CIGAR: 6M6S

What I'd instead like is this, where the alignment always extends to the end of the reference contig:

             CTTTGGGCAAAC
             ||||||||
ACGCGCATACATTCTTTGGCC
CIGAR: 8M4S

I've tried lots of GraphMap parameters but can't seem to make it behave this way. Is it possible? Thanks!

rrwick commented 8 years ago

I've done a bit more investigation and have some more info. I was running my alignment using the anchorgotoh alignment approach. I think (correct me if I'm wrong), the alignment is only extending between the anchors and that's why the alignment was extending as far as it could. By allowing for smaller anchors (e.g. -A 6) I could get alignments closer to the end of my contig, but still not all the way.

I then thought that the gotoh (non-anchored) alignment might work better, but that didn't give me any alignment at all. A very low graph kmer (-k 3) produced an alignment, but an absurdly bad one, not the overlap alignment I wanted.

I then tried something a bit crazy which seemed to work. In the local_realignment_wrappers.cc file I changed some configs to seqan::AlignConfig<true, true, true, true> (lines 265, 270, 452 and 457). This was based on the seqan example here for overlap alignments. Now when I run in gotoh alignment mode, I get what I want: the alignment extends to the end of the contig and soft-clips after that.

Is there an better way of achieving this result without modifying the code?

rrwick commented 8 years ago

one_contig.fasta.txt one_read.fastq.txt

Here are the files I used to investigate. What I wanted was this overlap alignment: no overlap for the first 4112 bases of the contig. Then the start of the read aligns until the end of the contig. Then no alignment for the remaining 735 bp of the read:

read:                    NNNNNNNNNNNNNN
contig:  NNNNNNNNNNNNNNNNNNNN
         |    4112 bp   |    | 735 bp | 

Unmodified GraphMap (anchorgotoh) gave me this result which is missing about 20 bp of the alignment at the end of the contig: S1_80_10960 0 NODE_13_length_4288_cov_0.999297 4113 40 5M1D2M1D16M1I10M3I4M1I4M1I14M1I8M1I4M1D6M1D6M2I11M1I10M1I9M1D36M1I8M757S * 0 0 ... ... NM:i:19 AS:i:600 H0:i:1 ZE:f:1.62545e-38 ZF:f:0.139986 ZQ:i:923 ZR:i:4288

Unmodified GraphMap (gotoh) gave me no alignment: S1_80_10960 4 * 0 255 * * 0 0 ... ... NM:i:-1 AS:i:-923 H0:i:0 ZE:f:inf ZF:f:0 ZQ:i:923 ZR:i:0

And my modified GraphMap (gotoh) gave me the full alignment I wanted: S1_80_10960 0 NODE_13_length_4288_cov_0.999297 4113 40 5M1D2M1D16M1I10M3I4M1I4M1I14M1I8M1I4M1D6M1D6M2I11M1I10M1I9M1D36M1I7M1I3M1I6M1I6M1I4M735S * 0 0 ... ... NM:i:483 AS:i:640 H0:i:1 ZE:f:2.71283e-41 ZF:f:0.105068 ZQ:i:923 ZR:i:4288

rrwick commented 8 years ago

Alas, my modification does not solve the problem in all cases. I found another example where the unmodified GraphMap (using anchorgotoh) gave me a good alignment with no end gap but my modified GraphMap (using gotoh) gave me an end gap.

So I'm back to square one and my initial question: is there any way to make GraphMap perform an alignment that always extends to the ends of the aligned sequences?

isovic commented 8 years ago

Hi rrwick!

I'd like to use GraphMap to align PacBio reads to contigs in an end gap free manner, as described here. However, I'm getting cases where my alignment is being soft-clipped before the end of the contig. In the anchored mode, GraphMap by default does extend alignments from the last/first anchor to the corresponding ends of the read. In particular, what it does is:

  • align each anchor using global alignment
  • align in-between anchors using global alignment
  • align the part of the read after the last anchor (let's call this 'back' in the continuation) using semiglobal alignment (gaps at the end are not penalized)
  • align the part of the read before the first anchor (let's call this 'front' in the continuation) using semiglobal alignment (similar to the above)

Now, there are several cases when there might be soft clipping:

As for the non-anchored modes (gotoh and myers), although those might work for your purpose in some cases, they actually don't give you any guarantee for aligning such overhangs, as there might be an even better scoring alignment which doesn't span the gap.

Thank you for the sample data, I will take a look right now and let you know what I have found shortly!

P.S. For your purpose, would never clipping the alignment (even though there might be insertions at the front/back or a large edit distance) be what you're looking for? I could possibly, for now, create another branch which allows this as an option, and at a later point merge it into the master branch.

Best regards, Ivan.

isovic commented 8 years ago

Hi rrwick,

I created another branch which should have a fix for your issue, could you check it out and give it a test? I tested it on the data you shared and a handful of manually generated samples, it seems to work, but having your feedback would help a lot.

git pull
git checkout extanchorend
make -j

This should work for both -a anchor and -a anchorgotoh.

Please let me know how it performs. If it works well, I'll include the fix in the next release.

Best regards, Ivan.

rrwick commented 8 years ago

Yes, that definitely helped. Thanks so much for the quick response!

I see that there are still cases where GraphMap can still generate local alignments, where the middle of a read aligns to the middle of a contig with lots of soft clipping on each side. Here's an example where anchorgotoh produces 926S7M1D12M1I3M2I6M1I5M2D7M232S: one_read.fastq.txt one_contig.fasta.txt (Github only allows me to attach files if I make them end in '.txt')

However, this case is a spurious alignment where the alignment really shouldn't be extended to the sequence ends, so I can work with that.

Thank you again!

rrwick commented 8 years ago

One other thing: the binary I built from the extanchorend branch was crashing with a segfault sometimes. I found that if I build the index first (with -I) and then ran it, it did not crash. so I have a workaround.

The crash does not always happen in the same place, but it does happen after a fairly consistent amount of time. I can make it happen with these files, where on my computer it crashes around read 160-170: contig.fasta.txt reads.fastq.txt

It crashes for both anchor and anchorgotoh, but does not crash for myers or gotoh. And the number of threads seems to have no effect on the crash. But again, I have to delete the index files to make it crash - if they exist when GraphMap starts it runs fine.

I'm on a Mac, OSX 10.11.2, and I built GraphMap with GCC 5.3.0. If I can provide any more useful information, let me know.

isovic commented 8 years ago

Hi rrwick,

thank you for the feedback! I can't seem to reproduce the issue just by running GraphMap on these sample data, but I'm going to profile this thoroughly and let you know what I find. Which commit of GraphMap have you used to generate the old index files? If it was an older commit, the indexes might not be completely compatible, so re-indexing the references indeed might help. When the index format changes, GraphMap should automatically detect it and rebuild the index, unless I forgot to mark the change :-)

Best regards, Ivan.

rrwick commented 8 years ago

I was generating the index with the same version of GraphMap, not an older one. The commit I used was 0253c6c, the most recent commit in the extanchorend branch.

What happened was this: the first time I ran GraphMap it automatically built the index file, started aligning reads and then crashed a couple seconds later. I tried running it again (this time the index already existed from the first run) and it completed successfully. A bit of experimenting revealed that when the index was absent and GraphMap had to build it, it crashed. It always crashed, but not at the same point. Sometimes it would align 162 reads before crashing, sometimes 167, etc. But if the index was built before GraphMap started, it ran fine.

I didn't experience this issue when I built from a7eb856, the most current commit in the master branch.

isovic commented 8 years ago

Hi rrwick,

would you mind giving it a shot now? I think the crash should be fixed in the latest commit on this branch.

Best regards, Ivan.

isovic commented 8 years ago

There was a bug introduced in this branch which had alignments in the beginning of some reads skipped from extension. If you pull the latest commit of the extanchorend branch, it should be fixed.

Let me know if it crashes still, Best regards, Ivan.

rrwick commented 8 years ago

Yes, that fixed the crash! Thanks!