isovic / graphmap

GraphMap - A highly sensitive and accurate mapper for long, error-prone reads http://www.nature.com/ncomms/2016/160415/ncomms11307/full/ncomms11307.html Note: This is the original repository, which is no longer officially maintained. Please use the new official repository here:
https://github.com/lbcb-sci/graphmap2
MIT License

GraphMap mapping beyond the ends of reference sequences #71

Closed gringer closed 7 years ago

gringer commented 7 years ago

I'm a little confused as to what's going on here. I'm trying to map contigs from a different assembly (generated from the same input nanopore reads) to the 33 longest contigs in an assembly, and I'm getting mappings that extend past the end of the contig. Here are a couple of illustrative images. First, the assembly mapped to itself:

t3_contigs_2017-May-17.pdf

And next an assembly with slightly different assembly parameters mapped to the same contigs:

t3_2_contigs_2017-May-17.pdf

The text on each line appears at the end of the contig, so there are at least 6 mappings that extend past the end. I thought something along the lines of "okay, so it's got a good mapping to the start, and just extends the read as far as it can". But then I looked at the mpileup results:

tig00000128     204523  A       3       ,+1t.T-3AGA     ~~~
tig00000128     204524  A       3       ,T*     ~~~
tig00000128     204525  G       3       t.-1A*  ~~~
tig00000128     204526  A       3       ,+1t**  ~~~
tig00000128     204527  C       3       ,.-1G.-2GA      ~~~
tig00000128     204528  G       3       t+1c**  ~~~
tig00000128     204529  A       3       ,+2ac.* ~~~
tig00000128     204530  T       3       ,..     ~~~
tig00000128     204531  A       3       gT.$    ~~~
tig00000128     204532  G       2       ,.$     ~~
tig00000128     204533  N       1       g       ~
tig00000128     204534  N       1       t+1t    ~
tig00000128     204535  N       1       c       ~
tig00000128     204536  N       1       c       ~
tig00000128     204537  N       1       t       ~
tig00000128     204538  N       1       t       ~
tig00000128     204539  N       1       t       ~
tig00000128     204540  N       1       t       ~
tig00000128     204541  N       1       c-1n    ~
tig00000128     204542  N       1       *       ~
tig00000128     204543  N       1       a       ~
tig00000128     204544  N       1       a       ~
tig00000128     204545  N       1       c       ~
tig00000128     204546  N       1       c       ~
tig00000128     204547  N       1       a       ~
tig00000128     204548  N       1       a       ~
tig00000128     204549  N       1       a-1n    ~

Those Ns are where the contig finishes. The odd thing is that there are INDELs defined past the end of the contig, which makes no sense to me.
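For anyone wanting to spot the same symptom in their own pileup, here is a minimal Python sketch. It assumes the whitespace-separated column layout shown above (contig, position, reference base, depth, read bases, qualities) and treats any covered row whose reference base is N as suspicious. The function names are made up for illustration, and a reference that genuinely contains N bases would also be flagged, so treat the output as a hint rather than proof.

```python
import re

# An indel in the mpileup read-bases column looks like +<len><seq> or -<len><seq>.
INDEL = re.compile(r"[+-]\d+")


def parse_mpileup_line(line):
    """Split one mpileup row into (contig, pos, ref_base, depth, bases)."""
    contig, pos, ref_base, depth, bases = line.split()[:5]
    return contig, int(pos), ref_base, int(depth), bases


def suspicious_rows(lines):
    """Return (contig, pos, has_indel) for covered rows whose reference base is N.

    Assumption: the reference itself has no real N bases, so an N with
    coverage means a read was aligned past the end of the contig.
    """
    flagged = []
    for line in lines:
        if not line.strip():
            continue
        contig, pos, ref_base, depth, bases = parse_mpileup_line(line)
        if ref_base.upper() == "N" and depth > 0:
            flagged.append((contig, pos, bool(INDEL.search(bases))))
    return flagged


if __name__ == "__main__":
    # Four rows copied from the pileup excerpt above.
    sample = [
        "tig00000128     204532  G       2       ,.$     ~~",
        "tig00000128     204533  N       1       g       ~",
        "tig00000128     204534  N       1       t+1t    ~",
        "tig00000128     204541  N       1       c-1n    ~",
    ]
    for row in suspicious_rows(sample):
        print(row)
```

Run against the rows above, this reports 204533 without an indel and 204534 and 204541 with one, which matches the oddity described: indels called at positions where no reference sequence exists.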

isovic commented 7 years ago

Hi David!

Thanks for the report! This bug has the same root cause as Issue #60, which I just closed. Try doing a git pull, then run make modules and make. The problem was a bug introduced in the last release that caused a faulty check of reference bounds. In your case, it happened only for reverse-complemented sequences. I ran a synthetic test of your situation locally and managed to reproduce the problem; it is now fixed in version 0.5.2.
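One way to confirm the fix after rebuilding might be to scan the SAM output for alignments whose reference end exceeds the contig length. The sketch below is not GraphMap code. It is a standalone check that assumes standard SAM conventions: @SQ header lines carrying SN and LN tags, a 1-based POS, and CIGAR operations M, D, N, = and X consuming reference bases. The function names and the two example reads are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def reference_span(cigar):
    """Number of reference bases covered by a CIGAR string."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)


def overhanging_alignments(sam_lines):
    """Return (qname, rname, ref_end, ref_len, is_reverse) for every alignment
    whose last aligned reference position lies beyond the contig length."""
    lengths = {}
    hits = []
    for line in sam_lines:
        line = line.rstrip("\n")
        if line.startswith("@"):
            if line.startswith("@SQ"):
                tags = dict(f.split(":", 1) for f in line.split("\t")[1:])
                lengths[tags["SN"]] = int(tags["LN"])
            continue
        fields = line.split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        qname, flag, rname = fields[0], int(fields[1]), fields[2]
        pos, cigar = int(fields[3]), fields[5]
        if rname == "*" or cigar == "*" or rname not in lengths:
            continue  # unmapped read, or contig missing from the header
        ref_end = pos + reference_span(cigar) - 1  # POS is 1-based
        if ref_end > lengths[rname]:
            hits.append((qname, rname, ref_end, lengths[rname], bool(flag & 16)))
    return hits


if __name__ == "__main__":
    # Made-up records for illustration only: a 100 bp contig and two reads.
    sam = [
        "@SQ\tSN:ctg\tLN:100",
        "\t".join(["read1", "0", "ctg", "91", "60", "10M", "*", "0", "0", "*", "*"]),
        "\t".join(["read2", "16", "ctg", "91", "60", "5M2I10M", "*", "0", "0", "*", "*"]),
    ]
    for hit in overhanging_alignments(sam):
        print(hit)
```

The last element of each reported tuple is the reverse-strand flag (0x10), which makes it easy to see whether any overhanging alignments are limited to reverse-complemented sequences.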

Thanks again!

Best regards, Ivan.