iqbal-lab-org / gramtools

Genome inference from a population reference graph
MIT License
92 stars 15 forks source link

Read mapping fails due to bidir_search returning an interval [a,a) #25

Closed iqbal-lab closed 8 years ago

iqbal-lab commented 8 years ago

PRG: ATCGCT5CCGCCGGCGA6G5TTTTT

BWM: $1423245223223323163544444 1423245223223323163544444$ 163544444$1423245223223323 223223323163544444$1423245 223323163544444$1423245223 23163544444$14232452232233 23223323163544444$14232452 23245223223323163544444$14 23323163544444$14232452232 245223223323163544444$1423 3163544444$142324522322332 3223323163544444$142324522 323163544444$1423245223223 3245223223323163544444$142 3323163544444$142324522322 3544444$142324522322332316 4$142324522322332316354444 423245223223323163544444$1 44$14232452232233231635444 444$1423245223223323163544 19=left 4444$142324522322332316354 44444$14232452232233231635
45223223323163544444$14232 22=right 5223223323163544444$142324 544444$1423245223223323163 63544444$14232452232233231

read which should map, but which does not:

overlap_end_allele1_and_rhs_flank GCGATTT

Here's the entry in the kmer precalc file 4 4 4 |1|19 22 22 23 |18 24 25 27 ||5 @|

[19,22) corresponds to the correct interval in the BWM starting with 444=TTT [22,23) is the interval caused by T5...5TT

Now, as I step through, starting at 444, thing make sense:

p kmer
$12 = std::vector of length 3, capacity 3 = {4 '\004', 4 '\004', 4 '\004\'} p sa_intervals $13 = std::list = {[0] = {first = 19, second = 22}, [1] = {first = 22, second = 23}}

Then we step into bidir_search_bwd from map.cpp, we have kmer=444 and c=1

As we step in here

while (it!=sa_intervals.end() && it_rev!=sa_intervals_rev.end() && it_s!=sites.end()) { //calculate sum to return- can do this in top fcns => if (bidir_search(csa,(*it).first,(*it).second,(*it_rev).first,(*it_rev).second,c)>0)

we get into bidir_search.cpp with input args (left, right,c) = (19,22,1)

We get c_begin=1 - makes sense Then:

//rank_l is the number of times c occurs in BWT[0,left]  - number of 1s in [0,19]                                                                                      

(gdb) p rank_l
$19 = 1 << correct, look at RH column of BWM above

// s is total number of times                                                                                                                  
//      chars smaller than c occur in BWT[left,right]                                                                                          

(gdb) p s
$20 = 0 <<<< correct

// b is total number of times                                                                                                                  
//      chars bigger than c occur in BWT[left,right] 

(gdb) p b
$21 = 3 <<< correct

Now this:

//rank_r is the number of times c occurs in BWT[0,right]                                                                                       
uint64_t rank_r = right - left - s - b + rank_l - 1;

p rank_r | $28 = 0

If we look at RH column the BWM, this should be 1 - it certainly can't be less than rank_l which was 1

That I think must be wrong.

As a result of this

left = c_begin + rank_l;

(gdb) p left $29 = (uint64_t &) @0x69f8f0: 2

right = c_begin + rank_r + 1;

(gdb) p right $30 = (uint64_t &) @0x69f8f8: 2

This is because rank_r<rank_l

Back in bidir_search_bwd.cpp, we return here

                    while (it!=sa_intervals.end() && it_rev!=sa_intervals_rev.end() && it_s!=sites.end()) {
                      //calculate sum to return- can do this in top fcns                                                                       
                      if (bidir_search(csa,(*it).first,(*it).second,(*it_rev).first,(*it_rev).second,c)>0) {
                        ++it;
                        ++it_rev;
                        ++it_s;
                      }

That if condition fails, because bidir_search returns right-left=0. So we never increment the iterator. We now have iterator.first=2, iterator.second=2 (left and right), which we then erase it=sa_intervals.erase(it);

And from then on there's no way the read can map. The next loop round, we start on the sa_interval [22,23) which is the false one from T5...5TT, which doesn't extend to allow the read to map. Again we go into bidir_search rank_l=1 rank_r=0 s=0 b=1 and that results in right=left=2

.. etc - the read fails to map.

iqbal-lab commented 8 years ago

Fixed in 299114fa4acb2336bf3797d60b82f079988430ff