ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
142 stars 17 forks source link

Make rescue_mate more efficient by returning position in `has_shared_substring`? #286

Open ksahlin opened 1 year ago

ksahlin commented 1 year ago

The pairwise alignment done in rescue_mate cuts out a much larger segment of the reference compared to when we can use seeds. This makes the pairwise alignment much slower for those instances (anecdata). More precisely, we cut out a reference segment of length r/2 + mu + 5*sigma where mu and sigma are fragment size distribution parameters.

Before we rescue, we check whether the mate to be rescued share at least one submer of size 2k/3 with the reference segment using ref_seq.find(submer). Since this operation returns the location of the hit, the location of this hit can be used to cut out the reference appropriately.

Since we already compute the location of the seed, I believe my suggestion comes with close to no compute overhead. We would therefore strictly improve speed (through faster pairwise alignment).

To the potential downsides. The rescue functionality is likely invoked for:

  1. Repetitive regions (due to masked seeds)
  2. In regions containing larger indels and other types of SVs.
  3. Too many mutations/read error between read and reference.

The downsides with proposed approach is that: (i) if we stop at the first seed hit (if many) in repeated regions, the first hit may not lead to the optimal alignment. (ii) If the mate span an indel, we may be able to align over it with our cut out segment of length r/2 + mu + 5*sigma, but not with a shorter reference segment.

ksahlin commented 1 year ago

Related, we could perhaps print timings for ssw alignment (normal) vs ssw alignmnt (rescue) form more fine grained logging. I just looked at the timings for aligning the BIO150 dataset (Illumina 150PE reads from GIAB) to human.

Some relevant lines below. It seems like the rescue alignment amounts to near half of the times we call SSW, and the extension alignment takes the majority of the time (compare to total mapping time below). I am not sure about the library parameters but I think its something like mu=300-350, sigma=30-50, r=150, meaning we cut out about a size 525-675 on the reference.

Total calls to ssw: 237776607
Calls to ksw (rescue mode): 105966142
Total time mapping: 3929.15 s.
Total time base level alignment (ssw): 2269.20 s.

(One could perhaps also enlarge the initial cut-out segment sent to has_shared_substring to find more segments that could be rescued over larger deletions.)