Open ksahlin opened 1 year ago
Related, we could perhaps print timings for ssw alignment (normal) vs ssw alignmnt (rescue) form more fine grained logging. I just looked at the timings for aligning the BIO150 dataset (Illumina 150PE reads from GIAB) to human.
Some relevant lines below. It seems like the rescue alignment amounts to near half of the times we call SSW, and the extension alignment takes the majority of the time (compare to total mapping time below). I am not sure about the library parameters but I think its something like mu=300-350, sigma=30-50, r=150, meaning we cut out about a size 525-675 on the reference.
Total calls to ssw: 237776607
Calls to ksw (rescue mode): 105966142
Total time mapping: 3929.15 s.
Total time base level alignment (ssw): 2269.20 s.
(One could perhaps also enlarge the initial cut-out segment sent to has_shared_substring
to find more segments that could be rescued over larger deletions.)
The pairwise alignment done in
rescue_mate
cuts out a much larger segment of the reference compared to when we can use seeds. This makes the pairwise alignment much slower for those instances (anecdata). More precisely, we cut out a reference segment of lengthr/2 + mu + 5*sigma
where mu and sigma are fragment size distribution parameters.Before we rescue, we check whether the mate to be rescued share at least one submer of size
2k/3
with the reference segment usingref_seq.find(submer)
. Since this operation returns the location of the hit, the location of this hit can be used to cut out the reference appropriately.Since we already compute the location of the seed, I believe my suggestion comes with close to no compute overhead. We would therefore strictly improve speed (through faster pairwise alignment).
To the potential downsides. The rescue functionality is likely invoked for:
The downsides with proposed approach is that: (i) if we stop at the first seed hit (if many) in repeated regions, the first hit may not lead to the optimal alignment. (ii) If the mate span an indel, we may be able to align over it with our cut out segment of length
r/2 + mu + 5*sigma
, but not with a shorter reference segment.