Open ifiddes opened 4 years ago
You might try nw_trace_scan_sat
, where "sat" stands for "saturate". It will try the _8 first, then _16, then _32. It will try to detect integer overflow during the calculation, where the bits for an integer fully saturate (all 1's). Unfortunately, if saturation is detected near the end of the _16 attempt, that's a lot of wasted work. So your performance might suffer.
Honestly, _32 should the widest bit you need for most alignments. The _64 will be quite slow because most CPU ISAs don't have a complete set of 64-bit integer vector operations so I'm forced to emulate them with something slower.
A good heuristic would be to estimate what the biggest score would be if your two sequences were to align perfectly. Perfectly would be max(length(A), length(B)) * match score. If that solution fits into a 16-bit integer, you're probably okay. A 16-bit integer can store up to 216 = 65536. A 32-bit integer can store up to 232 = 2147483647. Your longest sequence was 101,000 characters, already longer than a 16-bit value can store.
I tried aligning a 81kb sequence to a 101kb sequence with
nw_trace_scan_16
and the resulting CIGAR is just1X
. If I repeat this withnw_trace_scan_32
, it works fine. Is this intended behavior?Is there any heuristics I can use to predict which bit size I can use based on the length of the sequences?