jeffdaily / parasail-python

Python bindings for the parasail C library.
Other
87 stars 17 forks source link

Silent failure with nw_trace_scan_16 for larger sequences #46

Open ifiddes opened 4 years ago

ifiddes commented 4 years ago

I tried aligning a 81kb sequence to a 101kb sequence with nw_trace_scan_16 and the resulting CIGAR is just 1X. If I repeat this with nw_trace_scan_32, it works fine. Is this intended behavior?

Is there any heuristics I can use to predict which bit size I can use based on the length of the sequences?

jeffdaily commented 4 years ago

You might try nw_trace_scan_sat, where "sat" stands for "saturate". It will try the _8 first, then _16, then _32. It will try to detect integer overflow during the calculation, where the bits for an integer fully saturate (all 1's). Unfortunately, if saturation is detected near the end of the _16 attempt, that's a lot of wasted work. So your performance might suffer.

Honestly, _32 should the widest bit you need for most alignments. The _64 will be quite slow because most CPU ISAs don't have a complete set of 64-bit integer vector operations so I'm forced to emulate them with something slower.

A good heuristic would be to estimate what the biggest score would be if your two sequences were to align perfectly. Perfectly would be max(length(A), length(B)) * match score. If that solution fits into a 16-bit integer, you're probably okay. A 16-bit integer can store up to 216 = 65536. A 32-bit integer can store up to 232 = 2147483647. Your longest sequence was 101,000 characters, already longer than a 16-bit value can store.