jmarshall closed this 2 months ago
That's impressive. On my intel machine SSSE3 appears to be running about 4x faster than the default version. You're getting 8-9x speed up which is great.
PS. The mythical htslib v2.0 will not use nibble encoding in memory. It's a royal pain in the derriere!
I updated the timings to values from the final code and test code in this PR (previously the timings were from earlier development code), included the 1-nibble-at-a-time naive scalar code timings, and added some x86_64 timings on a different iteration count that makes the times on one row comparable. (Not totally sure I believe that 916 seconds! And I only ran it once. But it's indicative that the difference between the two scalar versions is much bigger on x86_64.)
It's not really meaningful to compare between the columns, but what it does tell you is this: the 2-nibbles-at-a-time scalar code is a much smaller win over 1-nibble-at-a-time on ARM than on x86_64. So this suggests that the apparent 8-9x speed up might not be that astonishing, because there is a lot left on the table by the 2-at-a-time scalar code on ARM, at least as it is written here. Or something — who knows, it's a nice speed up anyway :smile:
Yep, very nice and thank you.
I hunted down the origin of nibble2base and tracked it back to my initial BAM implementation in io_lib in Jan 2013 :-) It's since been tarted up a bit, avoiding the need for unaligned-access ifdefs by using memcpy, but it's essentially the same deal. It got retrofitted into the BAM handling code in htslib shortly after CRAM was added.
https://github.com/jkbonfield/io_lib/blob/master/io_lib/bam.c#L3650-L3657
It may be possible to apply a similar approach for Intel via the unpack instruction.
I think it would be possible, and possibly more optimal. Interesting find!
Did a quick test in Sequali; the following code is simpler and all tests pass:
```c
static void
decode_bam_sequence_ssse3(uint8_t *dest, const uint8_t *encoded_sequence, size_t length)
{
    static const uint8_t *nuc_lookup = (uint8_t *)"=ACMGRSVTWYHKDBN";
    const uint8_t *dest_end_ptr = dest + length;
    uint8_t *dest_cursor = dest;
    const uint8_t *encoded_cursor = encoded_sequence;
    const uint8_t *dest_vec_end_ptr = dest_end_ptr - (2 * sizeof(__m128i) - 1);
    __m128i nuc_lookup_vec = _mm_lddqu_si128((__m128i *)nuc_lookup);
    /* Nucleotides are encoded 4 bits per nucleotide and stored in 8-bit bytes
       as follows: |AB|CD|EF|GH|. The 4-bit codes (going from 0-15) can be used
       together with the pshufb instruction as a lookup table. The most efficient
       way is to use bitwise AND and shift to create two vectors: one with all
       the upper codes (|A|C|E|G|) and one with the lower codes (|B|D|F|H|).
       The lookup can then be performed and the resulting vectors can be
       interleaved again using the unpack instructions. */
    while (dest_cursor < dest_vec_end_ptr) {
        __m128i encoded = _mm_lddqu_si128((__m128i *)encoded_cursor);
        __m128i encoded_upper = _mm_srli_epi64(encoded, 4);
        encoded_upper = _mm_and_si128(encoded_upper, _mm_set1_epi8(15));
        __m128i encoded_lower = _mm_and_si128(encoded, _mm_set1_epi8(15));
        __m128i nucs_upper = _mm_shuffle_epi8(nuc_lookup_vec, encoded_upper);
        __m128i nucs_lower = _mm_shuffle_epi8(nuc_lookup_vec, encoded_lower);
        __m128i first_nucleotides = _mm_unpacklo_epi8(nucs_upper, nucs_lower);
        __m128i second_nucleotides = _mm_unpackhi_epi8(nucs_upper, nucs_lower);
        _mm_storeu_si128((__m128i *)dest_cursor, first_nucleotides);
        _mm_storeu_si128((__m128i *)(dest_cursor + 16), second_nucleotides);
        encoded_cursor += sizeof(__m128i);
        dest_cursor += 2 * sizeof(__m128i);
    }
    decode_bam_sequence_default(dest_cursor, encoded_cursor, dest_end_ptr - dest_cursor);
}
```
Thanks @jmarshall! This is a great improvement codewise!
EDIT: (Of course I will contribute this after this is merged, or feel free to incorporate it right away).
When building on my Mac without using ./configure first, I get:
```
simd.c:63:19: warning: unused function 'cpu_supports_neon' [-Wunused-function]
static inline int cpu_supports_neon(void) {
```
This is because use of the function is gated by `defined __ARM_NEON` but its definition is not. Presumably this warning would also appear when building on an ARM variant that doesn't have NEON.
Oops. I think it was more because on ARM non-configure builds, it wasn't building nibble2base_neon() because HAVE_ATTRIBUTE_CONSTRUCTOR wasn't defined. Fixed now.
Probably in theory the definition of cpu_supports_neon() should be gated by `#if defined BUILDING_SIMD_NIBBLE2BASE || defined BUILDING_SIMD_WHATEVER_ELSE || …`, but life is short, and ARM compilers should support building these functions even if the eventual worker CPU doesn't support running them…
Cheers, that's fixed the problem.
This PR moves nibble2base_ssse3() from an internal header file to a new source file, simd.c, and adds an ARM implementation nibble2base_neon() alongside. Because these functions are always called via a pointer, they are never inlined, so there is nothing gained (other than build complexity!) by having them in a header.

I initially translated the SSSE3 algorithm directly to Neon, during which I noticed some tweaks that could be usefully applied to the SSSE3 version — see the “Minor improvements” commit below. I then realised that the Neon implementation could be simplified by using Neon's rather excellent interleaved load/store instructions, which I then replaced by an in-register zip as it is (sadly) faster (on aarch64 at least). (It may be possible to apply a similar approach for Intel via the unpack instruction.)
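To make the zip-based approach concrete, the Neon inner loop might look roughly like the sketch below. This is my own illustrative naming, aarch64-only, and not necessarily what simd.c actually contains; vqtbl1q_u8 plays the role pshufb plays on x86, and vzipq_u8 does the interleave that the unpack instructions do there.

```c
#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of a zip-based Neon nibble decode (aarch64 intrinsics).
   Names are illustrative, not the ones used in simd.c. */
static void nibble2base_neon_sketch(const uint8_t *nib, char *seq, size_t nbases)
{
    const uint8x16_t lut = vld1q_u8((const uint8_t *)"=ACMGRSVTWYHKDBN");
    const uint8x16_t low_mask = vdupq_n_u8(0x0f);
    size_t i = 0;
    /* Process 16 encoded bytes -> 32 bases per iteration. */
    while (nbases - i >= 32) {
        uint8x16_t enc = vld1q_u8(nib + i / 2);
        /* Look both nibbles of each byte up in the base alphabet. */
        uint8x16_t hi = vqtbl1q_u8(lut, vshrq_n_u8(enc, 4));
        uint8x16_t lo = vqtbl1q_u8(lut, vandq_u8(enc, low_mask));
        /* In-register zip interleaves high/low bases back into sequence order. */
        uint8x16x2_t bases = vzipq_u8(hi, lo);
        vst1q_u8((uint8_t *)seq + i, bases.val[0]);
        vst1q_u8((uint8_t *)seq + i + 16, bases.val[1]);
        i += 32;
    }
    /* A scalar tail loop would handle the remaining 0-31 bases here. */
}
```

The interleaved-store alternative would replace the vzipq_u8/vst1q_u8 pair with a single vst2q_u8, which is what the description above says turned out to be slower in practice.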
Indicative timings for various versions of the Neon implementation, on ARM and, for interest, an iteration count on x86_64 that makes the nibble2base_default time comparable — numbers are comparable vertically, but not directly horizontally:
I moved the existing code to the new source file in a pair of commits marked as authored by @rhpvorderman and @daviesrob, so that git blame preserves the authorship of their untouched lines. Hence ideally this PR would be merged via a merge commit rather than squashed.