Refactor SIMD code and add ARM Neon nibble2base() implementation

jmarshall commented 3 months ago

This PR moves nibble2base_ssse3() from an internal header file to a new source file, simd.c, and adds an ARM implementation nibble2base_neon() alongside. Because these functions are always called via a pointer, they are never inlined so there is nothing gained (other than build complexity!) by having them in a header.

I initially translated the SSSE3 algorithm directly to Neon, during which I noticed some tweaks that could be usefully applied to the SSSE3 version — see the “Minor improvements” commit below. I then realised that the Neon implementation could be simplified by using Neon's rather excellent interleaved load/store instructions, which I then replaced by an in-register zip as it is (sadly) faster (on aarch64 at least). (It may be possible to apply a similar approach for Intel via the unpack instruction.)

Indicative timings for various versions of the Neon implementation, on ARM and, for interest, an iteration count on x86_64 that makes the nibble2base_default time comparable — numbers are comparable vertically, but not directly horizontally:

Apple M2  i5 SSSE3
 seconds   seconds

 262.389   916.215   nibble2base_single (i.e., scalar code doing one nucleotide at a time)
 186.283   186.778   nibble2base_default (i.e., scalar code doing two nucleotides at a time)
  36.053    46.700   Ruben's nibble2base_ssse3 algorithm
  24.783             interleaved using vst2q_u8
  19.287             interleaved using vzipq_u8

I moved the existing code to the new source file in a pair of commits marked as authored by @rhpvorderman and @daviesrob, so that git blame preserves the authorship of their untouched lines. Hence ideally this PR would be merged via a merge commit rather than squashed.

jkbonfield commented 3 months ago

That's impressive. On my intel machine SSSE3 appears to be running about 4x faster than the default version. You're getting 8-9x speed up which is great.

PS. The mythical htslib v2.0 will not use nibble encoding in memory. It's a royal pain in the derriere!

jmarshall commented 3 months ago

I updated the timings to values from the final code and test code in this PR (previously the timings were from earlier development code), included the 1-nibble-at-a-time naive scalar code timings, and added some x86_64 timings on a different iteration count that makes the times on one row comparable. (Not totally sure I believe that 916 seconds! And I only ran it once. But it's indicative that the difference between the two scalar versions is much bigger on x86_64.)

It's not really meaningful to compare between the columns, but what it does tell you is this: the 2-nibbles-at-a-time scalar code is a much smaller win over 1-nibble-at-a-time on ARM than on x86_64. So this suggests that the apparent 8-9x speed up might not be that astonishing, because there is a lot left on the table by the 2-at-a-time scalar code on ARM, at least as it is written here. Or something — who knows, it's a nice speed up anyway :smile:

jkbonfield commented 3 months ago

Yep, very nice and thank you.

I hunted the origin of the nibble2base and backtracked it to my initial BAM implementation in io_lib in Jan 2013 :-) It's since been tarted up a bit and avoiding the need for unaligned access ifdefs by use of memcpy, but essentially the same deal. It got retrofitted into the BAM handling code in htslib shortly after CRAM was added.

https://github.com/jkbonfield/io_lib/blob/master/io_lib/bam.c#L3650-L3657

rhpvorderman commented 3 months ago

It may be possible to apply a similar approach for Intel via the unpack instruction.

I think it would be possible, and possibly more optimal. Interesting find!

rhpvorderman commented 3 months ago

Did a quick test in Sequali, the following code is simpler and all tests pass:

static void 
decode_bam_sequence_ssse3(uint8_t *dest, const uint8_t *encoded_sequence, size_t length) 
{

    static const uint8_t *nuc_lookup = (uint8_t *)"=ACMGRSVTWYHKDBN";
    const uint8_t *dest_end_ptr = dest + length;
    uint8_t *dest_cursor = dest;
    const uint8_t *encoded_cursor = encoded_sequence;
    const uint8_t *dest_vec_end_ptr = dest_end_ptr - (2 * sizeof(__m128i) - 1);
    __m128i nuc_lookup_vec = _mm_lddqu_si128((__m128i *)nuc_lookup);
    /* Nucleotides are encoded 4-bits per nucleotide and stored in 8-bit bytes 
       as follows: |AB|CD|EF|GH|. The 4-bit codes (going from 0-15) can be used 
       together with the pshufb instruction as a lookup table. The most efficient
       way is to use bitwise AND and shift to create two vectors. One with all 
       the upper codes (|A|C|E|G|) and one with the lower codes (|B|D|F|H|). 
       The lookup can then be performed and the resulting vectors can be 
       interleaved again using the unpack instructions. */
    while (dest_cursor < dest_vec_end_ptr) {
        __m128i encoded = _mm_lddqu_si128((__m128i *)encoded_cursor);
        __m128i encoded_upper = _mm_srli_epi64(encoded, 4);
        encoded_upper = _mm_and_si128(encoded_upper, _mm_set1_epi8(15));
        __m128i encoded_lower = _mm_and_si128(encoded, _mm_set1_epi8(15));
        __m128i nucs_upper = _mm_shuffle_epi8(nuc_lookup_vec, encoded_upper);
        __m128i nucs_lower = _mm_shuffle_epi8(nuc_lookup_vec, encoded_lower);
        __m128i first_nucleotides = _mm_unpacklo_epi8(nucs_upper, nucs_lower);
        __m128i second_nucleotides = _mm_unpackhi_epi8(nucs_upper, nucs_lower);
        _mm_storeu_si128((__m128i *)dest_cursor, first_nucleotides);
        _mm_storeu_si128((__m128i *)(dest_cursor + 16), second_nucleotides);
        encoded_cursor += sizeof(__m128i);
        dest_cursor += 2 * sizeof(__m128i);
    }
    decode_bam_sequence_default(dest_cursor, encoded_cursor, dest_end_ptr - dest_cursor);
}

Thanks @jmarshall! This is a great improvement codewise!

EDIT: (Of course I will contribute this after this is merged, or feel free to incorporate it right away).

daviesrob commented 3 months ago

When building on my Mac without using ./configure first, I get:

simd.c:63:19: warning: unused function 'cpu_supports_neon' [-Wunused-function]
static inline int cpu_supports_neon(void) {

This is because use of the function is gated by defined __ARM_NEON but its definition is not. Presumably this warning would also appear when building on an ARM variant that doesn't have NEON.

jmarshall commented 3 months ago

Oops. I think it was more because on ARM non-configure builds, it wasn't building nibble2base_neon() because HAVE_ATTRIBUTE_CONSTRUCTOR wasn't defined. Fixed now.

Probably in theory the definition of cpu_supports_neon() should be gated by #if defined BUILDING_SIMD_NIBBLE2BASE || defined BUILDING_SIMD_WHATEVER_ELSE || … but life is short and ARM compilers should be supporting building these functions even if the eventual worker CPU doesn't support running them…

daviesrob commented 3 months ago

Cheers, that's fixed the problem.

samtools / htslib

Refactor SIMD code and add ARM Neon nibble2base() implementation #1795