veluca93 / fpnge

Demo of a fast PNG encoder.
Apache License 2.0

Apple Silicon, MacOS AARCH64 (ARM) #29

Closed by manticore-projects 11 months ago

manticore-projects commented 11 months ago

Greetings.

The code does not compile on AArch64 because it relies on SSE/AVX intrinsics, which do not exist there and would need NEON equivalents. Example:

#if defined(__SSE4_2__)
#include <nmmintrin.h>
#elif defined(__aarch64__)
#include <arm_neon.h>
#endif

// z[i] = x[i] + y[i]
void vadd(const int* x, const int* y, int* z, unsigned int count) {
    // process 4 integers (128bits) with simd
    unsigned int i = 0;
    for (; i + 4 <= count; i += 4) {
#if defined(__SSE4_2__)
        const __m128i vx = _mm_lddqu_si128((const __m128i*)(x + i));
        const __m128i vy = _mm_lddqu_si128((const __m128i*)(y + i));
        const __m128i vz = _mm_add_epi32(vx, vy);
        _mm_storeu_si128((__m128i*)(z + i), vz);
#elif defined(__aarch64__)
        const int32x4_t vx = vld1q_s32(x + i);
        const int32x4_t vy = vld1q_s32(y + i);
        const int32x4_t vz = vaddq_s32(vx, vy);
        vst1q_s32(z + i, vz);
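#else
        // scalar fallback: without it, other targets would skip these 4 elements
        for (unsigned int j = 0; j < 4; ++j) {
            z[i + j] = x[i + j] + y[i + j];
        }
#endif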
    }

    // tail loop
    for (; i < count; ++i) {
        z[i] = x[i] + y[i];
    }
}

I have set up a working GitHub pipeline for compiling and testing FPNGe on AArch64. However, I am not a C++ programmer. Would you be able and willing to help me when I have questions about the porting?

manticore-projects commented 11 months ago

The Adler-32 and CRC32 classes could be taken from zlib-ng, which supports AVX/SSE as well as NEON:

Support for CPU intrinsics when available

    Adler32 implementation using SSSE3, AVX2, AVX512, AVX512-VNNI, Neon, VMX & VSX
    CRC32-B implementation using PCLMULQDQ, VPCLMULQDQ, ACLE, & IBM Z
    Hash table implementation using CRC32-C intrinsics on x86 and ARM
    Slide hash implementations using SSE2, AVX2, ARMv6, Neon, VMX & VSX
    Compare256 implementations using SSE2, AVX2, Neon, POWER9 & RVV
    Inflate chunk copying using SSE2, SSSE3, AVX, Neon & VSX
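
Whichever SIMD implementation ends up being used, a plain scalar reference is useful for checking a NEON port against known values. The following is a minimal sketch, not code from fpnge or zlib-ng; the function names `ref_adler32` and `ref_crc32` are hypothetical and chosen only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar Adler-32 as defined for zlib streams (RFC 1950).
 * Pass adler = 1 to start a fresh checksum; pass a previous
 * result to continue over more data. Hypothetical helper name. */
uint32_t ref_adler32(uint32_t adler, const unsigned char *buf, size_t len) {
    uint32_t a = adler & 0xffffu;
    uint32_t b = (adler >> 16) & 0xffffu;
    for (size_t i = 0; i < len; ++i) {
        a = (a + buf[i]) % 65521u; /* 65521 is the largest prime below 2^16 */
        b = (b + a) % 65521u;
    }
    return (b << 16) | a;
}

/* Bitwise CRC-32 with the reflected polynomial 0xEDB88320,
 * the CRC used for PNG chunks. Pass crc = 0 to start a fresh
 * checksum. Slow, but easy to audit. Hypothetical helper name. */
uint32_t ref_crc32(uint32_t crc, const unsigned char *buf, size_t len) {
    crc = ~crc;
    for (size_t i = 0; i < len; ++i) {
        crc ^= buf[i];
        for (int k = 0; k < 8; ++k) {
            /* If the low bit is set, shift and XOR in the polynomial. */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}
```

A ported SIMD routine can then be compared against these functions on random buffers of varying length, including lengths that are not a multiple of the vector width.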

veluca93 commented 11 months ago

Adding support for other architectures has been on my todo list for a while, but I haven't managed to find the time to do it yet...

I was thinking of re-using a similar approach to what I did for https://github.com/libjxl/libjxl/blob/main/lib/jxl/enc_fast_lossless.cc.
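
One common way to structure such a port, which is not necessarily what enc_fast_lossless.cc does, is to hide the intrinsics behind a thin per-target wrapper so the algorithm itself is written once. The sketch below is only an illustration under that assumption; the names `vec_i32x4`, `vec_load`, `vec_add`, `vec_store` and `vadd_portable` are hypothetical and appear in neither fpnge nor libjxl.

```c
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>
typedef __m128i vec_i32x4; /* hypothetical wrapper type: 4 x int32 */
static inline vec_i32x4 vec_load(const int32_t *p) {
    return _mm_loadu_si128((const __m128i *)p); /* unaligned load */
}
static inline vec_i32x4 vec_add(vec_i32x4 a, vec_i32x4 b) {
    return _mm_add_epi32(a, b);
}
static inline void vec_store(int32_t *p, vec_i32x4 v) {
    _mm_storeu_si128((__m128i *)p, v); /* unaligned store */
}
#elif defined(__ARM_NEON)
#include <arm_neon.h>
typedef int32x4_t vec_i32x4;
static inline vec_i32x4 vec_load(const int32_t *p) { return vld1q_s32(p); }
static inline vec_i32x4 vec_add(vec_i32x4 a, vec_i32x4 b) { return vaddq_s32(a, b); }
static inline void vec_store(int32_t *p, vec_i32x4 v) { vst1q_s32(p, v); }
#else
/* Scalar fallback so the code still builds on any other target. */
typedef struct { int32_t lane[4]; } vec_i32x4;
static inline vec_i32x4 vec_load(const int32_t *p) {
    vec_i32x4 v;
    for (int k = 0; k < 4; ++k) v.lane[k] = p[k];
    return v;
}
static inline vec_i32x4 vec_add(vec_i32x4 a, vec_i32x4 b) {
    vec_i32x4 v;
    /* Add as unsigned to get the same wraparound as the SIMD paths. */
    for (int k = 0; k < 4; ++k)
        v.lane[k] = (int32_t)((uint32_t)a.lane[k] + (uint32_t)b.lane[k]);
    return v;
}
static inline void vec_store(int32_t *p, vec_i32x4 v) {
    for (int k = 0; k < 4; ++k) p[k] = v.lane[k];
}
#endif

/* z[i] = x[i] + y[i], written once against the wrapper. */
void vadd_portable(const int32_t *x, const int32_t *y, int32_t *z,
                   unsigned int count) {
    unsigned int i = 0;
    for (; i + 4 <= count; i += 4) {
        vec_store(z + i, vec_add(vec_load(x + i), vec_load(y + i)));
    }
    for (; i < count; ++i) { /* tail loop for the last count % 4 elements */
        z[i] = x[i] + y[i];
    }
}
```

With this layout, adding another architecture means adding one more branch of wrapper definitions rather than touching every loop that uses them.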

I am of course willing to review PRs though :)

manticore-projects commented 11 months ago

First small achievement: emulating AArch64 on an x86 host. Please follow this discussion if you are interested. I will keep posting any progress there.