Have you tried the 7g instances? I generally avoid implementing f32 kernels in SimSIMD, but they should be very easy to add for parity, in case you want to contribute!
@ashvardanian Thank you! I will give the 7g instances a try. Just wondering, which kernels are currently the most optimized?
I'd recommend trying f16 and i8. The f32 should be very easy to add.
@ashvardanian Thank you for the recommendation. I revisited SimSIMD and found that f32 NEON and f32 SVE implementations are already available for l2sq and cosine distance, and the implementations look good to me. What further work could be done to improve them? I could give it a try :)
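For reference, those f32 kernels can be exercised directly through SimSIMD's C interface. A minimal sketch, assuming the simsimd_l2sq_f32_neon / simsimd_cos_f32_neon symbols that follow the SIMSIMD_MAKE_* naming convention quoted later in this thread (exact headers and signatures may differ between versions):

#include <stdio.h>
#include <simsimd/simsimd.h>

int main(void) {
    enum { dim = 784 };
    simsimd_f32_t a[dim], b[dim];
    for (int i = 0; i != dim; ++i) {
        a[i] = (simsimd_f32_t)i;
        b[i] = (simsimd_f32_t)(i + 1);
    }
    simsimd_distance_t l2sq, cos;
    /* Direct calls to the NEON backends; portable code would normally go
       through the dynamically-dispatched entry points instead. */
    simsimd_l2sq_f32_neon(a, b, dim, &l2sq);
    simsimd_cos_f32_neon(a, b, dim, &cos);
    printf("l2sq = %f, cos = %f\n", l2sq, cos);
    return 0;
}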
Interesting 🤷 Not sure if inlining or something else can explain the duration difference in this case.
Hi @breezewish! With version 2.13, the performance degradation may be resolved. Have you tried the more recent releases?
@ashvardanian I have seen a lot of updates in usearch and SimSIMD recently, great work! Here are the latest benchmarks:
v2.12.0:
v2.13.3:
There are still some differences. I will take a look when I have time, as it may reveal further improvements to SimSIMD's hand-written SIMD code.
So far I have discovered that the distance functions' performance differs a lot! It could be one of the reasons (though the difference between _SIMD and _Native is small, which still cannot fully explain what happened here):
--------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
--------------------------------------------------------------------
BM_Distance_Cosine_SIMD      35001693 ns     34692000 ns           19
BM_Distance_Cosine_Serial    58129417 ns     57953545 ns           11
BM_Distance_Cosine_Naive     32211306 ns     32141565 ns           23
BM_Distance_L2_SIMD          33009989 ns     32957545 ns           22
BM_Distance_L2_Serial        53165128 ns     53031000 ns           13
BM_Distance_L2_Native        30661576 ns     30519739 ns           23
- _SIMD is using a dynamically-dispatched distance function selected according to cpuid capability, which is NEON on the Mac M1 Pro.
- _Serial is using SimSIMD's serial distance function.
- _Native is using usearch's own implementation, which is:

// `square` is a small helper defined elsewhere in usearch: value * value.
template <typename scalar_at = float, typename result_at = scalar_at>
struct metric_l2sq_gt {
using scalar_t = scalar_at;
using result_t = result_at;
inline result_t operator()(scalar_t const *a, scalar_t const *b,
std::size_t dim) const noexcept {
result_t ab_deltas_sq{};
#pragma clang loop vectorize(enable)
for (std::size_t i = 0; i != dim; ++i) {
result_t ai = static_cast<result_t>(a[i]);
result_t bi = static_cast<result_t>(b[i]);
ab_deltas_sq += square(ai - bi);
}
return ab_deltas_sq;
}
};
static void BM_Distance_L2_Native(benchmark::State &state) {
auto data = getDataset();
std::random_device dev;
std::mt19937 rng(dev());
std::uniform_int_distribution<std::mt19937::result_type> dist(
0, data.size() - 1);
for (auto _ : state) {
for (size_t i = 0; i < 60000; i++) {
metric_l2sq_gt<float, float> metric;
auto distance = metric(data[dist(rng)].data(),
data[dist(rng)].data(), data[0].size());
benchmark::DoNotOptimize(distance);
}
}
}
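For comparison, the _Serial variant is the same harness pointed at SimSIMD's portable kernel. A sketch of what it looks like, assuming the simsimd_l2sq_f32_serial symbol generated by the SIMSIMD_MAKE_L2SQ macro quoted below (the actual benchmark lives in the linked repo):

// Assumes <simsimd/simsimd.h> is included.
static void BM_Distance_L2_Serial(benchmark::State &state) {
  auto data = getDataset();
  std::random_device dev;
  std::mt19937 rng(dev());
  std::uniform_int_distribution<std::mt19937::result_type> dist(
      0, data.size() - 1);
  for (auto _ : state) {
    for (size_t i = 0; i < 60000; i++) {
      simsimd_distance_t distance;
      // SimSIMD's serial (non-SIMD) L2-squared kernel.
      simsimd_l2sq_f32_serial(data[dist(rng)].data(),
                              data[dist(rng)].data(), data[0].size(),
                              &distance);
      benchmark::DoNotOptimize(distance);
    }
  }
}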
After I added _Pragma("clang loop vectorize(enable)") to SimSIMD's serial implementation, SimSIMD's performance looks better:
--------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
--------------------------------------------------------------------
BM_Distance_Cosine_SIMD      32163248 ns     31995500 ns           20
BM_Distance_Cosine_Serial    30645734 ns     30497000 ns           23
BM_Distance_Cosine_Naive     32121411 ns     31974409 ns           22
BM_Distance_L2_SIMD          31753371 ns     31721636 ns           22
BM_Distance_L2_Serial        30050377 ns     30011652 ns           23
BM_Distance_L2_Native        30317118 ns     30276913 ns           23
#define SIMSIMD_MAKE_L2SQ(name, input_type, accumulator_type, converter) \
SIMSIMD_PUBLIC void simsimd_l2sq_##input_type##_##name(simsimd_##input_type##_t const* a, \
simsimd_##input_type##_t const* b, simsimd_size_t n, \
simsimd_distance_t* result) { \
simsimd_##accumulator_type##_t d2 = 0; \
_Pragma("clang loop vectorize(enable)") for (simsimd_size_t i = 0; i != n; ++i) { \
simsimd_##accumulator_type##_t ai = converter(a[i]); \
simsimd_##accumulator_type##_t bi = converter(b[i]); \
d2 += (ai - bi) * (ai - bi); \
} \
*result = d2; \
}
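/* For orientation: instantiating the macro is what produces the concrete
   symbols used above, e.g. (SIMSIMD_IDENTIFY as the f32 pass-through
   converter is an assumption about the library's internals):

   SIMSIMD_MAKE_L2SQ(serial, f32, f32, SIMSIMD_IDENTIFY)
   // -> SIMSIMD_PUBLIC void simsimd_l2sq_f32_serial(...)
*/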
#define SIMSIMD_MAKE_COS(name, input_type, accumulator_type, converter) \
SIMSIMD_PUBLIC void simsimd_cos_##input_type##_##name(simsimd_##input_type##_t const* a, \
simsimd_##input_type##_t const* b, simsimd_size_t n, \
simsimd_distance_t* result) { \
simsimd_##accumulator_type##_t ab = 0, a2 = 0, b2 = 0; \
_Pragma("clang loop vectorize(enable)") for (simsimd_size_t i = 0; i != n; ++i) { \
simsimd_##accumulator_type##_t ai = converter(a[i]); \
simsimd_##accumulator_type##_t bi = converter(b[i]); \
ab += ai * bi; \
a2 += ai * ai; \
b2 += bi * bi; \
} \
*result = ab != 0 ? (1 - ab * SIMSIMD_RSQRT(a2) * SIMSIMD_RSQRT(b2)) : 1;                \
    }
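For clarity, the macro computes the standard cosine distance, expressed with reciprocal square roots instead of a division:

$$ d_{\cos}(a,b) \;=\; 1 - \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} \;=\; 1 - (a \cdot b)\,\mathrm{rsqrt}\!\left(\lVert a \rVert^2\right)\mathrm{rsqrt}\!\left(\lVert b \rVert^2\right), $$

with the ab != 0 guard returning a distance of 1 when the dot product is zero.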
This could be one improvement that could be upstreamed into SimSIMD.
Clang is generating A64 SIMD instructions: https://godbolt.org/z/5YPY3h9eK
However, the difference between _SIMD and _Native is still not as big as what we observe in the BuildIndex benchmark. I'm still trying to figure out what the other differences are.
Thanks for the recommendations, @breezewish! The _Pragma directive looks nice, but most compilers don't support it, as far as I remember, so I try to be cautious. For f32 vectors, performance parity between SIMD and serial code is to be expected, as the latter is almost always auto-vectorized in simple data-parallel kernels if you compile with something like -march=native.
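One cautious way to adopt the hint without breaking other compilers is to hide it behind a macro that expands to nothing elsewhere. A minimal sketch (SIMSIMD_VECTORIZE is a hypothetical name, not an existing SimSIMD macro):

#if defined(__clang__)
#define SIMSIMD_VECTORIZE _Pragma("clang loop vectorize(enable)")
#elif defined(__GNUC__)
/* GCC's closest hint; it asserts there are no loop-carried dependencies. */
#define SIMSIMD_VECTORIZE _Pragma("GCC ivdep")
#else
#define SIMSIMD_VECTORIZE /* no-op on other compilers */
#endif

/* Usage inside the serial kernels: */
SIMSIMD_VECTORIZE
for (simsimd_size_t i = 0; i != n; ++i) { /* ... */ }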
The following section of Clang's output looks like loop unrolling: each iteration processes two 4-lane NEON vectors, with the partial sums kept in separate accumulator registers.
fmul v18.4s, v6.4s, v6.4s
fmul v19.4s, v7.4s, v7.4s
fmul v20.4s, v16.4s, v16.4s
fmul v21.4s, v17.4s, v17.4s
fmla v5.4s, v16.4s, v6.4s
fmla v1.4s, v17.4s, v7.4s
fadd v4.4s, v4.4s, v18.4s
fadd v2.4s, v2.4s, v19.4s
fadd v3.4s, v3.4s, v20.4s
fadd v0.4s, v0.4s, v21.4s
It's great for microbenchmarks, especially for larger vectors, but is questionable in larger systems. Seems like the case is closed! Thanks for the help, @breezewish!
Describe the bug
Not sure whether this should be considered a bug. I'm using usearch to insert 60000 vectors × 784d from the Fashion-MNIST dataset.
And here are some interesting findings with index build time:
Compiler flags: -march=native.

Steps to reproduce

USEARCH_USE_SIMSIMD can be changed in USearch.h. The index is built as follows. For more details, please refer to https://github.com/breezewish/usearch-bench:
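The exact build code is in the linked repository; roughly, it boils down to the following usearch C++ calls. A minimal sketch, assuming 784-dimensional f32 vectors and the index_dense_t API:

#include <cstddef>
#include <vector>
#include <usearch/index_dense.hpp>

using namespace unum::usearch;

void build_index(std::vector<std::vector<float>> const &vectors) {
    // L2-squared metric over 784-dimensional f32 vectors.
    metric_punned_t metric(784, metric_kind_t::l2sq_k, scalar_kind_t::f32_k);
    index_dense_t index = index_dense_t::make(metric);
    index.reserve(vectors.size());
    // Insert every vector, using its position as the key.
    for (std::size_t i = 0; i != vectors.size(); ++i)
        index.add(i, vectors[i].data());
}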
Expected behavior
Should SimSIMD always be faster? Not sure whether we could draw some inspiration from the compiled result on the M1 Pro.
USearch version
2.12.0
Operating System
MacOS
Hardware architecture
Arm
Which interface are you using?
C++ implementation
Contact Details
No response
Are you open to being tagged as a contributor?
I am open to being mentioned in the project's .git history as a contributor

Is there an existing issue for this?
Code of Conduct