unum-cloud / usearch

Fast Open-Source Search & Clustering engine Ɨ for Vectors & šŸ”œ Strings Ɨ in C++, C, Python, JavaScript, Rust, Java, Objective-C, Swift, C#, GoLang, and Wolfram šŸ”
https://unum-cloud.github.io/usearch/
Apache License 2.0

Bug: Index build is 40% slower when SIMSIMD is enabled on Mac M1 Pro #448

Open breezewish opened 1 week ago

breezewish commented 1 week ago

Describe the bug

Not sure whether this should be considered a bug. I'm using usearch to insert 60,000 vectors Ɨ 784 dimensions from the Fashion-MNIST dataset.

Here are some interesting findings about the index build time:

Compiler flags: -march=native.

Steps to reproduce

git clone --recurse-submodules https://github.com/breezewish/usearch-bench
cd usearch-bench
cd bench_dataset
wget https://ann-benchmarks.com/fashion-mnist-784-euclidean.hdf5
cd ..
mkdir cmake-build-Release
cd cmake-build-Release
cmake .. -DCMAKE_BUILD_TYPE=RELWITHDEBINFO -GNinja && ninja Main && ./Main

USEARCH_USE_SIMSIMD can be toggled in USearch.h.
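For anyone reproducing this without editing the header: the macro can usually be overridden at compile time instead (a sketch, assuming USearch's usual guard that only defines the macro when it is not already set):

```cpp
// Sketch: pin the kernel choice before any USearch header is included,
// either with this define or with -DUSEARCH_USE_SIMSIMD=0 on the
// compiler command line. Assumes the header only defines the macro
// when it is not already set.
#define USEARCH_USE_SIMSIMD 0 // 0 = portable kernels, 1 = SimSIMD kernels
#include <usearch/index_dense.hpp>
```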

The index is built as follows. For more details, please refer to https://github.com/breezewish/usearch-bench:


class Index {
   public:
    void build() {
        auto data = getDataset();

        // Cosine metric over f32 scalars, matching the dataset's 784 dims.
        auto metric = unum::usearch::metric_punned_t(
            /* dimensions */ data[0].size(),
            /* metric_kind */ unum::usearch::metric_kind_t::cos_k,
            unum::usearch::scalar_kind_t::f32_k);

        index_ = ImplType::make(metric);
        index_.reserve(data.size());

        // Insert all vectors sequentially, keyed by row index.
        for (uint64_t i = 0; i < data.size(); ++i) {
            index_.add(/* key */ i, data[i].data());
        }
    }

   private:
    using ImplType =
        unum::usearch::index_dense_gt</* key_at */ uint64_t,
                                      /* compressed_slot_at */ uint32_t>;

    ImplType index_;
};

Expected behavior

Shouldn't SIMSIMD always be faster? Perhaps the compiled output on the M1 Pro offers some clues.

USearch version

2.12.0

Operating System

macOS

Hardware architecture

Arm

Which interface are you using?

C++ implementation

Contact Details

No response


ashvardanian commented 1 week ago

Have you tried the 7g instances? I generally avoid implementing f32 kernels in SimSIMD, but they should be very easy to add for parity, in case you want to contribute šŸ¤—

breezewish commented 1 week ago

@ashvardanian Thank you! I'll give the 7g instances a try. Just wondering, which kernels are currently the most optimized?

ashvardanian commented 1 week ago

I'd recommend trying f16 and i8. The f32 should be very easy to add.

breezewish commented 1 week ago

@ashvardanian Thank you for the recommendation. I revisited SimSIMD and found that f32 NEON and f32 SVE implementations are already available for l2sq and cosine distance, and the implementations look good to me. What further work could be done to improve them? I could give it a try :)

ashvardanian commented 1 week ago

Interesting šŸ¤· I'm not sure whether inlining or something else explains the duration difference in this case.