Have you tried the 7g instances? I generally avoid implementing f32 kernels in SimSIMD, but they should be very easy to add for parity, in case you want to contribute!
@ashvardanian Thank you! I will give the 7g instances a try. Just wondering, which kernels are currently the most optimized?
I'd recommend trying f16 and i8. The f32 should be very easy to add.
@ashvardanian Thank you for the recommendation. I revisited SimSIMD and found that f32 NEON and f32 SVE implementations are already available for l2sq and cosine distance, and the implementations look good to me. What further work could be done to improve them? I could give it a try :)
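For reference, those f32 kernels can be exercised directly through SimSIMD's C interface. A minimal sketch, assuming the simsimd_l2sq_f32_neon / simsimd_cos_f32_neon symbols that follow the SIMSIMD_MAKE_* naming convention quoted later in this thread (exact headers and signatures may differ between versions):

#include <stdio.h>
#include <simsimd/simsimd.h>

int main(void) {
    enum { dim = 784 };
    simsimd_f32_t a[dim], b[dim];
    for (int i = 0; i != dim; ++i) {
        a[i] = (simsimd_f32_t)i;
        b[i] = (simsimd_f32_t)(i + 1);
    }
    simsimd_distance_t l2sq, cos;
    /* Direct calls to the NEON backends; portable code would normally go
       through the dynamically-dispatched entry points instead. */
    simsimd_l2sq_f32_neon(a, b, dim, &l2sq);
    simsimd_cos_f32_neon(a, b, dim, &cos);
    printf("l2sq = %f, cos = %f\n", l2sq, cos);
    return 0;
}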
Interesting 🤷 Not sure if inlining or something else can explain the duration difference in this case.
Hi @breezewish! With version 2.13, the performance degradation may be resolved. Have you tried the more recent releases?
@ashvardanian I have seen a lot of updates in usearch and SimSIMD recently, great work! Here are the latest benchmarks:
v2.12.0:
v2.13.3:
There are still some differences. I will take a look when I have time, as it may reveal further improvements to SimSIMD's hand-written SIMD code.
So far I have discovered that the distance functions' performance differs a lot! It could be one of the reasons (though the difference between _SIMD and _Native is small, which still cannot fully explain what happened here):
--------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
--------------------------------------------------------------------
BM_Distance_Cosine_SIMD      35001693 ns     34692000 ns           19
BM_Distance_Cosine_Serial    58129417 ns     57953545 ns           11
BM_Distance_Cosine_Naive     32211306 ns     32141565 ns           23
BM_Distance_L2_SIMD          33009989 ns     32957545 ns           22
BM_Distance_L2_Serial        53165128 ns     53031000 ns           13
BM_Distance_L2_Native        30661576 ns     30519739 ns           23
- _SIMD is using a dynamically-dispatched distance function selected according to cpuid capability, which is NEON on the Mac M1 Pro.
- _Serial is using SimSIMD's serial distance function.
- _Native is using usearch's own implementation, which is:

// `square` is a small helper defined elsewhere in usearch: value * value.
template <typename scalar_at = float, typename result_at = scalar_at>
struct metric_l2sq_gt {
using scalar_t = scalar_at;
using result_t = result_at;
inline result_t operator()(scalar_t const *a, scalar_t const *b,
std::size_t dim) const noexcept {
result_t ab_deltas_sq{};
#pragma clang loop vectorize(enable)
for (std::size_t i = 0; i != dim; ++i) {
result_t ai = static_cast<result_t>(a[i]);
result_t bi = static_cast<result_t>(b[i]);
ab_deltas_sq += square(ai - bi);
}
return ab_deltas_sq;
}
};
static void BM_Distance_L2_Native(benchmark::State &state) {
auto data = getDataset();
std::random_device dev;
std::mt19937 rng(dev());
std::uniform_int_distribution<std::mt19937::result_type> dist(
0, data.size() - 1);
for (auto _ : state) {
for (size_t i = 0; i < 60000; i++) {
metric_l2sq_gt<float, float> metric;
auto distance = metric(data[dist(rng)].data(),
data[dist(rng)].data(), data[0].size());
benchmark::DoNotOptimize(distance);
}
}
}
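For comparison, the _Serial variant is the same harness pointed at SimSIMD's portable kernel. A sketch of what it looks like, assuming the simsimd_l2sq_f32_serial symbol generated by the SIMSIMD_MAKE_L2SQ macro quoted below (the actual benchmark lives in the linked repo):

// Assumes <simsimd/simsimd.h> is included.
static void BM_Distance_L2_Serial(benchmark::State &state) {
  auto data = getDataset();
  std::random_device dev;
  std::mt19937 rng(dev());
  std::uniform_int_distribution<std::mt19937::result_type> dist(
      0, data.size() - 1);
  for (auto _ : state) {
    for (size_t i = 0; i < 60000; i++) {
      simsimd_distance_t distance;
      // SimSIMD's serial (non-SIMD) L2-squared kernel.
      simsimd_l2sq_f32_serial(data[dist(rng)].data(),
                              data[dist(rng)].data(), data[0].size(),
                              &distance);
      benchmark::DoNotOptimize(distance);
    }
  }
}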
After I added _Pragma("clang loop vectorize(enable)") to SimSIMD's serial implementation, SimSIMD's performance looks better:
--------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
--------------------------------------------------------------------
BM_Distance_Cosine_SIMD      32163248 ns     31995500 ns           20
BM_Distance_Cosine_Serial    30645734 ns     30497000 ns           23
BM_Distance_Cosine_Naive     32121411 ns     31974409 ns           22
BM_Distance_L2_SIMD          31753371 ns     31721636 ns           22
BM_Distance_L2_Serial        30050377 ns     30011652 ns           23
BM_Distance_L2_Native        30317118 ns     30276913 ns           23
#define SIMSIMD_MAKE_L2SQ(name, input_type, accumulator_type, converter) \
SIMSIMD_PUBLIC void simsimd_l2sq_##input_type##_##name(simsimd_##input_type##_t const* a, \
simsimd_##input_type##_t const* b, simsimd_size_t n, \
simsimd_distance_t* result) { \
simsimd_##accumulator_type##_t d2 = 0; \
_Pragma("clang loop vectorize(enable)") for (simsimd_size_t i = 0; i != n; ++i) { \
simsimd_##accumulator_type##_t ai = converter(a[i]); \
simsimd_##accumulator_type##_t bi = converter(b[i]); \
d2 += (ai - bi) * (ai - bi); \
} \
*result = d2; \
}
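/* For orientation: instantiating the macro is what produces the concrete
   symbols used above, e.g. (SIMSIMD_IDENTIFY as the f32 pass-through
   converter is an assumption about the library's internals):

   SIMSIMD_MAKE_L2SQ(serial, f32, f32, SIMSIMD_IDENTIFY)
   // -> SIMSIMD_PUBLIC void simsimd_l2sq_f32_serial(...)
*/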
#define SIMSIMD_MAKE_COS(name, input_type, accumulator_type, converter) \
SIMSIMD_PUBLIC void simsimd_cos_##input_type##_##name(simsimd_##input_type##_t const* a, \
simsimd_##input_type##_t const* b, simsimd_size_t n, \
simsimd_distance_t* result) { \
simsimd_##accumulator_type##_t ab = 0, a2 = 0, b2 = 0; \
_Pragma("clang loop vectorize(enable)") for (simsimd_size_t i = 0; i != n; ++i) { \
simsimd_##accumulator_type##_t ai = converter(a[i]); \
simsimd_##accumulator_type##_t bi = converter(b[i]); \
ab += ai * bi; \
a2 += ai * ai; \
b2 += bi * bi; \
} \
*result = ab != 0 ? (1 - ab * SIMSIMD_RSQRT(a2) * SIMSIMD_RSQRT(b2)) : 1;                \
    }
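For clarity, the macro computes the standard cosine distance, expressed with reciprocal square roots instead of a division:

$$ d_{\cos}(a,b) \;=\; 1 - \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} \;=\; 1 - (a \cdot b)\,\mathrm{rsqrt}\!\left(\lVert a \rVert^2\right)\mathrm{rsqrt}\!\left(\lVert b \rVert^2\right), $$

with the ab != 0 guard returning a distance of 1 when the dot product is zero.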
This could be one improvement that could be upstreamed into SimSIMD.
Clang is generating A64 SIMD instructions: https://godbolt.org/z/5YPY3h9eK
However, the difference between _SIMD and _Native is still not as big as what we observe in the BuildIndex benchmark. I'm still trying to figure out what the other differences are.
Thanks for the recommendations, @breezewish! The _Pragma directive looks nice, but most compilers don't support it, as far as I remember, so I try to be cautious. For f32 vectors, performance parity between SIMD and serial code is to be expected, as the latter is almost always auto-vectorized in simple data-parallel kernels if you compile with something like -march=native.
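One cautious way to adopt the hint without breaking other compilers is to hide it behind a macro that expands to nothing elsewhere. A minimal sketch (SIMSIMD_VECTORIZE is a hypothetical name, not an existing SimSIMD macro):

#if defined(__clang__)
#define SIMSIMD_VECTORIZE _Pragma("clang loop vectorize(enable)")
#elif defined(__GNUC__)
/* GCC's closest hint; it asserts there are no loop-carried dependencies. */
#define SIMSIMD_VECTORIZE _Pragma("GCC ivdep")
#else
#define SIMSIMD_VECTORIZE /* no-op on other compilers */
#endif

/* Usage inside the serial kernels: */
SIMSIMD_VECTORIZE
for (simsimd_size_t i = 0; i != n; ++i) { /* ... */ }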
The following section of Clang's output looks like loop unrolling: each iteration processes two 4-lane NEON vectors, with the partial sums kept in separate accumulator registers.
fmul v18.4s, v6.4s, v6.4s
fmul v19.4s, v7.4s, v7.4s
fmul v20.4s, v16.4s, v16.4s
fmul v21.4s, v17.4s, v17.4s
fmla v5.4s, v16.4s, v6.4s
fmla v1.4s, v17.4s, v7.4s
fadd v4.4s, v4.4s, v18.4s
fadd v2.4s, v2.4s, v19.4s
fadd v3.4s, v3.4s, v20.4s
fadd v0.4s, v0.4s, v21.4s
It's great for microbenchmarks, especially for larger vectors, but is questionable in larger systems. Seems like the case is closed! Thanks for the help, @breezewish!
Describe the bug
Not sure whether this should be considered a bug. I'm using usearch to insert 60000 vectors × 784d from the Fashion-MNIST dataset.
And here are some interesting findings with index build time:
Compiler flags: -march=native.

Steps to reproduce

USEARCH_USE_SIMSIMD can be changed in USearch.h. The index is built as follows. For more details, please refer to https://github.com/breezewish/usearch-bench:
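The exact build code is in the linked repository; roughly, it boils down to the following usearch C++ calls. A minimal sketch, assuming 784-dimensional f32 vectors and the index_dense_t API:

#include <cstddef>
#include <vector>
#include <usearch/index_dense.hpp>

using namespace unum::usearch;

void build_index(std::vector<std::vector<float>> const &vectors) {
    // L2-squared metric over 784-dimensional f32 vectors.
    metric_punned_t metric(784, metric_kind_t::l2sq_k, scalar_kind_t::f32_k);
    index_dense_t index = index_dense_t::make(metric);
    index.reserve(vectors.size());
    // Insert every vector, using its position as the key.
    for (std::size_t i = 0; i != vectors.size(); ++i)
        index.add(i, vectors[i].data());
}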
Expected behavior
Should SimSIMD always be faster? Not sure whether we could draw some inspiration from the compiled result on the M1 Pro.
USearch version
2.12.0
Operating System
MacOS
Hardware architecture
Arm
Which interface are you using?
C++ implementation
Contact Details
No response
Are you open to being tagged as a contributor?
I am open to being mentioned in the project's .git history as a contributor

Is there an existing issue for this?
Code of Conduct