unum-cloud / usearch

Fast Open-Source Search & Clustering engine × for Vectors & 🔜 Strings × in C++, C, Python, JavaScript, Rust, Java, Objective-C, Swift, C#, GoLang, and Wolfram 🔍
https://unum-cloud.github.io/usearch/
Apache License 2.0
1.92k stars 109 forks source link

Feature fix definition of the inner product distance #419

Open kmkolasinski opened 1 month ago

kmkolasinski commented 1 month ago

This PR is related to this thread https://github.com/unum-cloud/usearch/issues/405

It proposed a simple fix for Inner Product distance definition

P = - sum(a[i] * b[i])
kmkolasinski commented 1 month ago

Now, I think the problem is somewhere else, inside simsimd package

import simsimd

vf16 = np.array([5.0, 5.0, 5.0], dtype=np.float16)
vi8 = np.array([5, 5, 5], dtype=np.int8)

np.array(simsimd.dot(vf16, vf16)), np.array(simsimd.dot(vi8, vi8))
>> (array(75.), array(1.78813934e-07))
kmkolasinski commented 1 month ago

another place which is probably causing confusion

#if SIMSIMD_TARGET_NEON
        if (viable & simsimd_cap_neon_k)
            switch (kind) {
            case simsimd_metric_dot_k: *m = (m_t)&simsimd_cos_i8_neon, *c = simsimd_cap_neon_k; return;
            case simsimd_metric_cos_k: *m = (m_t)&simsimd_cos_i8_neon, *c = simsimd_cap_neon_k; return;
            case simsimd_metric_l2sq_k: *m = (m_t)&simsimd_l2sq_i8_neon, *c = simsimd_cap_neon_k; return;
            default: break;
            }
#endif
#if SIMSIMD_TARGET_ICE
        if (viable & simsimd_cap_ice_k)
            switch (kind) {
            case simsimd_metric_dot_k: *m = (m_t)&simsimd_cos_i8_ice, *c = simsimd_cap_ice_k; return;
            case simsimd_metric_cos_k: *m = (m_t)&simsimd_cos_i8_ice, *c = simsimd_cap_ice_k; return;
            case simsimd_metric_l2sq_k: *m = (m_t)&simsimd_l2sq_i8_ice, *c = simsimd_cap_ice_k; return;
            default: break;
            }
#endif
#if SIMSIMD_TARGET_HASWELL
        if (viable & simsimd_cap_haswell_k)
            switch (kind) {
            case simsimd_metric_dot_k: *m = (m_t)&simsimd_cos_i8_haswell, *c = simsimd_cap_haswell_k; return;
            case simsimd_metric_cos_k: *m = (m_t)&simsimd_cos_i8_haswell, *c = simsimd_cap_haswell_k; return;
            case simsimd_metric_l2sq_k: *m = (m_t)&simsimd_l2sq_i8_haswell, *c = simsimd_cap_haswell_k; return;
            default: break;
            }
#endif

        if (viable & simsimd_cap_serial_k)
            switch (kind) {
            case simsimd_metric_dot_k: *m = (m_t)&simsimd_cos_i8_serial, *c = simsimd_cap_serial_k; return;
            case simsimd_metric_cos_k: *m = (m_t)&simsimd_cos_i8_serial, *c = simsimd_cap_serial_k; return;
            case simsimd_metric_l2sq_k: *m = (m_t)&simsimd_l2sq_i8_serial, *c = simsimd_cap_serial_k; return;
            default: break;
            }

        break;

cos is used for dot ?

ashvardanian commented 1 month ago

Yes, that was historically the case, but that should probably be changed. I was planning to look into it next week. It may require new kernels, but those should be easy to design by stripping the cosine implementations.

kmkolasinski commented 1 month ago

dot - is not a strict distance metric in math terms, hence probably the confusion in various places in the code

kmkolasinski commented 1 month ago

For int8, I see index.hardware_acceleration == "ice" and there is still something which I don't fully understand. I tested my changes locally by installing usearch from source code, by running

pip install . 

which definitely compiles the usearch code. Maybe the code is already fixed in the main branch as compared to usearch==2.12.0 ?

kmkolasinski commented 1 month ago

Nope to my question above. When running on main branch I get this image then switch to this branch and reinstall usearch image

note also speed change for both runs, is it metric_ip_gt is used only for graph construction ?

kmkolasinski commented 1 month ago

@ashvardanian do you know how it is possible ? I still don't undestand how the kernels are being picked, but it looks like simsimd should be used here, so my changes should not affect the search.

ashvardanian commented 1 month ago

SimSIMD should be used in most Linux builds. But you are right to note, that there is a lot of stuff on the main-dev that hasn't been merged yet 🤗

kmkolasinski commented 2 weeks ago

hi, this probably can be closed, right ?

ashvardanian commented 2 weeks ago

It will be automatically closed, as we merge the branch 🤗

kmkolasinski commented 2 weeks ago

ok, thanks !