unum-cloud / usearch

Fast Open-Source Search & Clustering engine × for Vectors & 🔜 Strings × in C++, C, Python, JavaScript, Rust, Java, Objective-C, Swift, C#, GoLang, and Wolfram 🔍
https://unum-cloud.github.io/usearch/
Apache License 2.0

Feature: Brute force (KNN) search without index #249

Closed · guyrosin closed this 1 year ago

guyrosin commented 1 year ago

Describe what you are looking for

Please add an option to run brute-force (KNN) search without an index. It would be useful for cases where only a small number of searches needs to be performed. Equivalent functionality is available in both faiss and lance.

Thanks for this awesome library!
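
For reference, the faiss brute-force call being requested here looks roughly like this (a minimal sketch; the shapes and metric are illustrative):

import faiss
import numpy as np

database = np.random.rand(10_000, 300).astype("float32")  # dataset vectors
queries = np.random.rand(5, 300).astype("float32")        # a handful of queries

# Exact (brute-force) KNN over the raw arrays, no index involved
D, I = faiss.knn(queries, database, 10, metric=faiss.METRIC_INNER_PRODUCT)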

Is your feature request specific to a certain interface?

It applies to everything

Contact Details

guy.rosin@gmail.com

Arman-Ghazaryan commented 1 year ago

Have you checked this part of README.md? Seems like that functionality is already present. cc @guyrosin
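
For reference, the README functionality being pointed to is the index-free exact search helper, roughly like this (a sketch; shapes illustrative):

from usearch.index import search, MetricKind
import numpy as np

vectors = np.random.rand(10_000, 256).astype("float32")
vector = np.random.rand(256).astype("float32")

# exact=True triggers a brute-force scan over `vectors`; no index is built
matches = search(vectors, vector, 10, MetricKind.L2sq, exact=True)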

ashvardanian commented 1 year ago

Hey, @guyrosin! Please check the link @Arman-Ghazaryan has shared and let us know if we are missing something. You may also want to explore some other documented functionality, like clustering, depending on your use case.

guyrosin commented 1 year ago

Thank you for the fast response! It's indeed the functionality I was looking for :)

guyrosin commented 1 year ago

@ashvardanian @Arman-Ghazaryan Shouldn't a fair performance comparison be done against faiss's knn() method rather than against an index? I've just run a small benchmark (see below), and knn() is the fastest option most of the time... or am I missing something?

To add a few more details: my use case has ~100M vectors of 300 dimensions and a small number of queries (<10). So far I've been using faiss, running the above-mentioned knn() method on chunks of the data (searching the whole dataset at once would run out of memory) and finally combining the results using a ResultHeap. I've recently been wondering whether it could be done faster using usearch :)
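
For context, the chunked faiss workflow described above might look like the sketch below (a hypothetical reconstruction; the chunk size is illustrative, and keep_max=True is assumed because inner product is a similarity, where larger is better):

import faiss
import numpy as np

def chunked_knn(queries, dataset, k, chunk=1_000_000):
    # ResultHeap keeps a running top-k across all processed chunks
    heap = faiss.ResultHeap(queries.shape[0], k, keep_max=True)
    for start in range(0, dataset.shape[0], chunk):
        block = dataset[start : start + chunk]
        D, I = faiss.knn(queries, block, k, metric=faiss.METRIC_INNER_PRODUCT)
        heap.add_result(D, I + start)  # offset indices into the full dataset
    heap.finalize()
    return heap.D, heap.I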

[screenshot: benchmark timings]
ashvardanian commented 1 year ago

@guyrosin, interesting result! Let me double check :)

Can you please print index.specs and share what kind of hardware you are running on? Wouldn't it be better to store and compare vectors in at least f16, or maybe even i8, representations, assuming you have 100 million of them?
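
For what it's worth, declaring an index with a compact scalar type looks roughly like this (a sketch following the usearch Python README; exact kwargs may differ by version):

from usearch.index import Index

# Vectors are quantized to 8-bit integers on insertion ("f16" for half precision)
index = Index(ndim=300, metric="ip", dtype="i8")
print(index.specs)  # the diagnostic requested above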

ashvardanian commented 1 year ago

Please make sure you have upgraded to the most recent version. We had an intra-day release that could have significantly affected performance 🤗

ashvardanian commented 1 year ago

I am also not sure which figure to look at. I guess you should look at the "Wall time", and should probably use %%timeit -n 1 -r 10 to reduce variance.
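
In a notebook cell, that would look something like (illustrative):

%%timeit -n 1 -r 10
matches = search(vectors, vector, k, exact=True)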

guyrosin commented 1 year ago

Oh, a sweet speedup indeed: search() now takes ~1s instead of 1.3s in the previous version! And you're right, I guess using smaller representations is one of usearch's advantages over faiss (but I wanted to start with a 100% fair comparison).

guyrosin commented 1 year ago

Well, using timeit as you suggested produced very different timings, and now usearch is indeed the fastest! (even with float32)

[screenshot: timeit benchmark results]

guyrosin commented 1 year ago

@ashvardanian when running search() on float16 vectors, the code crashes with an illegal hardware instruction (core dumped). Any ideas? (Let me know if you'd prefer to continue in a separate issue.)

I'm running on an x86-64 EC2 machine with 32 GB RAM (Ubuntu 22.04).

import faiss
import numpy as np
from usearch.index import search

np.random.seed(42)

vector_size = 1024
vectors = np.random.rand(10**5, vector_size).astype("float16")  # dataset of 100k f16 vectors
vector = np.random.rand(1, vector_size).astype("float16")  # a single query
k = 10

D, I = faiss.knn(vector, vectors, k, metric=faiss.METRIC_INNER_PRODUCT)  # <-- this runs

matches = search(vectors, vector, k, exact=True)  # <-- this crashes
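
(Until a fix lands, a plausible workaround is upcasting to float32 before searching, since the f32 path works; this is an assumption, not a confirmed fix:)

matches = search(vectors.astype("float32"), vector.astype("float32"), k, exact=True)  # assumed workaround
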
ashvardanian commented 1 year ago

Oh, thanks for the feedback! Can you print detailed CPU specs? I'll fix it in a sec :)

guyrosin commented 1 year ago

I hope that's what you meant:

> lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  8
  On-line CPU(s) list:   0-7
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
    CPU family:          6
    Model:               85
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            7
    BogoMIPS:            4999.99
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni
Virtualization features: 
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):     
  L1d:                   128 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    4 MiB (4 instances)
  L3:                    35.8 MiB (1 instance)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-7
Vulnerabilities:         
  Gather data sampling:  Unknown: Dependent on hypervisor status
  Itlb multihit:         KVM: Mitigation: VMX unsupported
  L1tf:                  Mitigation; PTE Inversion
  Mds:                   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
  Meltdown:              Mitigation; PTI
  Mmio stale data:       Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
  Retbleed:              Vulnerable
  Spec store bypass:     Vulnerable
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected
ashvardanian commented 1 year ago

Oh, wow! Seems like this x86 CPU has every AVX-512 subset except the one we need! A hot fix will take an hour; a proper implementation may take two. And regardless of the variant, there is a 2h CI process to update the PyPI image. So if you are not rushing too much, I'd take the better approach.

guyrosin commented 1 year ago

Oh really! No rush at all - I'll check again tomorrow :) Thanks!

ashvardanian commented 1 year ago

@guyrosin, I've fixed the issue, but it's hard to be confident since I can't access that machine. Please try again tomorrow, but before you do, I'm curious to see your i8 performance on the already-downloaded version vs. the one I'm releasing now. Can you please share those numbers as well?

guyrosin commented 1 year ago

OK, on v2.1.3 with i8 usearch does great:

[screenshot: v2.1.3 i8 benchmark]

On v2.2.0 with i8 - similar results:

[screenshot: v2.2.0 i8 benchmark]

On v2.2.0 with f16 - everything is slower:

[screenshot: v2.2.0 f16 benchmark]

On v2.2.0 with f32, this is interesting. I understand why faiss is faster on f32 than on f16, but I'm not sure why the same holds for usearch... WDYT @ashvardanian? In addition, it's weird that usearch performance is similar on i8 and f32, right? (Anyway, I'm happy to see usearch is still faster :))

[screenshot: v2.2.0 f32 benchmark]
ashvardanian commented 1 year ago

Hey, @guyrosin! I haven't merged the i8 updates yet; I will ping you again soon.


Just to make sure, am I reading the numbers correctly?

              f32      f16      i8
IndexFlatL2   3.58 s   6.81 s   4.86 s
knn           362 ms   3.6 s    1.52 s
search        142 ms   840 ms   141 ms

f16 is expected to be slower than f32 on hardware without native half-precision support, because the arithmetic is simulated in software. That said, I was hoping for better numbers. Let me try something else 🤗
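
That software-emulation penalty is easy to reproduce in plain NumPy, independent of usearch (an illustrative micro-benchmark):

import numpy as np
from timeit import timeit

a = np.random.rand(10**6).astype(np.float32)
b = a.astype(np.float16)

# Without native f16 arithmetic, float16 math is emulated via conversions,
# so the f16 dot product is typically several times slower than f32
print("f32 dot:", timeit(lambda: a @ a, number=100))
print("f16 dot:", timeit(lambda: b @ b, number=100))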

guyrosin commented 1 year ago

Got it. And yeah all the numbers are right!

ashvardanian commented 1 year ago

Epic! Good for us!


In the meantime, the CI for the i8 and additional f16 optimizations is already running. Curious to learn what kind of speedup you'd get there.

guyrosin commented 1 year ago

OK, with v2.3.0 I've got similar timings - maybe i8 got a bit faster.

f32: 148 ms ± 10.3 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
f16: 840 ms ± 4.78 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
i8: 134 ms ± 7.78 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)

(btw, I realized the range of the random integers affects the timing: the i8 measurement above was for values in 0-100, while for 0-10^4 it was 145 ms)

ashvardanian commented 1 year ago

Oh, thank you for taking the time to benchmark, @guyrosin! Great to have you with us! It looks interesting and goes against my expectations. If you could print the following line, we could double-check whether my new optimizations were triggered at all:

usearch.compiled.hardware_acceleration(dtype=ScalarKind.I8, ndim=1024, metric_kind=MetricKind.IP)
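
With imports, the diagnostic reads like this (assuming ScalarKind and MetricKind are exported by usearch.index, as in the snippets above):

from usearch.index import MetricKind, ScalarKind
import usearch.compiled

print(usearch.compiled.hardware_acceleration(
    dtype=ScalarKind.I8, ndim=1024, metric_kind=MetricKind.IP))
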
guyrosin commented 1 year ago

Sure thing! Happy to help. I get False for that line of code... (also for ScalarKind.F16 and ScalarKind.F32)

ashvardanian commented 1 year ago

I now realize I haven't asked: is it Linux, Windows, or macOS?

guyrosin commented 1 year ago

Linux it is (Ubuntu 22.04 if it matters)

ashvardanian commented 1 year ago

Very interesting... I am now on an Arm machine in the cloud, and the right overloads are present.

>>> hardware_acceleration(dtype=ScalarKind.F16, ndim=1024, metric_kind=MetricKind.L2sq)
False
>>> hardware_acceleration(dtype=ScalarKind.F16, ndim=1024, metric_kind=MetricKind.IP)
True
>>> hardware_acceleration(dtype=ScalarKind.I8, ndim=1024, metric_kind=MetricKind.L2sq)
False

Your CPU flags (notably avx512f, avx512bw, avx512vl, and avx512_vnni) suggest the AVX-512 implementations should be getting picked up; the full flag list is in the lscpu output above.

I'll take a look into this.

ashvardanian commented 1 year ago

I think I've found it - my stupid mistake. Let's see if this pipeline passes. The Linux PyPI images may take 2.5h to build, so speak soon 🤗

[screenshot: CI pipeline run, 2023-09-06 23:53]

ashvardanian commented 1 year ago

@gurgenyegoryan, any guess why the Linux builds started taking so much longer? Is it because of the increased number of tests we run, or another reason?

gurgenyegoryan commented 1 year ago

> @gurgenyegoryan, any guess why the Linux builds started taking so much longer? Is it because of the increased number of tests we run, or another reason?

Since the Linux builds run in Docker and the number of tests has increased, that may have contributed to the slower builds.

guyrosin commented 1 year ago

@ashvardanian The f16 optimization works now! But i8 regressed:

f32: 160 ms ± 9.8 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
f16: 213 ms ± 2.83 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
i8: 355 ms ± 5.45 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)

Now hardware_acceleration(ndim=1024, metric_kind=MetricKind.IP) returns avx2 for the f32 and i8 dtypes, and avx2+f16 for f16.

ashvardanian commented 1 year ago

I am glad the f16 performance got better, @guyrosin! I will think about possible issues with i8 and let you know.