Have you checked this part of README.md? Seems like that functionality is already present. cc @guyrosin
Hey, @guyrosin! Please check the link @Arman-Ghazaryan has shared and let us know if we are missing something. You may also want to explore some other documented functionality, like clustering, depending on your use case.
Thank you for the fast response! It's indeed the functionality I was looking for :)
@ashvardanian @Arman-Ghazaryan Shouldn't a fair performance comparison be done against faiss' `knn()` method instead of against an index? I've just run a small benchmark (see below), and `knn()` is the fastest option most of the time... or am I missing something?
To add a few more details: my use case involves ~100M vectors of 300 dimensions and a small number of queries (<10). So far I've been using faiss, running the above-mentioned `knn()` method on chunks of the data (processing it all at once would obviously go OOM), and finally combining the results using a `ResultHeap`. I've recently been wondering whether it can be done faster using usearch :)
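For reference, my chunked pipeline looks roughly like this (a simplified sketch, assuming a faiss version whose `ResultHeap` supports `keep_max`; `iter_chunks` is a placeholder for however the chunks get loaded):

```python
import faiss
import numpy as np

def chunked_knn(queries: np.ndarray, iter_chunks, k: int):
    # Collect the global top-k across chunks with a ResultHeap;
    # keep_max=True because inner product is a similarity, not a distance.
    heap = faiss.ResultHeap(queries.shape[0], k, keep_max=True)
    offset = 0
    for chunk in iter_chunks():  # each chunk is an (n_i, 300) float32 array
        D, I = faiss.knn(queries, chunk, k, metric=faiss.METRIC_INNER_PRODUCT)
        heap.add_result(D, I + offset)  # shift chunk-local ids to global ids
        offset += chunk.shape[0]
    heap.finalize()
    return heap.D, heap.I
```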
@guyrosin, interesting result! Let me double check :)
Can you please print `index.specs` and share what kind of hardware you are running on? Also, wouldn't it be better to store and compare vectors in at least `f16` or maybe even `i8` representations, assuming you have 100 million of them?
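Something along these lines (an untested sketch; the parameter names follow the Python docs):

```python
import numpy as np
from usearch.index import Index

# Sketch: store 300-dim vectors quantized to i8, cutting memory roughly 4x
# compared to f32; usearch down-casts the vectors on `add` and during search.
index = Index(ndim=300, metric='ip', dtype='i8')

vectors = np.random.rand(100_000, 300).astype(np.float32)
index.add(np.arange(len(vectors)), vectors)  # keys first, then vectors

matches = index.search(vectors[42], 10)
print(matches.keys[:3], matches.distances[:3])
```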
Please make sure you have upgraded to the most recent version. We've had an intra-day release that could have significantly affected performance 🤗
I am also not sure which figure to look at. I guess you should look at the "Wall time", and you should probably use `%%timeit -n 1 -r 10` to reduce variance.
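E.g., in a notebook cell, reusing the variable names from the repro script below:

```python
%%timeit -n 1 -r 10
matches = search(vectors, vector, k, exact=True)
```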
Oh, sweet speedup indeed - now `search()` takes ~1 s instead of 1.3 s in the previous version! And you're right - I guess using smaller representations is one of usearch's advantages over faiss (but I wanted to start with a 100% fair comparison).
Well, using `timeit` as you suggested produced very different timings - and now usearch is indeed the fastest (even with `float32`)!
@ashvardanian when running `search()` on `float16` vectors, the code crashes with an `illegal hardware instruction (core dumped)`, any help? (Let me know if you'd prefer to continue in a separate issue.) I'm running on an EC2 x86-64 machine with 32 GB RAM, on Ubuntu 22.04:
```python
import faiss
import numpy as np
from usearch.index import search

np.random.seed(42)

vector_size = 1024
vectors = np.random.rand(10**5, vector_size).astype("float16")
vector = np.random.rand(1, vector_size).astype("float16")
k = 10

D, I = faiss.knn(vector, vectors, k, metric=faiss.METRIC_INNER_PRODUCT)  # <-- this runs
matches = search(vectors, vector, k, exact=True)  # <-- this crashes
```
Oh, thanks for the feedback! Can you print detailed CPU specs? I'll fix it in a sec :)
I hope that's what you meant:
```
> lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  8
  On-line CPU(s) list:   0-7
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
    CPU family:          6
    Model:               85
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            7
    BogoMIPS:            4999.99
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni
Virtualization features:
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):
  L1d:                   128 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    4 MiB (4 instances)
  L3:                    35.8 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-7
Vulnerabilities:
  Gather data sampling:  Unknown: Dependent on hypervisor status
  Itlb multihit:         KVM: Mitigation: VMX unsupported
  L1tf:                  Mitigation; PTE Inversion
  Mds:                   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
  Meltdown:              Mitigation; PTI
  Mmio stale data:       Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
  Retbleed:              Vulnerable
  Spec store bypass:     Vulnerable
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected
```
Oh, wow! Seems like this x86 CPU has every AVX-512 subset except the one we need! A hot fix will take an hour; a proper implementation may take two. And regardless of the variant, there is a 2h CI process to update the PyPI image. So if you are not rushing too much, I'd take the better approach.
Oh really! No rush at all - I'll check again tomorrow :) Thanks!
@guyrosin, I've fixed the issue, but it's hard to be confident, given that I can't access that machine. Please try again tomorrow, but before you do, I'm curious to see your `i8` performance on the already downloaded version vs. the one I'm releasing now. Can you please share those numbers as well?
OK, on v2.1.3 with `i8`, usearch does great:

On v2.2.0 with `i8` - similar results:

On v2.2.0 with `f16` - everything is slower:

On v2.2.0 with `f32` - this is interesting. I understand why faiss is faster on `f32` than on `f16`, but I'm not sure why that's also the case with usearch... WDYT @ashvardanian?

In addition, it's weird that usearch's performance is similar on `i8` and `f32`, right?

(And anyway, I'm happy to see usearch is still faster :))
Hey, @guyrosin! I haven't merged the `i8` updates yet; I will ping you again soon.
Just to make sure, am I reading the numbers correctly?
|  | f32 | f16 | i8 |
| --- | --- | --- | --- |
| `IndexFlatL2` | 3.58 s | 6.81 s | 4.86 s |
| `knn` | 362 ms | 3.6 s | 1.52 s |
| `search` | 142 ms | 840 ms | 141 ms |
`f16` is expected to be slower than `f32` on small datasets without hardware support, because the functionality is simulated in software. That said, I was hoping for better numbers. Let me try something else 🤗
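You can see the same effect in plain NumPy, which also emulates most `float16` arithmetic by converting through wider types (a rough illustration, not our actual kernels):

```python
import numpy as np
from timeit import timeit

a32 = np.random.rand(10**5, 1024).astype(np.float32)
a16 = a32.astype(np.float16)
q32, q16 = a32[0], a16[0]

# Matrix-vector inner products: without native f16 arithmetic,
# the f16 variant is typically several times slower than f32.
print('f32:', timeit(lambda: a32 @ q32, number=10))
print('f16:', timeit(lambda: a16 @ q16, number=10))
```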
Got it. And yeah all the numbers are right!
Epic! Good for us!
In the meantime, the CI for the `i8` and further `f16` optimizations is already running. Curious to learn what kind of speedup you'd get there.
OK, with v2.3.0 I've got similar timings - maybe `i8` got a bit faster:

- `f32`: 148 ms ± 10.3 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
- `f16`: 840 ms ± 4.78 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
- `i8`: 134 ms ± 7.78 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)

(BTW, I realized the range of the random integers affects the time - for example, the above is the measurement for 0-100, while for 0-10^4 it was 145 ms.)
Oh, thank you for taking the time to run benchmarks, @guyrosin! Great to have you with us! It looks interesting and goes against my expectations. If you could print the following line, we can double-check whether my new optimizations were triggered at all:

```python
usearch.compiled.hardware_acceleration(dtype=ScalarKind.I8, ndim=1024, metric_kind=MetricKind.IP)
```
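With the imports spelled out, you could check all three dtypes at once (a quick sketch; depending on the version, the call reports either a boolean or the name of the ISA used):

```python
from usearch.compiled import hardware_acceleration
from usearch.index import MetricKind, ScalarKind

for dtype in (ScalarKind.F32, ScalarKind.F16, ScalarKind.I8):
    isa = hardware_acceleration(dtype=dtype, ndim=1024, metric_kind=MetricKind.IP)
    print(dtype, '->', isa)
```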
Sure thing! Happy to help.
I've got a `False` for that line of code... (also for `ScalarKind.F16` and `ScalarKind.F32`)
I now realize I haven't asked: is it on Linux, Windows, or macOS?
Linux it is (Ubuntu 22.04 if it matters)
Very interesting... I am now on an Arm machine in the cloud, and the right overloads are present:

```python
>>> hardware_acceleration(dtype=ScalarKind.F16, ndim=1024, metric_kind=MetricKind.L2sq)
False
>>> hardware_acceleration(dtype=ScalarKind.F16, ndim=1024, metric_kind=MetricKind.IP)
True
>>> hardware_acceleration(dtype=ScalarKind.I8, ndim=1024, metric_kind=MetricKind.L2sq)
False
```
Your CPU specs suggest that the following implementations should be used, given these CPU flags:

```
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni
```

I'll take a look into this.
I think I've found it - my stupid mistake. Let's see if this pipeline passes. The Linux PyPI images may take 2.5h to build, so speak soon 🤗
@gurgenyegoryan, do you have a guess as to why the Linux builds started taking so much longer? Is it because of the increased number of tests we run, or another reason?
Since the Linux builds run in Docker and the number of tests has increased, that may have contributed to the slower builds.
@ashvardanian The `f16` optimization works now! But `i8` performance degraded:

- `f32`: 160 ms ± 9.8 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
- `f16`: 213 ms ± 2.83 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
- `i8`: 355 ms ± 5.45 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)

Now `hardware_acceleration(dtype=ScalarKind.F32, ndim=1024, metric_kind=MetricKind.IP)` returns `avx2`, and the corresponding calls return `avx2` for `i8` and `avx2+f16` for `f16`.
I am glad the `f16` performance got better, @guyrosin! I will think about possible issues with `i8` and let you know.
### Describe what you are looking for

Please add an option to run a brute-force (exact KNN) search, without building an index. It would be useful for use cases where only a small number of searches has to be done. Such an option is available in both faiss and lance.

Thanks for this awesome library!

### Can you contribute to the implementation?

### Is your feature request specific to a certain interface?

It applies to everything

### Contact Details

guy.rosin@gmail.com

### Is there an existing issue for this?

### Code of Conduct