Closed ashvardanian closed 3 months ago
@ashvardanian I am happy to see this PR getting merged but it is getting bigger and bigger :-) Question: could we get the stability fixes + the reduced mem consumption commits merged as separate PRs into Usearch
's main branch? None of them seem to be breaking, and it would be nice to accelerate things a bit.
Agreed @rschu1ze, will merge soon.
@rschu1ze, there is a weird behavior on very large queries, where the distance to matches isn't monotonically decreasing in some cases. The new extended tests catch it, but I am not sure if that issue was present in the past. I'd really like at least that one issue to be resolved before merging.
Everyone is welcome to join the bug hunt 🤗
This was caught in the development branch of lantern, once we added quantization support and tests:
-- create an 8-bit quantized index
CREATE INDEX ind8 ON sift_base1k USING lantern_hnsw (v_scaled) WITH (dim=128, M=8, quant_bits=8);
-- look for vectors closest to the vector with id 42
SELECT id, v_scaled <-> :'v_scaled42' as dist FROM sift_base1k ORDER BY 2 LIMIT 10;
Below is the diff with the bug fix
diff -u /lantern_shared/test/expected/hnsw_sq.out /tmp/lantern/tmp_output/results/hnsw_sq.out
--- /lantern_shared/test/expected/hnsw_sq.out 2024-05-24 04:43:03.172237129 +0000
+++ /tmp/lantern/tmp_output/results/hnsw_sq.out 2024-05-25 03:18:11.796896512 +0000
@@ -106,15 +106,15 @@
id | dist
-----+-----------
42 | 0
- 285 | 16.7111
- 261 | 16.0935
- 195 | 17.061296
- 48 | 5.1038003
- 50 | 11.509201
36 | 1.053
- 46 | 10.790699
- 216 | 18.992302
+ 48 | 5.1038003
39 | 5.626501
+ 886 | 7.163699
+ 402 | 7.7013006
+ 518 | 8.502399
+ 331 | 8.779598
+ 340 | 9.726098
+ 46 | 10.790699
(10 rows)
Thanks for the patch, @Ngalstyan4! I've merged it, but it doesn't solve the issue we have with tests right now. We still have a failing assertion at cpp/test.cpp:426
where we check that the output distances are monotonically increasing for long-tail queries, like getting top-500 closest neighbors. Help always appreciated 🤗
USearch implementation had 2 layers, the core HNSW structure implemented in
index.hpp
and the high-level wrapper for dense equidimensional vectors inindex_dense.hpp
. In this release, we've made the top layer thinner and cleaner, also making this APIs more similar, error-handling more consistent, and builds faster.Reducing Dependencies & Accelerating Builds
Previously
index_dense.hpp
had the following STL includes:Those are some of the most common includes in C++ projects, also the ones I like the least. Here are a couple of reasons to hate them, taken from the "C++ Compile Health Watchdog":
<functional>
<vector>
<numeric>
<thread>
Reduced Memory Consumption for DBMS-like Users
Most databases using USearch, would prefer to have a smaller index at the cost of some functionality. As mentioned by @rschu1ze, if
enable_key_lookups
is disabled, and some external DBMS is responsible for "key to vector" mappings, the memory consumption of theindex_dense_gt
can be further reduced.Other Improvements
get
,loadFromPath
, andviewFromPath
APIs for Java by @mccullocht.#pragma region
warning was resolved by @mbautin.uint40_t
slot IDs have been fixed by @Ngalstyan4.b1x8_t
vectors have been fixed by @Ngalstyan4.