ashvardanian commented 6 months ago

USearch implementation had 2 layers, the core HNSW structure implemented in index.hpp and the high-level wrapper for dense equidimensional vectors in index_dense.hpp. In this release, we've made the top layer thinner and cleaner, also making this APIs more similar, error-handling more consistent, and builds faster.

Reducing Dependencies & Accelerating Builds

Previously index_dense.hpp had the following STL includes:

#include <thread> // `std::thread`
#include <functional> // `std::function`
#include <vector> // `std::vector`
#include <numeric> // `std::iota`

Those are some of the most common includes in C++ projects, also the ones I like the least. Here are a couple of reasons to hate them, taken from the "C++ Compile Health Watchdog":

name	compilation time	lines of code	binary size
`<functional>`	82 .. 228 ms	12.9 .. 27.4 kLoC	0 .. 141 kB
`<vector>`	32 .. 48 ms	7.1 .. 8.0 kLoC	0 .. 8.2 kB
`<numeric>`	7 .. 13 ms	1.6 .. 2.1 kLoC	0 .. 3.3 kB
`<thread>`	110 .. 189 ms	17.5 .. 20.3 kLoC	0 .. 153 kB

Reduced Memory Consumption for DBMS-like Users

Most databases using USearch, would prefer to have a smaller index at the cost of some functionality. As mentioned by @rschu1ze, if enable_key_lookups is disabled, and some external DBMS is responsible for "key to vector" mappings, the memory consumption of the index_dense_gt can be further reduced.

Other Improvements

In update-heavy workloads, it's possible to encounter an undefined behavior, leading to a corrupt index state.
In aligned-alloc an integer overflow was resolved by @antonio2368.
Search is now exhaustive by default, so if you have ≥10 items in the index and ask for 10 closest vectors, you are expected to get 10, not ≤10.
Extended get, loadFromPath, and viewFromPath APIs for Java by @mccullocht.
On GCC 12 and older the #pragma region warning was resolved by @mbautin.
Casts to uint40_t slot IDs have been fixed by @Ngalstyan4.
Casts to b1x8_t vectors have been fixed by @Ngalstyan4.
Python type annotation has been added by @jamesbraza.
C++ example was patched by @SheldonFung98.
JavaScript duplicate exports patched by @eryue0220 & @johnhorsema.
Checking the number of dimensions in Rust by @jbrummack.
Marking Rust index views unsafe with @bddap.
Documenting reserve-save-load behavior with @GoodKnight.
Patching argument types in C# tests with @brittlewis12.

rschu1ze commented 6 months ago

@ashvardanian I am happy to see this PR getting merged but it is getting bigger and bigger :-) Question: could we get the stability fixes + the reduced mem consumption commits merged as separate PRs into Usearch's main branch? None of them seem to be breaking, and it would be nice to accelerate things a bit.

ashvardanian commented 6 months ago

Agreed @rschu1ze, will merge soon.

ashvardanian commented 6 months ago

@rschu1ze, there is a weird behavior on very large queries, where the distance to matches isn't monotonically decreasing in some cases. The new extended tests catch it, but I am not sure if that issue was present in the past. I'd really like at least that one issue to be resolved before merging.

Everyone is welcome to join the bug hunt 🤗

Ngalstyan4 commented 6 months ago

422 might be related. It had caused similar symptoms in lantern - distances would not be monotonically decreasing in index search results.

This was caught in the development branch of lantern, once we added quantization support and tests:

-- create an 8-bit quantized index
CREATE INDEX ind8 ON sift_base1k USING lantern_hnsw (v_scaled) WITH (dim=128, M=8, quant_bits=8);
-- look for vectors closest to the vector with id 42
SELECT id, v_scaled <-> :'v_scaled42' as dist FROM sift_base1k ORDER BY 2 LIMIT 10;

Below is the diff with the bug fix

diff -u /lantern_shared/test/expected/hnsw_sq.out /tmp/lantern/tmp_output/results/hnsw_sq.out
--- /lantern_shared/test/expected/hnsw_sq.out   2024-05-24 04:43:03.172237129 +0000
+++ /tmp/lantern/tmp_output/results/hnsw_sq.out 2024-05-25 03:18:11.796896512 +0000
@@ -106,15 +106,15 @@
  id  |   dist
 -----+-----------
   42 |         0
- 285 |   16.7111
- 261 |   16.0935
- 195 | 17.061296
-  48 | 5.1038003
-  50 | 11.509201
   36 |     1.053
-  46 | 10.790699
- 216 | 18.992302
+  48 | 5.1038003
   39 |  5.626501
+ 886 |  7.163699
+ 402 | 7.7013006
+ 518 |  8.502399
+ 331 |  8.779598
+ 340 |  9.726098
+  46 | 10.790699
 (10 rows)

ashvardanian commented 6 months ago

Thanks for the patch, @Ngalstyan4! I've merged it, but it doesn't solve the issue we have with tests right now. We still have a failing assertion at cpp/test.cpp:426 where we check that the output distances are monotonically increasing for long-tail queries, like getting top-500 closest neighbors. Help always appreciated 🤗

unum-cloud / usearch

Error Handling, Update Stability, Improved Java SDK #402

Reducing Dependencies & Accelerating Builds

Reduced Memory Consumption for DBMS-like Users

Other Improvements

422 might be related. It had caused similar symptoms in lantern - distances would not be monotonically decreasing in index search results.