unum-cloud / usearch

Fast Open-Source Search & Clustering engine × for Vectors & 🔜 Strings × in C++, C, Python, JavaScript, Rust, Java, Objective-C, Swift, C#, GoLang, and Wolfram 🔍
https://unum-cloud.github.io/usearch/
Apache License 2.0
1.92k stars 109 forks source link

Error Handling, Update Stability, Improved Java SDK #402

Open ashvardanian opened 2 months ago

ashvardanian commented 2 months ago

USearch implementation had 2 layers, the core HNSW structure implemented in index.hpp and the high-level wrapper for dense equidimensional vectors in index_dense.hpp. In this release, we've made the top layer thinner and cleaner, also making this APIs more similar, error-handling more consistent, and builds faster.

Reducing Dependencies & Accelerating Builds

Previously index_dense.hpp had the following STL includes:

#include <thread> // `std::thread`
#include <functional> // `std::function`
#include <vector> // `std::vector`
#include <numeric> // `std::iota`

Those are some of the most common includes in C++ projects, also the ones I like the least. Here are a couple of reasons to hate them, taken from the "C++ Compile Health Watchdog":

name compilation time lines of code binary size
<functional> 82 .. 228 ms 12.9 .. 27.4 kLoC 0 .. 141 kB
<vector> 32 .. 48 ms 7.1 .. 8.0 kLoC 0 .. 8.2 kB
<numeric> 7 .. 13 ms 1.6 .. 2.1 kLoC 0 .. 3.3 kB
<thread> 110 .. 189 ms 17.5 .. 20.3 kLoC 0 .. 153 kB

Improved Java, C#, and Swift SDK

Thanks to @mccullocht, the Java SDK has been extended with get, loadFromPath, and viewFromPath APIs.

Usage examples to come :)

Other SDK improvements include:

Reduced Memory Consumption for DBMS-like Users

Most databases using USearch, would prefer to have a smaller index at the cost of some functionality. As mentioned by @rschu1ze, if enable_key_lookups is disabled, and some external DBMS is responsible for "key to vector" mappings, the memory consumption of the index_dense_gt can be further reduced.

Improving Stability

rschu1ze commented 1 month ago

@ashvardanian I am happy to see this PR getting merged but it is getting bigger and bigger :-) Question: could we get the stability fixes + the reduced mem consumption commits merged as separate PRs into Usearch's main branch? None of them seem to be breaking, and it would be nice to accelerate things a bit.

ashvardanian commented 1 month ago

Agreed @rschu1ze, will merge soon.

ashvardanian commented 1 month ago

@rschu1ze, there is a weird behavior on very large queries, where the distance to matches isn't monotonically decreasing in some cases. The new extended tests catch it, but I am not sure if that issue was present in the past. I'd really like at least that one issue to be resolved before merging.

Everyone is welcome to join the bug hunt 🤗

Ngalstyan4 commented 1 month ago

422 might be related. It had caused similar symptoms in lantern - distances would not be monotonically decreasing in index search results.

This was caught in the development branch of lantern, once we added quantization support and tests:

-- create an 8-bit quantized index
CREATE INDEX ind8 ON sift_base1k USING lantern_hnsw (v_scaled) WITH (dim=128, M=8, quant_bits=8);
-- look for vectors closest to the vector with id 42
SELECT id, v_scaled <-> :'v_scaled42' as dist FROM sift_base1k ORDER BY 2 LIMIT 10;

Below is the diff with the bug fix

diff -u /lantern_shared/test/expected/hnsw_sq.out /tmp/lantern/tmp_output/results/hnsw_sq.out
--- /lantern_shared/test/expected/hnsw_sq.out   2024-05-24 04:43:03.172237129 +0000
+++ /tmp/lantern/tmp_output/results/hnsw_sq.out 2024-05-25 03:18:11.796896512 +0000
@@ -106,15 +106,15 @@
  id  |   dist
 -----+-----------
   42 |         0
- 285 |   16.7111
- 261 |   16.0935
- 195 | 17.061296
-  48 | 5.1038003
-  50 | 11.509201
   36 |     1.053
-  46 | 10.790699
- 216 | 18.992302
+  48 | 5.1038003
   39 |  5.626501
+ 886 |  7.163699
+ 402 | 7.7013006
+ 518 |  8.502399
+ 331 |  8.779598
+ 340 |  9.726098
+  46 | 10.790699
 (10 rows)
ashvardanian commented 1 month ago

Thanks for the patch, @Ngalstyan4! I've merged it, but it doesn't solve the issue we have with tests right now. We still have a failing assertion at cpp/test.cpp:426 where we check that the output distances are monotonically increasing for long-tail queries, like getting top-500 closest neighbors. Help always appreciated 🤗