unum-cloud / usearch

Fast Open-Source Search & Clustering engine × for Vectors & 🔜 Strings × in C++, C, Python, JavaScript, Rust, Java, Objective-C, Swift, C#, GoLang, and Wolfram 🔍
https://unum-cloud.github.io/usearch/
Apache License 2.0
2.12k stars 126 forks

Feature: support all data types listed in scalar_kind_t in index_dense #469

Open mbautin opened 3 weeks ago

mbautin commented 3 weeks ago

Describe what you are looking for

index_dense seems to only support these data types:

    add_result_t add(vector_key_t key, b1x8_t const* vector, std::size_t thread = any_thread(), bool force_vector_copy = true) { return add_(key, vector, thread, force_vector_copy, casts_.from_b1x8); }
    add_result_t add(vector_key_t key, i8_t const* vector, std::size_t thread = any_thread(), bool force_vector_copy = true) { return add_(key, vector, thread, force_vector_copy, casts_.from_i8); }
    add_result_t add(vector_key_t key, f16_t const* vector, std::size_t thread = any_thread(), bool force_vector_copy = true) { return add_(key, vector, thread, force_vector_copy, casts_.from_f16); }
    add_result_t add(vector_key_t key, f32_t const* vector, std::size_t thread = any_thread(), bool force_vector_copy = true) { return add_(key, vector, thread, force_vector_copy, casts_.from_f32); }
    add_result_t add(vector_key_t key, f64_t const* vector, std::size_t thread = any_thread(), bool force_vector_copy = true) { return add_(key, vector, thread, force_vector_copy, casts_.from_f64); }

The casts are instantiated only for these 5 types as well.

scalar_kind_t lists many more than that:

    enum class scalar_kind_t : std::uint8_t {
        unknown_k = 0,
        // Custom:
        b1x8_k = 1,
        u40_k = 2,
        uuid_k = 3,
        // Common:
        f64_k = 10,
        f32_k = 11,
        f16_k = 12,
        f8_k = 13,
        // Common Integral:
        u64_k = 14,
        u32_k = 15,
        u16_k = 16,
        u8_k = 17,
        i64_k = 20,
        i32_k = 21,
        i16_k = 22,
        i8_k = 23,
    };

In particular, the SIFT1B dataset from http://corpus-texmex.irisa.fr/ seems to require unsigned char, and building a float-based index for that dataset seems wasteful.
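
To put a rough number on "wasteful", here is a back-of-the-envelope sketch. It assumes the SIFT1B base set holds 1 billion vectors of 128 dimensions each (please double-check against the dataset page), and it counts only raw vector storage, not graph links or other index overhead. The names `raw_vectors_bytes` and `widening_factor` are made up for this illustration and are not part of usearch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bytes needed to store `count` vectors of `dimensions` scalars each,
// ignoring graph links and any other index overhead.
constexpr std::uint64_t raw_vectors_bytes(std::uint64_t count, std::uint64_t dimensions,
                                          std::uint64_t bytes_per_scalar) {
    return count * dimensions * bytes_per_scalar;
}

// Assumed dataset shape (not taken from the issue): 1 billion vectors, 128 dimensions.
constexpr std::uint64_t sift1b_count = 1000000000ull;
constexpr std::uint64_t sift1b_dimensions = 128;

constexpr std::uint64_t sift1b_as_u8 =
    raw_vectors_bytes(sift1b_count, sift1b_dimensions, sizeof(std::uint8_t));
constexpr std::uint64_t sift1b_as_f32 =
    raw_vectors_bytes(sift1b_count, sift1b_dimensions, sizeof(float));

// How many times larger the wide representation is than the narrow one.
inline std::uint64_t widening_factor(std::uint64_t narrow_bytes, std::uint64_t wide_bytes) {
    assert(narrow_bytes != 0 && "narrow representation must not be empty");
    return wide_bytes / narrow_bytes;
}
```

Under those assumptions the raw vectors take about 128 GB as 8-bit integers versus about 512 GB as 32-bit floats, so the float-based index pays roughly 4x for the same data.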

If there is a good reason to only support those 5 data types, some documentation of the rationale would be great.

Is your feature request specific to a certain interface?

C++ implementation

Contact Details

mbautin@gmail.com

ashvardanian commented 2 weeks ago

Some of those do make sense, @mbautin! Implementing all of them sounds like code bloat. How about adding only u8 and bf16 to casts_punned_t?

I've just pushed a commit that refactors the code logic, making it easier to extend the API. Feel free to open PRs 🤗
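
For anyone picking this up, here is a minimal standalone sketch of what a pair of u8 casts could look like. It is not taken from the usearch sources: the `byte_t` alias, the function names `cast_u8_to_f32` and `cast_f32_to_u8`, and the type-erased `(input, dim, output)` signature are all assumptions made for illustration, so check them against the actual members of casts_punned_t before adapting anything.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

using byte_t = char; // assumed type-erased byte type, hypothetical

// Hypothetical cast: widen `dim` unsigned 8-bit scalars into 32-bit floats.
// Returns true to signal that a conversion took place.
inline bool cast_u8_to_f32(byte_t const* input, std::size_t dim, byte_t* output) {
    assert(input != nullptr && output != nullptr);
    for (std::size_t i = 0; i != dim; ++i) {
        std::uint8_t scalar;
        std::memcpy(&scalar, input + i, sizeof(scalar));
        float widened = static_cast<float>(scalar);
        std::memcpy(output + i * sizeof(float), &widened, sizeof(widened));
    }
    return true;
}

// Hypothetical cast: narrow 32-bit floats into unsigned 8-bit scalars,
// clamping to [0, 255] and rounding to nearest. NaN maps to 0.
inline bool cast_f32_to_u8(byte_t const* input, std::size_t dim, byte_t* output) {
    assert(input != nullptr && output != nullptr);
    for (std::size_t i = 0; i != dim; ++i) {
        float scalar;
        std::memcpy(&scalar, input + i * sizeof(float), sizeof(scalar));
        // The comparison order sends NaN down the "0" branch.
        float clamped = scalar > 0.f ? (scalar < 255.f ? scalar : 255.f) : 0.f;
        std::uint8_t narrowed = static_cast<std::uint8_t>(clamped + 0.5f);
        std::memcpy(output + i, &narrowed, sizeof(narrowed));
    }
    return true;
}
```

The narrowing direction is lossy, so whether clamping and rounding like this is acceptable depends on the metric and on how the stored vectors are compared at search time.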