unum-cloud / usearch

Fast Open-Source Search & Clustering engine × for Vectors & 🔜 Strings × in C++, C, Python, JavaScript, Rust, Java, Objective-C, Swift, C#, GoLang, and Wolfram 🔍
https://unum-cloud.github.io/usearch/
Apache License 2.0

Bug: Error on adding to a saved index. #507

Closed — GraphicalDot closed this issue 1 month ago

GraphicalDot commented 1 month ago

Describe the bug

While using the Rust crate for USearch:

I created an index, added multiple vectors to it, and saved it to disk. When the user later requested that more code chunks be indexed, I loaded the index from disk and attempted to add more vectors, but encountered the following error:

"Reserve capacity ahead of insertions!"

Steps to reproduce

#[derive(Debug, Serialize, Deserialize, Clone, Hash)]
pub struct Chunk {
    pub chunk_type: String,
    pub content: String,
    pub start_line: usize,
    pub end_line: usize,
    pub file_path: String,
}

impl Eq for Chunk {}

impl PartialEq for Chunk {
    fn eq(&self, other: &Self) -> bool {
        self.content == other.content
    }
}

#[derive(Debug, Serialize, Deserialize)]
pub struct ChunkWithCompressedData {
    pub chunk: Chunk,
    pub compressed_content: String,
    pub embeddings: Vec<f32>,
    pub chunk_id: u64
}

pub fn add_to_index(session_id: &str, chunks_with_data: Vec<ChunkWithCompressedData>) {
    // Load or create the index
    let mut index = load_or_create_index(session_id);

    // Iterate over the chunks and add each embedding to the index
    for chunk_with_data in chunks_with_data {
        match index.add(chunk_with_data.chunk_id, &chunk_with_data.embeddings) {
            Ok(_) => info!("Added chunk {} to index", chunk_with_data.chunk_id),
            Err(err) => error!(
                "Failed to add embeddings of length {} for chunk ID {}: {:?}",
                chunk_with_data.embeddings.len(),
                chunk_with_data.chunk_id,
                err
            ),
        };
    }

    // Save the index after adding all the embeddings
    if let Err(err) = save_index(&index, session_id) {
        error!("Failed to save the index for session {}: {:?}", session_id, err);
    } else {
        info!("Index successfully saved for session: {}", session_id);
    }
}

fn load_or_create_index(session_id: &str) -> Index {
    let options = IndexOptions {
        dimensions: 384, // necessary for most metric kinds, should match the dimension of embeddings
        metric: MetricKind::Cos, // or ::L2sq, ::Cos ...
        quantization: ScalarKind::F32, // or ::F32, ::F16, ::I8, ::B1x8 ...
        connectivity: 0,
        expansion_add: 0,
        expansion_search: 0,
        multi: false,
    };

    let index: Index = new_index(&options).unwrap();

    let home_directory = dirs::home_dir().unwrap();
    let root_pyano_dir = home_directory.join(".pyano");
    let pyano_data_dir = root_pyano_dir.join("indexes");

    if !pyano_data_dir.exists() {
        fs::create_dir_all(&pyano_data_dir).unwrap();
    }

    let index_name = format!("{}.usearch", session_id);
    let index_path = pyano_data_dir.join(index_name);
    let index_path_str = index_path.display().to_string();

    match index.load(&index_path_str) {
        Ok(_) => {
            info!("Loaded existing index for session: {}", session_id);
        }
        Err(err) => {
            index.reserve(10000000);
            info!("Index load failed for session: {} with error {}", session_id, err);
        }
    };

    index
}

fn save_index(index: &Index, session_id: &str) -> Result<(), String> {
    let home_directory = dirs::home_dir().unwrap();
    let root_pyano_dir = home_directory.join(".pyano");
    let pyano_data_dir = root_pyano_dir.join("indexes");

    let index_name = format!("{}.usearch", session_id);
    let index_path = pyano_data_dir.join(index_name);
    let index_path_str = index_path.display().to_string();

    match index.save(&index_path_str) {
        Ok(_) => {
            info!("Index successfully saved for session: {}", session_id);
            Ok(())
        }
        Err(err) => Err(format!(
            "Failed to save the index for session {}: {:?}",
            session_id, err
        )),
    }
}

The embeddings are generated with the all-MiniLM-L6-v2 sentence-embedding model and have 384 dimensions. After generating the embeddings, I build and save the index with the function defined above: add_to_index(session_id, chunks_with_compressed_data);

When I call this function again with the same session_id, it loads the saved index and tries to add embeddings to it, which is when I get this error:

"Reserve capacity ahead of insertions!"

Expected behavior

Adding embeddings to the saved index should have worked like a charm.

USearch version

2.15.3

Operating System

macOS

Hardware architecture

Arm

Which interface are you using?

Other bindings

Contact Details

houzier.saurav@gmail.com


Q3g commented 1 month ago

It seems that capacity information is not stored in the file, so after loading, the capacity always equals the count. As a result, an additional reserve is required before insertion, which is the issue I'm encountering. In my scenario, this introduces extra performance overhead. If there is a solution, please let me know.
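The behavior Q3g describes can be illustrated with a small toy model: after `load`, capacity equals count, so the very next `add` fails until `reserve` grows the index. This is a sketch of the reported semantics only, not USearch's actual implementation — `ToyIndex` and its methods are hypothetical stand-ins for the real `Index` API:

```rust
// Toy model of the reported capacity semantics; not the usearch implementation.
struct ToyIndex {
    count: usize,
    capacity: usize,
}

impl ToyIndex {
    /// Mimic `Index::load`: the file stores only the elements, so the
    /// freshly loaded index has no spare capacity.
    fn load(stored_count: usize) -> Self {
        ToyIndex { count: stored_count, capacity: stored_count }
    }

    /// Mimic `Index::reserve`: grow to the requested total capacity.
    fn reserve(&mut self, capacity: usize) {
        self.capacity = self.capacity.max(capacity);
    }

    /// Mimic `Index::add`: fail when there is no room left.
    fn add(&mut self) -> Result<(), &'static str> {
        if self.count == self.capacity {
            return Err("Reserve capacity ahead of insertions!");
        }
        self.count += 1;
        Ok(())
    }
}

fn main() {
    let mut index = ToyIndex::load(3);
    assert!(index.add().is_err()); // fails immediately after load
    index.reserve(index.count + 10); // reserve before inserting
    assert!(index.add().is_ok()); // now insertion succeeds
}
```

The takeaway is the ordering: load first, then reserve to at least the current size plus the incoming batch, then add.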

GraphicalDot commented 1 month ago

I solved this issue by reserving the capacity again after loading the index.

fn load_or_create_index(session_id: &str) -> Index {
    let options = IndexOptions {
        dimensions: 384, // necessary for most metric kinds, should match the dimension of embeddings
        metric: MetricKind::Cos, // or ::L2sq, ::Cos ...
        quantization: ScalarKind::F32, // or ::F32, ::F16, ::I8, ::B1x8 ...
        connectivity: 0,
        expansion_add: 0,
        expansion_search: 0,
        multi: false,
    };

    let index: Index = new_index(&options).unwrap();

    let home_directory = dirs::home_dir().unwrap();
    let root_pyano_dir = home_directory.join(".pyano");
    let pyano_data_dir = root_pyano_dir.join("indexes");

    if !pyano_data_dir.exists() {
        fs::create_dir_all(&pyano_data_dir).unwrap();
    }

    let index_name = format!("{}.usearch", session_id);
    let index_path = pyano_data_dir.join(index_name);
    let index_path_str = index_path.display().to_string();

    match index.load(&index_path_str) {
        Ok(_) => {
            info!("Loaded existing index for session: {}", session_id);
        }
        Err(err) => {
            info!("Index load failed for session: {} with error {}", session_id, err);
        }
    };
    index
        .reserve(10_000_000)
        .expect("failed to reserve index capacity"); // reserve returns a Result; don't silently drop it
    index
}

The third-to-last line reserves the capacity again after loading the index. You were right, @Q3g. Thanks a ton!
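Reserving a fixed 10,000,000 slots works, but it pre-allocates graph memory for far more vectors than most sessions will hold. An alternative is to size the reserve from the current index size plus the incoming batch. The `sized_reserve` helper and its ~25% headroom policy below are illustrative assumptions, not part of the USearch API:

```rust
/// Illustrative sizing policy: the total capacity to request before
/// inserting `batch` new vectors into an index that already holds
/// `current` vectors. The ~25% headroom is an arbitrary choice so that
/// consecutive small batches do not each trigger another reserve.
fn sized_reserve(current: usize, batch: usize) -> usize {
    let exact = current + batch;
    exact + exact / 4
}

fn main() {
    // e.g. an index holding 1_000 vectors, about to receive 200 more:
    println!("{}", sized_reserve(1_000, 200)); // prints 1500
}
```

At the call site this might look like `index.reserve(sized_reserve(index.size(), chunks_with_data.len()))`, assuming the binding's `size()` accessor that the USearch Rust examples use.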