bug: SimilaritySearch with scoreThreshold does not work with PgVector.

AnthonyDasse commented 3 months ago

Hello.

I think there is an issue with the SimilaritySearch function when the scoreThreshold option is used.

When I call SimilaritySearch with PgVector and set the 'scoreThreshold' option to 0.80, no documents are returned. If I remove the 'scoreThreshold' option, many documents are returned with a score greater than 0.80.

According to my research and the GitHub issues of the langchain library, the problem comes from a confusion between distance strategies.

See the issues:

I think the error is around this part of the code: langchaingo/vectorstores/pgvector

if scoreThreshold != 0 {
    whereQuerys = append(whereQuerys, fmt.Sprintf("data.distance < %f", 1-scoreThreshold))
}

If we look at the pgvector documentation (https://github.com/pgvector/pgvector), the SQL query should look like this:

SELECT 1 - (embedding <=> '[3,1,2]') AS cosine_similarity FROM items;

chew-z commented 2 months ago

No one answered you so I will try...

It seems to me that langchaingo can't handle newer embedding models. It depends on github.com/pkoukk/tiktoken-go v0.1.6, which is good only for the older text-embedding-ada-002 embeddings. Newer embedding models are handled since v0.1.7... You can't use text-embedding-3-large by definition, and results with text-embedding-3-small seem suspicious to me.

That's only my experience. I should have investigated further but ...

Since vector search is so critical, I would not recommend depending on langchaingo but writing this yourself. It is really just one database function and one small Go function after all.

chew-z commented 2 months ago

CREATE OR REPLACE FUNCTION public.find_closest_vector(input_vector VECTOR(1536), limit_results INT, filename VARCHAR, collection_name VARCHAR)
RETURNS TABLE(
    doc VARCHAR,
    similarity DOUBLE PRECISION
) AS $$
BEGIN
    RETURN QUERY
    SELECT
        c.document,
        (1 - (c.embedding <=> input_vector)) AS similarity
    FROM
        langchain_pg_embedding c
    INNER JOIN langchain_pg_collection col ON c.collection_id = col.uuid
    WHERE
        col.name = collection_name AND
        c.cmetadata ->> 'filename' = filename
    ORDER BY
        c.embedding <=> input_vector -- This operator calculates the cosine distance
    LIMIT limit_results;
END;
$$ LANGUAGE plpgsql;

import (
    "context"
    "log"

    "github.com/jackc/pgx/v5"
    "github.com/jackc/pgx/v5/pgtype"
    "github.com/jackc/pgx/v5/pgxpool"
    "github.com/pgvector/pgvector-go"
)

// VectorSearchResult holds one row returned by public.find_closest_vector.
type VectorSearchResult struct {
    Document   pgtype.Text   `db:"doc"`
    Similarity pgtype.Float8 `db:"similarity"`
}

// VectorSearch queries the database for the closest vectors to the given vector, with the specified limit and filename.
// It returns a slice of VectorSearchResult structs, or nil if the query fails.
// PGCOLLECTION (the collection name) is expected to be defined elsewhere in the program.
func VectorSearch(dbPool *pgxpool.Pool, vector *[]float32, limit int, filename string) []VectorSearchResult {
    ctx := context.Background()

    // Create a new vector using the given vector slice
    v := pgvector.NewVector(*vector)

    // Execute the query using the dbPool and the vector, limit, filename, and collection name as parameters
    rows, err := dbPool.Query(ctx, "SELECT doc, similarity FROM public.find_closest_vector($1, $2, $3, $4)", v, limit, filename, PGCOLLECTION)
    if err != nil {
        log.Println("error while executing query - ", err)
        return nil
    }

    // Collect the rows into a slice of VectorSearchResult structs
    result, err := pgx.CollectRows(rows, pgx.RowToStructByNameLax[VectorSearchResult])
    if err != nil {
        log.Printf("CollectRows error: %v", err)
    }

    // Return the result
    return result
}
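
Because the database function returns a similarity for every row, a score threshold can also be applied in Go after the query. The following is a minimal sketch, assuming a simplified `Result` type with plain fields in place of the pgtype wrappers used above; `Result` and `filterByThreshold` are hypothetical names introduced only for illustration.

```go
package main

import "fmt"

// Result is a simplified, hypothetical stand-in for VectorSearchResult
// that uses plain Go types instead of pgtype wrappers.
type Result struct {
	Document   string
	Similarity float64
}

// filterByThreshold returns the results whose similarity is at least threshold,
// preserving the original order.
func filterByThreshold(results []Result, threshold float64) []Result {
	kept := make([]Result, 0, len(results))
	for _, r := range results {
		if r.Similarity >= threshold {
			kept = append(kept, r)
		}
	}
	return kept
}

func main() {
	results := []Result{
		{Document: "a", Similarity: 0.91},
		{Document: "b", Similarity: 0.83},
		{Document: "c", Similarity: 0.42},
	}
	for _, r := range filterByThreshold(results, 0.80) {
		fmt.Printf("%s %.2f\n", r.Document, r.Similarity)
	}
}
```

Filtering on the client keeps the SQL function unchanged, at the cost of fetching rows that are then discarded; the same condition could instead be expressed in the query itself.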

AnthonyDasse commented 1 month ago

Thank you @chew-z, I will look into that.

tmc / langchaingo

bug: SimilaritySearch with scoreThreshold does not work with PgVector. #974