qdrant / fastembed

Fast, Accurate, Lightweight Python library to make State of the Art Embedding
https://qdrant.github.io/fastembed/
Apache License 2.0

Generalized Thresholding for ColBERT Scores Across Datasets #383

Open FaisalAliShah opened 1 day ago

FaisalAliShah commented 1 day ago

I’m currently working with ColBERT for document re-ranking and facing challenges in applying a generalized threshold to ColBERT scores across different datasets. Due to the variability in score ranges, it’s difficult to set a fixed threshold for relevance filtering. Unlike typical embedding similarity scores, ColBERT’s late interaction mechanism produces scores that can vary significantly based on query length, token distributions, and dataset characteristics.
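To make the scaling issue concrete, here is a minimal sketch of MaxSim late-interaction scoring (toy unit vectors, not fastembed output): because the score is a sum over query tokens, its scale grows directly with query length.

import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """ColBERT-style late interaction: for each query token embedding, take
    the maximum similarity over all document token embeddings, then sum."""
    sims = query_tokens @ doc_tokens.T    # (n_query_tokens, n_doc_tokens)
    return float(sims.max(axis=1).sum())  # sum of per-query-token MaxSim

doc = np.eye(4)        # four toy unit-norm document token embeddings
query = np.eye(4)[:2]  # two query tokens
print(maxsim_score(query, doc))                      # 2.0
print(maxsim_score(np.vstack([query, query]), doc))  # 4.0 -- same document, doubled query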

I tried min-max normalization on the scores returned for a particular query, but it turned out that even an irrelevant search would still return results, because I was selecting min_score and max_score from that query's own responses.
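For illustration, this is the per-query normalization I used; because min_score and max_score come from the same response, the top hit always maps to 1.0 even when nothing is relevant, so a fixed cutoff on the normalized scores filters out nothing.

def minmax_normalize(scores: list[float]) -> list[float]:
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

# Scores from an irrelevant query still span the full [0, 1] range:
print(minmax_normalize([3.1, 2.9, 2.8]))  # ~[1.0, 0.33, 0.0]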

Here are some of the approaches I’ve considered, but each has limitations when applied generally:

- Normalizing scores by query length or token count
- Rescaling scores based on observed min-max values in different datasets
- Z-score normalization based on empirical mean and variance across datasets
- Using adaptive thresholds or lightweight classifiers to predict relevance

However, each approach tends to be dataset-specific, and I would like a solution that generalizes effectively across datasets (minimal sketches of two of these follow below). Do you have any recommended strategies for achieving a more standardized scoring range or threshold? Alternatively, is there any built-in functionality planned (or that I might have missed) for scaling or calibrating ColBERT scores in a more generalizable way?
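For concreteness, here are minimal sketches of the first and third options above; mu and sigma would have to be estimated offline from a held-out sample of the target corpus, which is exactly the dataset-specific step I would like to avoid.

import numpy as np

def length_normalized(raw_score: float, n_query_tokens: int) -> float:
    """Mean MaxSim per query token rather than the raw sum, removing the
    direct dependence of the score's scale on query length."""
    return raw_score / max(n_query_tokens, 1)

def zscore_calibrated(scores: np.ndarray, mu: float, sigma: float) -> np.ndarray:
    """Z-score against mean/std estimated offline from sampled
    (query, document) scores on the target corpus."""
    return (scores - mu) / max(sigma, 1e-9)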

Any guidance or suggestions would be greatly appreciated! I have attached my code snippet below showing how I am using it.


# Snippet from inside my search method; `models` is `from qdrant_client import models`.
prefetch = [
    models.Prefetch(
        query=dense_embedding,
        using=dense_vector_name,
        limit=20,
    ),
    models.Prefetch(
        query=sparse_embedding,
        using=sparse_vector_name,
        limit=20,
    ),
]

# Hybrid mode prefetches dense + sparse candidates and re-ranks them with ColBERT;
# dense mode queries the dense vectors directly.
search_results = self.qdrant_client.query_points(
    collection_name=kwargs["collection_name"],
    prefetch=prefetch if Config.RETRIEVAL_MODE == QdrantSearchEnums.HYBRID.value else None,
    query=dense_embedding if Config.RETRIEVAL_MODE == QdrantSearchEnums.DENSE.value else colbert_embedding,
    using=dense_vector_name if Config.RETRIEVAL_MODE == QdrantSearchEnums.DENSE.value else colbert_vector_name,
    with_payload=True,
    limit=10,
    # score_threshold=17,  # a fixed value like this does not transfer across datasets
).points
return search_results
joein commented 1 day ago

Hi @FaisalAliShah

In general, we don't use thresholds, since it is indeed hard to select a proper one: it can differ significantly not only across datasets, but even within a single dataset. This question is also more model-specific than tool-specific (qdrant / fastembed), so I think there is a higher chance of finding the answer you're looking for in ColBERT's repository.
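As an illustration only (top_k is an arbitrary placeholder, not a fastembed recommendation), relying on rank rather than an absolute score with the search_results from your snippet would look like:

# Keep the k best re-ranked hits instead of calibrating a score threshold.
top_k = 5  # arbitrary placeholder; tune per application
filtered = sorted(search_results, key=lambda p: p.score, reverse=True)[:top_k]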