FaisalAliShah opened 1 day ago
Hi @FaisalAliShah
In general, we don't practice thresholding, since it is indeed hard to select a proper threshold: it can differ significantly not only across datasets, but even within a single dataset. This question is also more model-specific than tool-specific (qdrant / fastembed), so I think there is a higher chance of finding the answer you're looking for in ColBERT's repository.
I’m currently working with ColBERT for document re-ranking and facing challenges in applying a generalized threshold to ColBERT scores across different datasets. Due to the variability in score ranges, it’s difficult to set a fixed threshold for relevance filtering. Unlike typical embedding similarity scores, ColBERT’s late interaction mechanism produces scores that can vary significantly based on query length, token distributions, and dataset characteristics.
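For context, here is a minimal sketch of the MaxSim-style late interaction scoring I mean, using plain numpy with random unit vectors standing in for real token embeddings (this is not my actual pipeline, just an illustration of why the score grows with query length):

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """query_emb: (n_query_tokens, dim), doc_emb: (n_doc_tokens, dim).
    For each query token, take its best-matching document token and sum."""
    sim = query_emb @ doc_emb.T            # (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())    # sum of per-query-token maxima

rng = np.random.default_rng(0)

def random_unit_vectors(n: int, dim: int = 128) -> np.ndarray:
    v = rng.normal(size=(n, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

doc = random_unit_vectors(200)
short_query = random_unit_vectors(8)    # e.g. a short keyword query
long_query = random_unit_vectors(32)    # e.g. a long natural-language query

# The score is a sum over query tokens, so it grows with query length even
# for unrelated content -- which is why a single fixed threshold breaks down.
print(maxsim_score(short_query, doc), maxsim_score(long_query, doc))
```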
I tried min-max normalization on the scores returned for a particular query, but it turns out that even when the search is irrelevant it still surfaces results, because I was selecting min_score and max_score from that query's own responses.
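To make the failure mode concrete, here is a toy example (the score values are hypothetical) of what I was doing: normalizing within a single query's result list always maps the best hit to 1.0, even when every raw score is low.

```python
import numpy as np

def minmax_normalize(scores: np.ndarray) -> np.ndarray:
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo + 1e-9)

relevant_query_scores = np.array([31.2, 28.7, 27.9, 14.1])   # hypothetical raw MaxSim scores
irrelevant_query_scores = np.array([6.3, 6.1, 5.8, 5.2])     # nothing actually matches

print(minmax_normalize(relevant_query_scores))    # top hit -> 1.0
print(minmax_normalize(irrelevant_query_scores))  # top hit -> 1.0 as well
```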
Here are some of the approaches I’ve considered, but each has limitations when applied generally:
- Normalizing scores by query length or token count
- Rescaling scores based on observed min-max values in different datasets
- Z-score normalization based on empirical mean and variance across datasets
- Using adaptive thresholds or lightweight classifiers to predict relevance

However, each approach tends to be dataset-specific, and I would like a solution that can generalize effectively across datasets. Do you have any recommended strategies for achieving a more standardized scoring range or threshold? Alternatively, is there any built-in functionality planned (or that I might have missed) for scaling or calibrating ColBERT scores in a more generalizable way?
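For reference, rough sketches of the first and third ideas above (length normalization and z-scoring), assuming unit-normalized token embeddings so each per-token MaxSim contribution lies in [-1, 1]; the caveat is that the statistics for z-scoring still have to be estimated somewhere, which is exactly the dataset-specific part:

```python
def length_normalized_score(raw_score: float, n_query_tokens: int) -> float:
    # Dividing by the number of query tokens bounds the score roughly to [-1, 1]
    # (given unit-normalized token vectors), making scores comparable across
    # query lengths, but it still does not give an absolute notion of relevance.
    return raw_score / max(n_query_tokens, 1)

def zscore_calibrate(score: float, mean: float, std: float) -> float:
    # mean and std would have to be estimated empirically, e.g. from a background
    # sample of queries against the collection, which does not generalize out of the box.
    return (score - mean) / max(std, 1e-9)
```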
Any guidance or suggestions would be greatly appreciated! I have attached my code snippet below to show how I am using it.