Problem
In the current implementation we use samplers to calculate evaluation metrics on a small subset of the dataset. This can give slightly different scores due to the random state in sampling. It is always possible to seed the RNGs for reproducible results, but we might be extremely lucky or extremely unlucky with the chosen seed. It is still fair to compare different checkpoints with seeded evaluators, but we cannot be sure whether we are overestimating or underestimating the performance of all the checkpoints.
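To make the problem concrete, here is a minimal sketch of the effect (the per-item scores and the `evaluate_on_sample` helper are hypothetical, not part of the library): the same fixed set of scores yields a different estimate for every sampler seed, and nothing tells us which side of the true metric each estimate falls on.

```python
import random

def evaluate_on_sample(scores, sample_size, seed):
    """Estimate the mean metric from a random subset of per-item scores."""
    rng = random.Random(seed)
    sample = rng.sample(scores, sample_size)
    return sum(sample) / len(sample)

# Per-item metric values for one fixed checkpoint (hypothetical data).
rng = random.Random(0)
scores = [rng.random() for _ in range(10_000)]

true_metric = sum(scores) / len(scores)
print(f"full-dataset metric: {true_metric:.4f}")
for seed in (1, 2, 3):
    estimate = evaluate_on_sample(scores, sample_size=100, seed=seed)
    print(f"seed {seed}: {estimate:.4f}")
# Each seed gives a different, reproducible estimate; a single seeded run
# cannot tell us whether it over- or underestimates the true metric.
```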
Possible solution
Add an option to run multiple passes over the data and report the mean and standard deviation across passes (see the first sketch below), or
Accept an optional QdrantClient and, if it is not None, use Qdrant as the backend to store embeddings in and retrieve from (see the second sketch below).
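A minimal sketch of the first option. The `evaluate(model, dataset, seed=...)` signature and the `_FakeEvaluator` stand-in are assumptions for illustration; the real evaluator API may differ.

```python
import random
import statistics

class _FakeEvaluator:
    """Stand-in for the library's sampled evaluator (hypothetical API)."""
    def evaluate(self, model, dataset, seed):
        # Simulates a noisy sampled metric around a true value of 0.80.
        return random.Random(seed).gauss(0.80, 0.02)

def evaluate_multi_pass(evaluator, model, dataset, n_passes=5):
    """Run the sampled evaluation once per seed and aggregate the scores."""
    results = [evaluator.evaluate(model, dataset, seed=i) for i in range(n_passes)]
    return statistics.mean(results), statistics.stdev(results)

mean, std = evaluate_multi_pass(_FakeEvaluator(), model=None, dataset=None, n_passes=10)
print(f"metric = {mean:.4f} ± {std:.4f}")
```

Reporting mean ± std quantifies how much the sampler alone moves the metric, at the cost of running the evaluation several times.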
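A minimal sketch of the second option. The `build_retriever` helper and the "evaluation" collection name are assumptions; the qdrant-client calls themselves (`recreate_collection`, `upsert`, `search`) are real, though their exact preferred forms may vary across client versions.

```python
from typing import Optional
from qdrant_client import QdrantClient, models

def build_retriever(embeddings, client: Optional[QdrantClient] = None):
    """If a client is given, store all embeddings in Qdrant and retrieve
    exhaustively; otherwise signal a fallback to the sampler-based path."""
    if client is None:
        return None  # caller keeps using the existing sampled evaluation
    client.recreate_collection(
        collection_name="evaluation",  # hypothetical collection name
        vectors_config=models.VectorParams(
            size=len(embeddings[0]), distance=models.Distance.COSINE
        ),
    )
    client.upsert(
        collection_name="evaluation",
        points=[
            models.PointStruct(id=i, vector=vec)
            for i, vec in enumerate(embeddings)
        ],
    )

    def retrieve(query_vector, top_k=10):
        hits = client.search(
            collection_name="evaluation", query_vector=query_vector, limit=top_k
        )
        return [hit.id for hit in hits]

    return retrieve

# Usage example with the client's local in-memory mode (no server needed):
client = QdrantClient(":memory:")
retrieve = build_retriever([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8]], client)
print(retrieve([0.1, 0.2], top_k=2))
```

Since all embeddings are stored and retrieval is exhaustive over the collection, this path removes the sampling noise entirely rather than just averaging it out.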