Problem
In the current implementation we use samplers to calculate evaluation metrics on a small subset of the dataset. This can give slightly different scores due to the random state in sampling. It is always possible to seed the RNGs for reproducible results, but we might be extremely lucky or extremely unlucky with the chosen seed. It is still fair to compare different checkpoints with seeded evaluators, but we cannot be sure whether we are overestimating or underestimating the performance of all the checkpoints.
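To make the problem concrete, here is a minimal sketch of the effect (the per-item scores and the `evaluate_on_sample` helper are hypothetical, not part of the library): the same fixed set of scores yields a different estimate for every sampler seed, and nothing tells us which side of the true metric each estimate falls on.

```python
import random

def evaluate_on_sample(scores, sample_size, seed):
    """Estimate the mean metric from a random subset of per-item scores."""
    rng = random.Random(seed)
    sample = rng.sample(scores, sample_size)
    return sum(sample) / len(sample)

# Per-item metric values for one fixed checkpoint (hypothetical data).
rng = random.Random(0)
scores = [rng.random() for _ in range(10_000)]

true_metric = sum(scores) / len(scores)
print(f"full-dataset metric: {true_metric:.4f}")
for seed in (1, 2, 3):
    estimate = evaluate_on_sample(scores, sample_size=100, seed=seed)
    print(f"seed {seed}: {estimate:.4f}")
# Each seed gives a different, reproducible estimate; a single seeded run
# cannot tell us whether it over- or underestimates the true metric.
```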
Possible solution
Add an option to run multiple passes over the data and report the mean and standard deviation across passes (see the first sketch below), or
Accept an optional QdrantClient and, if it is not None, use Qdrant as the backend to store embeddings in and retrieve from (see the second sketch below).
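A minimal sketch of the first option. The `evaluate(model, dataset, seed=...)` signature and the `_FakeEvaluator` stand-in are assumptions for illustration; the real evaluator API may differ.

```python
import random
import statistics

class _FakeEvaluator:
    """Stand-in for the library's sampled evaluator (hypothetical API)."""
    def evaluate(self, model, dataset, seed):
        # Simulates a noisy sampled metric around a true value of 0.80.
        return random.Random(seed).gauss(0.80, 0.02)

def evaluate_multi_pass(evaluator, model, dataset, n_passes=5):
    """Run the sampled evaluation once per seed and aggregate the scores."""
    results = [evaluator.evaluate(model, dataset, seed=i) for i in range(n_passes)]
    return statistics.mean(results), statistics.stdev(results)

mean, std = evaluate_multi_pass(_FakeEvaluator(), model=None, dataset=None, n_passes=10)
print(f"metric = {mean:.4f} ± {std:.4f}")
```

Reporting mean ± std quantifies how much the sampler alone moves the metric, at the cost of running the evaluation several times.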
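A minimal sketch of the second option. The `build_retriever` helper and the "evaluation" collection name are assumptions; the qdrant-client calls themselves (`recreate_collection`, `upsert`, `search`) are real, though their exact preferred forms may vary across client versions.

```python
from typing import Optional
from qdrant_client import QdrantClient, models

def build_retriever(embeddings, client: Optional[QdrantClient] = None):
    """If a client is given, store all embeddings in Qdrant and retrieve
    exhaustively; otherwise signal a fallback to the sampler-based path."""
    if client is None:
        return None  # caller keeps using the existing sampled evaluation
    client.recreate_collection(
        collection_name="evaluation",  # hypothetical collection name
        vectors_config=models.VectorParams(
            size=len(embeddings[0]), distance=models.Distance.COSINE
        ),
    )
    client.upsert(
        collection_name="evaluation",
        points=[
            models.PointStruct(id=i, vector=vec)
            for i, vec in enumerate(embeddings)
        ],
    )

    def retrieve(query_vector, top_k=10):
        hits = client.search(
            collection_name="evaluation", query_vector=query_vector, limit=top_k
        )
        return [hit.id for hit in hits]

    return retrieve

# Usage example with the client's local in-memory mode (no server needed):
client = QdrantClient(":memory:")
retrieve = build_retriever([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8]], client)
print(retrieve([0.1, 0.2], top_k=2))
```

Since all embeddings are stored and retrieval is exhaustive over the collection, this path removes the sampling noise entirely rather than just averaging it out.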