Closed mayya-sharipova closed 8 months ago
For comparison, running vector search (k=10, fanout=90) on a single merged segment:
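For context on the (k=10, fanout=90) numbers, recall is conventionally measured by letting the graph search gather k + fanout candidates, keeping the top k, and comparing them against a brute-force ground truth. A minimal sketch of that overlap computation (the helper name `recall_at_k` is illustrative, not from Lucene's test tooling):

```python
def recall_at_k(exact_ids, approx_ids, k):
    """Fraction of the true top-k neighbors found by the approximate search."""
    return len(set(exact_ids[:k]) & set(approx_ids[:k])) / k

# Toy example: the approximate search misses one of the ten true neighbors.
exact = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
approx = [0, 1, 2, 3, 4, 5, 6, 7, 8, 42]
print(recall_at_k(exact, approx, k=10))  # 0.9
```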
that's a nice result! Yeah the method behind creating these test vector datasets was always bothersome.
+1, thank you @mayya-sharipova -- should we maybe switch over the nightly benchy to these vectors too?
@msokolov Thanks for checking.
@mikemccand Yes, it would be nice to switch the nightly benchmarks to those as well, as I assume the graphs they produce are different from the current embeddings.
Super -- I'll open a spinoff issue.
This dataset contains higher-dimensional vectors (768 dimensions). It is a pre-processed version of Wikipedia suitable for semantic search.
The problem with the current datasets generated by the infer_token_vectors.py file is that they are generated for single tokens out of context. Thus, they don't represent true embeddings, and the recall of vector search on them is poor. The Cohere/wikipedia-22-12-en-embeddings dataset represents true embeddings, and we can show very good recall of Lucene vector search on it.
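One preprocessing detail worth noting for 768-dimensional embeddings like these: Lucene's DOT_PRODUCT vector similarity expects unit-length vectors, so a common step is to L2-normalize the embeddings before indexing. A sketch with numpy (the random array below is a stand-in for the real Cohere vectors, which would be loaded from the dataset instead):

```python
import numpy as np

# Stand-in batch of 768-dimensional embeddings (random data, not the
# actual Cohere/wikipedia-22-12-en-embeddings vectors).
rng = np.random.default_rng(0)
vectors = rng.normal(size=(5, 768)).astype(np.float32)

# L2-normalize each row so that dot product equals cosine similarity.
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
unit_vectors = vectors / norms

print(np.linalg.norm(unit_vectors, axis=1))  # each norm is ~1.0
```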