Add Cohere/wikipedia-22-12-en-embeddings dataset for vector search

mikemccand / luceneutil

Various utility scripts for running Lucene performance tests

Apache License 2.0

203 stars 114 forks source link

Add Cohere/wikipedia-22-12-en-embeddings dataset for vector search #255

Closed mayya-sharipova closed 8 months ago

mayya-sharipova commented 8 months ago

This dataset contains vectors of higher dimensions of 768. This dataset contains a pre-processed version from Wikipedia suitable for semantic search.

The problem with current datasets generated by infer_token_vectors.py file is that they are generated for single tokens out of context. Thus, they don't represent the true embeddings, and hence the produced recall of vector search on them is not good.

Cohere/wikipedia-22-12-en-embeddings dataset represent true embeddings and we can show very good recall of Lucene vector search on them.

mayya-sharipova commented 8 months ago

For comparison,

Running vector search ( k=10, fanout=90) on a single merged segment:

the existing embedded 10M vectors dataset of 768 dims, produces a low recall of 0.542
cohere 10M vectors dataset of 768 dims, produces recall of 0.929

msokolov commented 8 months ago

that's a nice result! Yeah the method behind creating these test vector datasets was always bothersome.

mikemccand commented 8 months ago

+1, thank you @mayya-sharipova -- should we maybe switch over the nightly benchy to these vetors too?

mayya-sharipova commented 8 months ago

@msokolov Thanks for checking.

@mikemccand Yes, it would be nice to switch the nightly benchmarks to those as well, as I assume the graphs they produce are different from the current embeddings.

mikemccand commented 8 months ago

@mikemccand Yes, it would be nice to switch the nightly benchmarks to those as well, as I assume the graphs they produce are different from the current embeddings.

Super -- I'll open spinoff issue.