Closed vigyasharma closed 2 months ago
Very exciting! I will try to review the code changes soon ... thanks @vigyasharma.
We use the passage search use case with Cohere embeddings created from Wikipedia. Each parent document corresponds to a Wikipedia article, and child documents correspond to paragraphs (chunks) within the article. Embeddings are only present for child documents.
How do I get the source (vectors) file input to run this?
Thanks for the prompt review @mikemccand
I'm very curious where/how I can get the parent/join meta file to try running this myself...
We can use the `src/python/infer_token_vectors_cohere.py` script. We merged a change earlier (#283) that updates the tool to create a metadata file as well.
python src/python/infer_token_vectors_cohere.py -d <num_docs> -q <num_queries>
Resolved conflicts and merge duplication errors. I also like the new output from KnnGraphTester with more graph details.
reindex takes 14.05 sec
Force merge index in knnIndices/cohere-wikipedia-docs-768d.vec-32-50-parentJoin.index
Force merge done in 12.76 sec
index has 1 segments
index disk uage is 295.02 MB
SUMMARY: 0.098 0.725 101323 10 6 32 50 no 9 14.05 12.76 1 295.02 1.00 post-filter
Leaf 0 has 4 layers
Leaf 0 has 101323 documents
Graph level=3 size=6, Fanout min=1, mean=2.67, max=4, meandelta=10062.31
% 0 10 20 30 40 50 60 70 80 90 100
0 1 1 1 2 3 3 3 3 3 4 4
Graph level=2 size=61, Fanout min=1, mean=7.54, max=16, meandelta=7024.34
% 0 10 20 30 40 50 60 70 80 90 100
0 3 5 6 7 7 8 9 10 11 16
Graph level=1 size=2994, Fanout min=1, mean=4.51, max=32, meandelta=5549.65
% 0 10 20 30 40 50 60 70 80 90 100
0 1 1 1 1 1 1 5 9 13 32
Graph level=0 size=100000, Fanout min=1, mean=3.81, max=64, meandelta=3386.53
% 0 10 20 30 40 50 60 70 80 90 100
0 1 1 1 3 3 3 3 3 3 64
Graph level=3 size=6, connectedness=1.00
Graph level=2 size=61, connectedness=1.00
Graph level=1 size=2994, connectedness=1.00
Graph level=0 size=100000, connectedness=0.96
Results:
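| recall | latency (ms) | nDoc | topK | fanout | maxConn | beamWidth | quantized | index s | force merge s | num segments | index size (MB) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.098 | 0.725 | 101323 | 10 | 6 | 32 | 50 | no | 14.05 | 12.76 | 1 | 295.02 |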
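To compare runs programmatically, the values row of the results above can be paired with its column names. This is a minimal sketch, assuming the row is available as the whitespace-separated text printed by the run. `parse_result_row` and `to_number` are hypothetical helpers, not part of luceneutil.

```python
# Column names copied from the "Results:" header. Several names contain
# spaces, so they are listed explicitly rather than split from the header.
COLUMNS = [
    "recall", "latency (ms)", "nDoc", "topK", "fanout", "maxConn",
    "beamWidth", "quantized", "index s", "force merge s",
    "num segments", "index size (MB)",
]


def to_number(value):
    """Convert a token to int or float when possible, else keep the string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value


def parse_result_row(row):
    """Map one whitespace-separated results row onto the column names."""
    values = row.split()
    if len(values) != len(COLUMNS):
        raise ValueError(
            f"expected {len(COLUMNS)} values, got {len(values)}"
        )
    return {name: to_number(v) for name, v in zip(COLUMNS, values)}
```

The explicit length check is there because a column added or removed by a later tool change would otherwise silently shift every value under the wrong name.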
Thanks @vigyasharma -- this is an exciting improvement to KNN benchmarking!
Adds parent join benchmarks for KNN search. We use the passage search use case with Cohere embeddings created from Wikipedia. Each parent document corresponds to a Wikipedia article, and child documents correspond to paragraphs (chunks) within the article. Embeddings are only present for child documents.
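Lucene's block join expects each parent to be indexed together with its children as one contiguous block, with the parent as the last document. The sketch below shows how articles and paragraphs could be arranged that way before indexing. The field names (`docType`, `parent`, `text`, `vector`, `id`) and the `build_blocks` helper are hypothetical and only illustrate the layout; they are not the benchmark's actual schema.

```python
def build_blocks(articles):
    """Arrange articles into parent/child blocks.

    articles: iterable of (article_id, [(paragraph_text, embedding), ...]).
    Returns a list of blocks. Each block lists the child documents first
    and the parent document last. Only children carry a vector.
    """
    blocks = []
    for article_id, paragraphs in articles:
        block = [
            {
                "docType": "_child",
                "parent": article_id,
                "text": text,
                "vector": embedding,
            }
            for text, embedding in paragraphs
        ]
        # The parent closes the block and has no embedding of its own.
        block.append({"docType": "_parent", "id": article_id})
        blocks.append(block)
    return blocks
```

Each block would then be handed to the index writer in a single call so that the children and their parent stay adjacent in the segment.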
This change leverages Lucene's `DiversifyingChildrenFloatKnnVectorQuery`, using `exactSearch()` for baseline, and `approximateSearch()` for knn search. Recall is computed by calculating overlap between the two.

Note: We can use the `infer_token_vectors_cohere.py` script to generate the parentJoin metadata file for Cohere embeddings dataset.
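The overlap-based recall described above can be illustrated with a small brute-force sketch. It rests on several assumptions: similarity is a plain dot product, "diversifying" means keeping at most one (the best-scoring) child per parent, and recall is measured over child ids. The function names are hypothetical, and this is not how Lucene or the benchmark implements the search.

```python
import heapq


def dot(a, b):
    """Dot product of two equal-length vectors (assumed similarity)."""
    return sum(x * y for x, y in zip(a, b))


def exact_diversified_top_k(query, children, k):
    """Brute-force baseline: the best child per parent, then the top k.

    children: list of (child_id, parent_id, vector).
    Returns child ids ordered from most to least similar.
    """
    best = {}  # parent_id -> (score, child_id) of its best child so far
    for child_id, parent_id, vector in children:
        score = dot(query, vector)
        if parent_id not in best or score > best[parent_id][0]:
            best[parent_id] = (score, child_id)
    top = heapq.nlargest(k, best.values())
    return [child_id for _, child_id in top]


def recall(exact_results, approx_results):
    """Fraction of exact results that the approximate search also found.

    Both arguments hold one list of result ids per query.
    """
    overlap = sum(
        len(set(exact) & set(approx))
        for exact, approx in zip(exact_results, approx_results)
    )
    total = sum(len(exact) for exact in exact_results)
    return overlap / total if total else 0.0
```

For example, if the exact search returns children `[0, 2]` for a query and the approximate search returns `[0, 3]`, one of the two baseline hits is recovered and the recall for that query is 0.5.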
Sample Run Results