ParentJoin Benchmarks for KNN Search

vigyasharma commented 2 months ago

Adds parent join benchmarks for KNN search. We use the passage search use case with Cohere embeddings created from Wikipedia. Each parent document corresponds to a Wikipedia article, and each child document corresponds to a paragraph (chunk) within that article. Embeddings are present only on child documents.
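
To make the parent/child layout concrete, here is a minimal Python sketch that groups metadata rows into one block per article. It assumes the rows arrive as (wiki_id, para_id) pairs ordered by article, matching the metaFile columns printed in the sample log. The function name `build_blocks` and the sample rows are hypothetical, and this is not the code in the PR.

```python
from itertools import groupby


def build_blocks(rows):
    """Group consecutive (wiki_id, para_id) rows into parent/child blocks.

    Each block lists the child paragraph ids first and the parent id last,
    mirroring Lucene's block-join convention of indexing the children of a
    block before their parent document.
    """
    blocks = []
    for wiki_id, group in groupby(rows, key=lambda row: row[0]):
        children = [para_id for _, para_id in group]
        blocks.append({"children": children, "parent": wiki_id})
    return blocks


# Hypothetical metadata rows (wiki_id, para_id), already ordered by article.
rows = [("w1", 0), ("w1", 1), ("w1", 2), ("w2", 0), ("w3", 0), ("w3", 1)]
blocks = build_blocks(rows)
for block in blocks:
    print(block["parent"], "has", len(block["children"]), "child paragraphs")
```

Only the child entries would carry a vector; the parent entry exists so that hits on paragraphs can be joined back to their article.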

This change uses Lucene's DiversifyingChildrenFloatKnnVectorQuery, with exactSearch() as the baseline and approximateSearch() for the KNN search. Recall is computed as the overlap between the two result sets.
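
As a rough illustration of the recall computation, the Python sketch below scores every child vector exhaustively, keeps only the best child per parent (the diversification step), and measures how much of that exact top-k an approximate result recovers. The helper names, the dot-product scoring, and the sample vectors are assumptions for illustration; the benchmark itself relies on Lucene's query class, not on this code.

```python
def dot(a, b):
    """Dot-product similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def diversified_top_k(scored_children, k):
    """Return the ids of the top-k children, at most one child per parent.

    scored_children: iterable of (child_id, parent_id, score) tuples.
    """
    best = {}  # parent_id -> (child_id, score) of its best child so far
    for child_id, parent_id, score in scored_children:
        if parent_id not in best or score > best[parent_id][1]:
            best[parent_id] = (child_id, score)
    ranked = sorted(best.values(), key=lambda pair: pair[1], reverse=True)
    return [child_id for child_id, _ in ranked[:k]]


def recall(exact_ids, approx_ids):
    """Fraction of the exact results that also appear in the approximate results."""
    if not exact_ids:
        return 0.0
    return len(set(exact_ids) & set(approx_ids)) / len(exact_ids)


# Hypothetical children: child_id -> (parent_id, vector).
children = {
    0: ("A", [1.0, 0.0]),
    1: ("A", [0.9, 0.1]),
    2: ("B", [0.0, 1.0]),
    3: ("C", [0.7, 0.7]),
}
query = [1.0, 0.2]

scored = [(cid, pid, dot(query, vec)) for cid, (pid, vec) in children.items()]
exact = diversified_top_k(scored, k=2)  # exhaustive baseline
approx = [0, 2]                         # hypothetical approximate result
print(exact, recall(exact, approx))
```

Children 0 and 1 share parent A, so only the better of the two can appear in the diversified result. That is why recall here is measured over distinct parents' best children and not over raw nearest neighbors.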

Note: The infer_token_vectors_cohere.py script can generate the parentJoin metadata file for the Cohere embeddings dataset:

python src/python/infer_token_vectors_cohere.py -d <num_docs> -q <num_queries>


Sample Run Results

# parent join with quantization
numDocs = 250000
maxConn = 32
beamWidth = 50
Vector Dimensions: 768
Index Path = knnIndices/cohere-wikipedia-docs-768d.vec-32-50-8-parentJoin.index
Sep 06, 2024 2:37:08 PM org.apache.lucene.internal.vectorization.PanamaVectorizationProvider <init>
INFO: Java vector incubator API enabled; uses preferredBitSize=256
creating index in knnIndices/cohere-wikipedia-docs-768d.vec-32-50-8-parentJoin.index
parentJoin=true
Parent join metaFile columns: wiki_id | para_id
indexed 25000 child documents, with 276 parents
indexed 50000 child documents, with 592 parents
indexed 75000 child documents, with 949 parents
indexed 100000 child documents, with 1322 parents
indexed 125000 child documents, with 1725 parents
indexed 150000 child documents, with 2107 parents
indexed 175000 child documents, with 2527 parents
indexed 200000 child documents, with 2938 parents
indexed 225000 child documents, with 3379 parents
indexed 250000 child documents, with 3803 parents
Indexed 250000 documents with 3803 parent docs. now flush
Indexed 250000 docs in 167 seconds
reindex takes 167694 ms
running 1000 targets; topK=100, fanout=20
completed 1000 searches in 13631 ms: 73 QPS CPU time=13424ms
checking results
SUMMARY: 0.015  13.42   253804  20      32      50      8 bits  100     167694  1.00    post-filter

Results:
recall  latency (ms)    nDoc  fanout  maxConn  beamWidth  quantized  visited  index ms  selectivity   filterType
 0.158          3.96  253804      20       32         50     4 bits      100     25213         1.00  post-filter
 0.162          4.00  253804      20       32         50     7 bits      100     24277         1.00  post-filter
 0.015         13.42  253804      20       32         50     8 bits      100    167694         1.00  post-filter

# parentJoin without quantization
numDocs = 250000
maxConn = 32
beamWidth = 50
Vector Dimensions: 768
Index Path = knnIndices/cohere-wikipedia-docs-768d.vec-32-50-parentJoin.index
Sep 06, 2024 2:43:01 PM org.apache.lucene.internal.vectorization.PanamaVectorizationProvider <init>
INFO: Java vector incubator API enabled; uses preferredBitSize=256
creating index in knnIndices/cohere-wikipedia-docs-768d.vec-32-50-parentJoin.index
parentJoin=true
Parent join metaFile columns: wiki_id | para_id
indexed 25000 child documents, with 276 parents
indexed 50000 child documents, with 592 parents
indexed 75000 child documents, with 949 parents
indexed 100000 child documents, with 1322 parents
indexed 125000 child documents, with 1725 parents
indexed 150000 child documents, with 2107 parents
indexed 175000 child documents, with 2527 parents
indexed 200000 child documents, with 2938 parents
indexed 225000 child documents, with 3379 parents
indexed 250000 child documents, with 3803 parents
Indexed 250000 documents with 3803 parent docs. now flush
Indexed 250000 docs in 27 seconds
reindex takes 27412 ms
running 1000 targets; topK=100, fanout=20
completed 1000 searches in 6307 ms: 158 QPS CPU time=6224ms
checking results
SUMMARY: 0.167   6.22   253804  20      32      50      no      100     27412   1.00    post-filter

Results:
recall  latency (ms)    nDoc  fanout  maxConn  beamWidth  quantized  visited  index ms  selectivity   filterType
 0.167          6.22  253804      20       32         50         no      100     27412         1.00  post-filter

# default run (no parentJoin)
numDocs = 250000
maxConn = 32
beamWidth = 50
Vector Dimensions: 768
Sep 06, 2024 2:49:42 PM org.apache.lucene.internal.vectorization.PanamaVectorizationProvider <init>
INFO: Java vector incubator API enabled; uses preferredBitSize=256
Done indexing 25000 documents.
Done indexing 50000 documents.
Done indexing 75000 documents.
Done indexing 100000 documents.
Done indexing 125000 documents.
Done indexing 150000 documents.
Done indexing 175000 documents.
Done indexing 200000 documents.
Done indexing 225000 documents.
Done indexing 250000 documents.
reindex takes 86058 ms
SUMMARY: 0.004   3.49   250000  20      32      50      8 bits  9531    86058   1.00    post-filter

Results:
recall  latency (ms)    nDoc  fanout  maxConn  beamWidth  quantized  visited  index ms  selectivity   filterType
 0.565          1.89  250000      20       32         50     4 bits     4610     43426         1.00  post-filter
 0.820          1.62  250000      20       32         50     7 bits     4198     43593         1.00  post-filter
 0.004          3.49  250000      20       32         50     8 bits     9531     86058         1.00  post-filter
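
For anyone reading the logs, the QPS and latency figures in the two parent-join runs appear to follow directly from the "completed 1000 searches" lines. The Python sketch below reproduces them. The truncating division and the use of CPU time for latency are inferences from the logged numbers, not confirmed from the tester's source, and the function name `summarize` is hypothetical.

```python
def summarize(num_queries, elapsed_ms, cpu_ms):
    """Derive (QPS, per-query latency in ms) from one logged search run.

    Assumption: QPS is truncated to an integer and latency is the CPU time
    divided by the number of queries, which matches the logged values.
    """
    qps = num_queries * 1000 // elapsed_ms
    latency_ms = round(cpu_ms / num_queries, 2)
    return qps, latency_ms


# Values taken from the two parent-join runs in the sample logs.
quantized_run = summarize(1000, elapsed_ms=13631, cpu_ms=13424)
unquantized_run = summarize(1000, elapsed_ms=6307, cpu_ms=6224)
print(quantized_run, unquantized_run)
```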

mikemccand commented 2 months ago

Very exciting! I will try to review the code changes soon ... thanks @vigyasharma.

> We use the passage search use-case with cohere embeddings created from wikipedia. Each parent document corresponds to a wikipedia article, and child documents correspond to paragraphs (chunk) within the article. Embeddings are only present for child documents.

How do I get the source (vectors) file input to run this?

vigyasharma commented 2 months ago

Thanks for the prompt review, @mikemccand.

> I'm very curious where/how I can get the parent/join meta file to try running this myself...

We can use the src/python/infer_token_vectors_cohere.py script. An earlier change (#283) updated the tool to also create a metadata file.

python src/python/infer_token_vectors_cohere.py -d <num_docs> -q <num_queries>

vigyasharma commented 2 months ago

Resolved conflicts and merge duplication errors. I also like the new output from knnGraphTester with more graph details.

reindex takes 14.05 sec
Force merge index in knnIndices/cohere-wikipedia-docs-768d.vec-32-50-parentJoin.index
Force merge done in 12.76 sec
index has 1 segments
index disk uage is 295.02 MB
SUMMARY: 0.098  0.725   101323  10      6       32      50      no      9       14.05   12.76   1       295.02  1.00    post-filter
Leaf 0 has 4 layers
Leaf 0 has 101323 documents
Graph level=3 size=6, Fanout min=1, mean=2.67, max=4, meandelta=10062.31
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   2   3   3   3   3   3   4   4
Graph level=2 size=61, Fanout min=1, mean=7.54, max=16, meandelta=7024.34
%   0  10  20  30  40  50  60  70  80  90 100
    0   3   5   6   7   7   8   9  10  11  16
Graph level=1 size=2994, Fanout min=1, mean=4.51, max=32, meandelta=5549.65
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   5   9  13  32
Graph level=0 size=100000, Fanout min=1, mean=3.81, max=64, meandelta=3386.53
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   3   3   3   3   3   3  64
Graph level=3 size=6, connectedness=1.00
Graph level=2 size=61, connectedness=1.00
Graph level=1 size=2994, connectedness=1.00
Graph level=0 size=100000, connectedness=0.96

Results:
recall  latency (ms)    nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  force merge s  num segments  index size (MB)
 0.098         0.725  101323    10       6       32         50         no    14.05          12.76             1           295.02
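
For readers new to the graph details above, the Python sketch below shows one plausible way to compute per-level fanout statistics and a connectedness figure from an adjacency list. It assumes connectedness means the fraction of nodes reachable from an entry node by following neighbor links; that definition, the function names, and the toy graph are assumptions and may not match what the tester actually does.

```python
from collections import deque


def fanout_stats(neighbors):
    """Return (min, mean, max) neighbor count over all nodes on one level.

    neighbors: dict mapping node id -> list of neighbor node ids.
    """
    degrees = [len(adjacent) for adjacent in neighbors.values()]
    return min(degrees), sum(degrees) / len(degrees), max(degrees)


def connectedness(neighbors, entry):
    """Fraction of nodes reachable from `entry` via breadth-first search."""
    seen = {entry}
    queue = deque([entry])
    while queue:
        node = queue.popleft()
        for adjacent in neighbors[node]:
            if adjacent not in seen:
                seen.add(adjacent)
                queue.append(adjacent)
    return len(seen) / len(neighbors)


# Toy directed graph: node 3 links to node 0, but nothing links to node 3.
graph = {0: [1, 2], 1: [0], 2: [0, 1], 3: [0]}
stats = fanout_stats(graph)
reachable = connectedness(graph, entry=0)
print(stats, reachable)
```

Under this definition, a value below 1.00 on a level, such as the 0.96 reported for level 0, would mean that some nodes cannot be reached by a search that starts from the entry point.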

mikemccand commented 2 months ago

Thanks @vigyasharma -- this is an exciting improvement to KNN benchmarking!

mikemccand / luceneutil

ParentJoin Benchmarks for KNN Search #296
