Open nitirajrathore opened 1 year ago
Should we commit what we have here even though it's still draft?
Hi @mikemccand. I will do it in parts. The first part refactors the existing code, and I have raised a PR for it. It's mostly refactoring, so we can commit it right away with someone's help. If you find time, please review it: https://github.com/mikemccand/luceneutil/pull/254 Once that's done I will prepare the second PR.
CC : @msokolov
Adding this code for someone to check out and comment on, but not to merge yet. This change picks up changes from the earlier pending PR #234. Added CheckHNSWConnectedness, which measures the connectedness of the HNSW graph at each level and also reports the overall disconnected nodes of the graph. Refactored KnnIndexer out of KnnGraphTester so it can be used standalone, and created KnnIndexerMain for it.
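For readers who want a feel for what a connectedness check involves, here is a minimal sketch. It is not the code in this PR: the class name ConnectednessSketch, the int[][] adjacency representation, and both method names are assumptions made for illustration, and it does not use Lucene's graph API. It runs a breadth-first search from an entry node over directed neighbor links on a single level and counts the nodes that are never reached.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration only; not the CheckHNSWConnectedness class from this PR.
final class ConnectednessSketch {

  /**
   * Counts nodes that cannot be reached from entryPoint by following
   * directed neighbor links. neighbors[i] lists the nodes that node i links to.
   */
  static int countUnreachable(int[][] neighbors, int entryPoint) {
    int n = neighbors.length;
    if (n == 0) {
      return 0;
    }
    boolean[] visited = new boolean[n];
    Deque<Integer> queue = new ArrayDeque<>();
    visited[entryPoint] = true;
    queue.add(entryPoint);
    int reached = 1;
    // Standard breadth-first traversal over the adjacency lists.
    while (!queue.isEmpty()) {
      int node = queue.poll();
      for (int nb : neighbors[node]) {
        if (!visited[nb]) {
          visited[nb] = true;
          reached++;
          queue.add(nb);
        }
      }
    }
    return n - reached;
  }

  /** Unreachable nodes as a percentage of all nodes on this level. */
  static double percentUnreachable(int[][] neighbors, int entryPoint) {
    if (neighbors.length == 0) {
      return 0.0;
    }
    return 100.0 * countUnreachable(neighbors, entryPoint) / neighbors.length;
  }
}
```

In this sketch a node counts as disconnected when no chain of links from the entry node leads to it, even if the node itself has outgoing links. A real check would repeat this per graph level and per segment; how the PR defines and aggregates disconnectedness may differ.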
Tests
ant vectors300-docs
ant vectors100-docs
./gradlew :src:main:run -PmainClass=knn.KnnIndexerMain --args=" -docvectorspath lucene/benchmarks/data/enwiki-20120502-lines-1k-100d.vec -indexpath lucene/benchmarks/indices/vector_index -maxconn 16 -beamwidth 100 -vectorencoding FLOAT32 -similarityfunction DOT_PRODUCT -numdocs 1000000 -dimension 100"
./gradlew :src:main:checkHnswConnected -Pindex="lucene/benchmarks/indices/vector300_index" -Pknn-field="knn"
The script currently generates only single-segment indexes. I will try to create multiple segments and check again. But this shows that there is disconnectedness in the graph even when there is little dynamism in creating and updating documents.
Results
100-dim, 1M vectors: ~0.4% disconnectedness
300-dim, 1M vectors: ~1% disconnectedness