mikemccand / luceneutil

Various utility scripts for running Lucene performance tests
Apache License 2.0
205 stars 115 forks source link

[Draft] Code to reproduce the disconnectedness in HNSW using open source datasets #236

Open nitirajrathore opened 1 year ago

nitirajrathore commented 1 year ago

Adding this code for someone to checkout and comment but not merge yet. This change picks up changes from earlier pending PR #234. Added CheckHNSWConnectedness for finding the connectedness of HNSW graph at each level of the graph and also overall disconnected nodes of the graph.

Refactored KnnIndexer out of KnnGraphTester to be able to use standalone and created KnnIndexerMain for it.

Tests

  1. Use existing ant script or new Gradle task to create the vector files for wiki data using the gloVe data. ant vectors300-docs ant vectors100-docs
  2. Create the index for 100 dimension and 300 dimension vectors using the KnnIndexerMain ./gradlew :src:main:run -PmainClass=knn.KnnIndexerMain --args=" -docvectorspath lucene/benchmarks/data/enwiki-20120502-lines-1k-100d.vec -indexpath lucene/benchmarks/indices/vector_index -maxconn 16 -beamwidth 100 -vectorencoding FLOAT32 -similarityfunction DOT_PRODUCT -numdocs 1000000 -dimension 100"
  3. Check for HNSW connectedness ./gradlew :src:main:checkHnswConnected -Pindex="lucene/benchmarks/indices/vector300_index" -Pknn-field="knn"

The script currently only generated single segment indexes. I will try to create multiple segments and check again. But this proves that the there is disconnectedness in graph even when there is less dynamism in creating and updating documents.

Results

100-Dim 1M-vectors

~ 0.4 % disconnectedness

For index /home/nitirar/mywork/lucene/benchmarks/indices/vector100_index field : knn
Level = 0       Total Nodes = 1000000   Reachable Nodes = 996035        Unreachable Nodes = 3965        %Disconnectedness = 0.3965
Level = 1       Total Nodes = 62820     Reachable Nodes = 61442 Unreachable Nodes = 1378        %Disconnectedness = 2.193569
Level = 2       Total Nodes = 3879      Reachable Nodes = 3820  Unreachable Nodes = 59  %Disconnectedness = 1.5210105
Level = 3       Total Nodes = 233       Reachable Nodes = 232   Unreachable Nodes = 1   %Disconnectedness = 0.42918456
Level = 4       Total Nodes = 12        Reachable Nodes = 12    Unreachable Nodes = 0   %Disconnectedness = 0.0
Overall graph :         Total Nodes = 1000000   Reachable Nodes = 996183        Unreachable Nodes = 3817        %Disconnectedness = 0.3817

300-Dim 1M-vectors

~ 1 % disconnectedness

For index /home/nitirar/mywork/lucene/benchmarks/indices/vector300_index field : knn
Level = 0       Total Nodes = 1000000   Reachable Nodes = 990377        Unreachable Nodes = 9623        %Disconnectedness = 0.9623
Level = 1       Total Nodes = 62820     Reachable Nodes = 60100 Unreachable Nodes = 2720        %Disconnectedness = 4.329831
Level = 2       Total Nodes = 3879      Reachable Nodes = 3756  Unreachable Nodes = 123 %Disconnectedness = 3.1709204
Level = 3       Total Nodes = 233       Reachable Nodes = 233   Unreachable Nodes = 0   %Disconnectedness = 0.0
Level = 4       Total Nodes = 12        Reachable Nodes = 12    Unreachable Nodes = 0   %Disconnectedness = 0.0
Overall graph :         Total Nodes = 1000000   Reachable Nodes = 990729        Unreachable Nodes = 9271        %Disconnectedness = 0.9271
mikemccand commented 9 months ago

Should we commit what we have here even though it's still draft?

nitirajrathore commented 9 months ago

Should we commit what we have here even though it's still draft?

Hi @mikemccand. I will do it in parts. First part is to refactor the existing code for which I have raised the PR. Its mostly refactoring so we can immediately commit with someones help. If you find time please review it. : https://github.com/mikemccand/luceneutil/pull/254 Once thats done I will prepare second PR.

CC : @msokolov