mikemccand / luceneutil

Various utility scripts for running Lucene performance tests
Apache License 2.0

What aKNN dimensionality should we use in nightly benchmark? #219

Open mikemccand opened 1 year ago

mikemccand commented 1 year ago

(Spinoff from discussions with @rmuir, and from trying to test https://github.com/apache/lucene/pull/12311 in luceneutil.)

Currently the nightly benchmark tests 100 dimensions, but this seems neither common nor realistic since 1) it is not a power of 2, and 2) it is maybe smallish? Should we pick a more "realistic" number (and if so, what)?

In the Vector optimizations, a power-of-two dimension can execute faster than a non-power-of-two one, since the "unrolling" won't need to do any scalar math for the remainder and can use pure SIMD.
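To make the power-of-two argument concrete, here is a minimal sketch of a Panama-style dot product, assuming the incubating jdk.incubator.vector API (illustrative only, not Lucene's actual VectorUtil#dotProduct):

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class DotSketch {
  static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

  static float dotProduct(float[] a, float[] b) {
    FloatVector acc = FloatVector.zero(SPECIES);
    int i = 0;
    // main SIMD loop: whole lanes at a time
    for (int bound = SPECIES.loopBound(a.length); i < bound; i += SPECIES.length()) {
      FloatVector va = FloatVector.fromArray(SPECIES, a, i);
      FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
      acc = va.fma(vb, acc); // acc += a[i..] * b[i..], lane-wise
    }
    float sum = acc.reduceLanes(VectorOperators.ADD);
    // scalar tail: empty when the dimension is a multiple of the lane count
    // (e.g. 1024 with 16 float lanes), but runs 4 extra iterations for dim=100
    for (; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }
}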

mikemccand commented 1 year ago

In fact, since power-of-two dimensions can be faster, maybe for some methods we should zero-pad? E.g. dotProduct will give the same (ish?) answer if you pad with 0s to the nearest/fastest power of two?
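The padding trick works because padded lanes contribute zero to the sum. A quick sketch (hypothetical helper; padding would presumably happen once per vector, not per comparison):

static float[] zeroPad(float[] v) {
  int padded = 1;
  while (padded < v.length) {
    padded <<= 1; // next power of two >= v.length
  }
  float[] out = new float[padded]; // trailing entries stay 0f
  System.arraycopy(v, 0, out, 0, v.length);
  return out;
}

// dotProduct(zeroPad(a), zeroPad(b)) == dotProduct(a, b) in exact arithmetic,
// since each padded lane adds 0 * 0 = 0; float summation order may still
// differ slightly, hence the "(ish?)" above.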

uschindler commented 1 year ago

Zero-padding should be done in the implementation method; I am not sure we should really add the overhead (also in heap space) to all algorithms in Lucene. And I am not sure how much the Panama Vector API offers here?
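For what it's worth, one thing Panama does offer for the tail is masked loads, which avoid both zero-padding and a scalar remainder loop. A sketch building on the dotProduct above (masked operations can be slow on some platforms, so this is a possibility, not a recommendation):

import jdk.incubator.vector.VectorMask;

// Instead of the scalar tail, handle the remainder with one masked step
// (with reduceLanes(ADD) moved after this loop):
for (; i < a.length; i += SPECIES.length()) {
  VectorMask<Float> m = SPECIES.indexInRange(i, a.length);
  FloatVector va = FloatVector.fromArray(SPECIES, a, i, m); // unset lanes read as 0f
  FloatVector vb = FloatVector.fromArray(SPECIES, b, i, m);
  acc = acc.add(va.mul(vb), m);
}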

rmuir commented 1 year ago

I think it is enough to just use a bigger vector size that better represents the performance issues? Maybe for users who only use 100 or 256 dimensions it looks like the current graph, and they don't complain about the performance?

But try 756 or 1024 and it is a different story. And users are screaming bloody murder wanting thousands and thousands of dimensions for ChatGPT, but we are over here benchmarking 100.

msokolov commented 1 year ago

I was able to measure a larger speedup with 768-dim vectors than with lower-dimensional ones. For these tests I increased tasksPerCat=20 (from the default of 1) to give more chance to warm up. This is pretty slow though, and I think if we did this we wouldn't need to run 20 JVMs (results seem to converge pretty fast).

768-dimensions, trained using mpnet model

                            TaskQPS baseline      StdDevQPS candidate      StdDev                Pct diff p-value
                        PKLookup      206.34      (2.0%)      208.62      (1.5%)    1.1% (  -2% -    4%) 0.046
                  HighTermVector       73.82      (3.2%)       84.85      (1.6%)   14.9% (   9% -   20%) 0.000
                AndHighMedVector       64.59      (3.4%)       75.64      (1.7%)   17.1% (  11% -   23%) 0.000
               AndHighHighVector       71.74      (3.3%)       88.23      (1.9%)   23.0% (  17% -   29%) 0.000
                AndHighLowVector       74.30      (3.4%)       94.13      (2.1%)   26.7% (  20% -   33%) 0.000
                   MedTermVector       64.84      (3.4%)       83.01      (1.9%)   28.0% (  22% -   34%) 0.000
                   LowTermVector       70.95      (3.2%)       94.43      (2.0%)   33.1% (  26% -   39%) 0.000

Here is a JFR profile from that run

PERCENT       CPU SAMPLES   STACK                                                                                                                                     
49.24%        209604        org.apache.lucene.util.VectorUtil#dotProduct()                                                                                            
2.21%         9405          jdk.internal.foreign.AbstractMemorySegmentImpl#checkBounds()                                                                              
2.02%         8590          org.apache.lucene.codecs.lucene95.Lucene95HnswVectorsReader$OffHeapHnswGraph#seek()                                                       
1.94%         8264          org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel()                                                                               
1.70%         7249          perf.VectorDictionary#vectorDiv()                                                                                                         
1.69%         7198          jdk.internal.misc.Unsafe#copyMemoryChecks()                                                                                               
1.55%         6612          org.apache.lucene.util.packed.DirectMonotonicReader#get()                                                                                 
1.31%         5588          org.apache.lucene.codecs.lucene95.OffHeapFloatVectorValues#vectorValue()                                                                  
1.24%         5265          org.apache.lucene.util.packed.DirectReader$DirectPackedReader12#get()                                                                     
1.10%         4698          java.lang.foreign.MemorySegment#copy()                                                                                                    
1.00%         4261          jdk.internal.foreign.MemorySessionImpl#toSessionImpl()                                                                                    
1.00%         4259          jdk.internal.foreign.MemorySessionImpl#checkValidStateRaw()                                                                               
0.98%         4181          java.lang.invoke.VarHandleGuards#guard_LJ_I()                                                                                             
0.98%         4166          org.apache.lucene.store.DataInput#readVInt()                                                                                              
0.96%         4104          org.apache.lucene.util.SparseFixedBitSet#insertLong()                                                                                     
0.84%         3581          org.apache.lucene.util.LongHeap#upHeap()                                                                                                  
0.83%         3536          org.apache.lucene.store.MemorySegmentIndexInput$SingleSegmentImpl#readShort()                                                             
0.82%         3477          java.util.Arrays#binarySearch0()        
0.82%         3473          org.apache.lucene.util.packed.DirectReader$DirectPackedReader16#get()
0.80%         3417          java.lang.Object#equals()                                                                                                                 
0.80%         3405          jdk.internal.misc.Unsafe#checkPrimitivePointer()                                                                                          
0.77%         3262          java.util.Objects#requireNonNull()                     
0.76%         3255          org.apache.lucene.util.LongHeap#downHeap()                                                                                                
0.69%         2944          jdk.jfr.internal.JVM#emitEvent()                       
0.67%         2847          org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#seekExact()
0.62%         2650          org.apache.lucene.util.LongHeap#push()               
0.62%         2632          java.lang.invoke.VarHandleSegmentAsBytes#get()
0.58%         2452          org.apache.lucene.store.MemorySegmentIndexInput#readByte()
0.57%         2443          org.apache.lucene.util.SparseFixedBitSet#getAndSet()
0.57%         2411          java.lang.invoke.VarHandleSegmentAsShorts#checkAddress()

384-dimensions, trained using mlm model

                            TaskQPS baseline      StdDevQPS candidate      StdDev                Pct diff p-value
                        PKLookup      179.25      (7.1%)      176.70      (5.5%)   -1.4% ( -13% -   12%) 0.479
                   MedTermVector      327.54     (23.4%)      363.44     (16.3%)   11.0% ( -23% -   66%) 0.086
                  HighTermVector      318.49     (25.6%)      353.68     (16.8%)   11.0% ( -24% -   71%) 0.106
                AndHighMedVector      321.10     (23.9%)      359.66     (12.3%)   12.0% ( -19% -   63%) 0.045
               AndHighHighVector      313.74     (25.1%)      352.26     (19.5%)   12.3% ( -25% -   75%) 0.084
                   LowTermVector      313.92     (26.3%)      354.19     (15.2%)   12.8% ( -22% -   73%) 0.059
                AndHighLowVector      241.66     (20.4%)      290.45     (13.5%)   20.2% ( -11% -   67%) 0.000 

I saw bigger speedups on indexing tasks; I guess they are more dominated by pure vector operations. Here is a comparison using a different script, knnPerfTest.py, which is based on KnnGraphTester (using 768-dim mpnet vectors):

baseline

recall  query ms    nDoc    index ms
 0.838      0.53   10000        7785
 0.755      1.92  100000      179471
 0.734      3.99  200000      485050

with change

recall  query ms    nDoc    index ms
 0.838      0.37   10000        4492
 0.755      1.66  100000       95889
 0.734      3.03  200000      260163

I'll follow up with some scripts for generating this data, copy some data to a download dir, and then maybe we can update the nightly setup. I'm a little worried about jumping to 768-dim vectors just in terms of indexing time; it might take 10 hours single-threaded, or maybe more? Perhaps if we reduced to 8-bit vectors it would go a bit faster, or we could compromise and run the benchmarks using the 384-dim vectors. Anyway, for now I'll just post the scripts and data.

mikemccand commented 1 year ago

Ooooh these look like great results @msokolov! Thanks for testing. Sort of spooky we still see some "overhead" methods in the top N hotspots (jdk.internal.foreign.AbstractMemorySegmentImpl#checkBounds(), jdk.internal.misc.Unsafe#copyMemoryChecks(), jdk.internal.foreign.MemorySessionImpl#checkValidStateRaw(), etc.), but net/net it looks a lot better than before, I guess because of the increased iterations.

I'll follow up with some scripts for generating this data, copy some data to a download dir, and then maybe we can update the nightly setup.

+1

We may need to get the IndexRearranger solution working (@zhaih!) if the increased dimensionality indeed makes the fully single-threaded indexing too slow. IndexRearranger would let us build the same deterministic index much more quickly by using many threads to index, then rearranging the documents at the end.

uschindler commented 1 year ago

For these tests I increased tasksPerCat=20 (from the default of 1) to give more chance to warm up. This is pretty slow though, and I think if we did this we wouldn't need to run 20 JVMs (results seem to converge pretty fast).

This is a long-standing mis-design of the benchmark. In the beginning, the benchmarker passed special command line flags like "-Xbatch" to the JVM to "pre-optimize". This is no longer possible with recent Java versions (-Xbatch is fine for pre-optimizing, but it stops at C1; the real C2 optimizations are not applied during batch compilation). Newer JVMs rely more on tiered compilation driven by runtime profiling.

More complex features take more time to optimize, so we should really have some more "real-world" setups. People run servers 24/7 that execute queries and index stuff, so our benchmark should emulate this. Like JMH, we should also add enough warmup "per JVM" before we actually measure, and then let the JVM run for a much longer time.
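For reference, this is the per-JVM warmup-then-measure pattern that JMH encodes in annotations (a sketch of the idea only; the benchmark body is a stand-in, not luceneutil code):

import java.util.concurrent.ThreadLocalRandom;
import org.openjdk.jmh.annotations.*;

@Warmup(iterations = 5, time = 10)        // per-JVM warmup, results discarded
@Measurement(iterations = 10, time = 30)  // then measure for a long time
@Fork(4)                                  // fewer JVM forks, each running longer
@BenchmarkMode(Mode.Throughput)
@State(Scope.Benchmark)
public class DotBench {
  float[] a = new float[768];
  float[] b = new float[768];

  @Setup
  public void fill() {
    for (int i = 0; i < a.length; i++) {
      a[i] = ThreadLocalRandom.current().nextFloat();
      b[i] = ThreadLocalRandom.current().nextFloat();
    }
  }

  @Benchmark
  public float dot() {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }
}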

The current default settings with the small wikimedia dump are not representative at all and should no longer be used by people to test new Lucene features. We should rethink and modify the benchmarks:

rmuir commented 1 year ago

thanks a lot for posting these indexer benchmarks @msokolov

mikemccand commented 1 year ago

Like JMH, we should also add enough warmup "per JVM" before we actually measure, and then let the JVM run for a much longer time.

You can control the number of warmup iterations (how many times each query is executed and then discarded), but it's not automagic, just a fixed count. And you can specify how many real iterations to run for each task.

  • Reduce the number of JVMs started; I think 4 rounds are plenty (the default is 20 at the moment).

+1, we should try reducing this.

We used 20 long ago because HotSpot noise was ... well, very noisy. Maybe HotSpot variance is tighter these days and we can get a strong enough signal with fewer iterations?

mikemccand commented 1 year ago

These gains are really quite incredible -- I tweeted about it: https://twitter.com/mikemccand/status/1664285218285694979?s=20

I'll try to cutover nightlies to the larger vectors...

mikemccand commented 1 year ago

I wonder if we also have some slowness on reading these float[] (and byte[]) vectors through IndexInput? Maybe the Panama native memory APIs could help with this limitation in our MMap impl?

uschindler commented 1 year ago

I wonder if we also have some slowness on reading these float[] (and byte[]) vectors through IndexInput? Maybe the Panama native memory APIs could help with this limitation in our MMap impl?

We can only fix this at a later stage, when the Vector APIs are public. At the moment we indeed do a duplicate copy (mmap copies into a heap array; the heap array is wrapped by the Vector API). In the future we could change the IndexInput API to return FloatVector instances (via a new method like readFloatVector) that directly wrap a segment slice.

At the moment we can't do this, as it would require our public API to expose incubating APIs. As a hack we could add a new method, Object readFloatVector(), which returns null by default, and which in the new MMap impl (if the incubating module is enabled) returns a FloatVector instance hidden behind Object. If it is non-null, the caller could cast it.
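A rough sketch of that hack (hypothetical names throughout; no such API exists in Lucene today):

// Public API: no incubating imports, so it compiles everywhere.
public abstract class VectorCapableInput {
  /** Returns null by default; a MemorySegment-backed impl may return a
   *  jdk.incubator.vector.FloatVector, deliberately typed as Object. */
  public Object readFloatVector(int dims) throws java.io.IOException {
    return null;
  }
}

// A caller in vectorization code (which already compiles against the
// incubating module) would cast on the happy path:
//   Object v = in.readFloatVector(768);
//   if (v != null) {
//     FloatVector fv = (FloatVector) v; // wraps the mmap slice, no copy
//   } else {
//     // fall back to copying floats into a heap array
//   }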

But that's all too crazy for our current code, especially while the API is still incubating, so we could use this at the earliest when it moves to the preview phase. Fully typed support can only come later. That's the major limitation: we can't expose those APIs until our main code is on a minimal Java LTS version with full non-incubating, non-preview support for vectors.

Once this is there we can rewrite much more!

benwtrent commented 1 year ago

@mikemccand what is the process for annotating benchmarks? https://home.apache.org/~mikemccand/lucenebench/indexing.html

I saw throughput and QPS drop, and after some time discovered this change :) It would be good to annotate it so folks don't dig down the same rabbit hole.