mikemccand / luceneutil

Various utility scripts for running Lucene performance tests
Apache License 2.0

Add 1 million docs datasource for vector search benchmark #178

Closed: mocobeta closed this issue 2 years ago

mocobeta commented 2 years ago

This is mainly for convenience, but wouldn't a data source of at least 1 million docs be needed for benchmarking?

mocobeta commented 2 years ago

Just to note: when I tried to index 1 million Wikipedia docs with vector values (200 dimensions), the indexer process suddenly hung after running for 30 minutes or so. It's reproducible for me.

It looks like this queue becomes empty when it hangs.

[Screenshot from 2022-05-20 23-25-08]

I can index 1 million docs without vector values.

msokolov commented 2 years ago

I wonder if it was taking a long time to merge? Did you look at the log files? Did it complete the baseline index? It's also possible it now takes a long time "rearranging" the index since this will require rebuilding all the vector graphs. I think this rearranging idea was discussed, to enable concurrent indexing while maintaining the same index geometry across multiple indices, but I'm not sure if it was ever fully implemented.

I'm trying now with 1M docs, 100 dimensions. It seems to have completed the baseline index, but I have a bug in the candidate, so it's not a full run yet.

msokolov commented 2 years ago

I was able to index 1M vectors and run the vector task benchmarks. I did see occasional pauses for merges, but nothing like half an hour, though I used 100d vectors. I did run into a pathological case where I had accidentally indexed an all-zero vector for every document, and when I did this, indexing took over 5 hours instead of around 11 minutes. So there is a bad thing going on there. Maybe you hit that or something like it? We need to sort out what that is for sure.
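
One way to rule out the all-zero case is to scan the vectors file before indexing. The following is a minimal sketch, assuming the file is a headerless sequence of little-endian 32-bit floats with a fixed dimension per vector; that layout, the function name `count_zero_vectors`, and the file path in the usage note are assumptions for illustration, not something confirmed in this thread.

```python
import struct


def count_zero_vectors(path, dim):
    """Return (total_vectors, all_zero_vectors) for a raw float32 file.

    Assumption: the file holds fixed-size records of `dim` little-endian
    4-byte floats with no header. A trailing partial record is ignored.
    """
    record_size = 4 * dim
    fmt = f"<{dim}f"
    total = 0
    zeros = 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(record_size)
            if len(buf) < record_size:
                break
            total += 1
            # any() is False only when every component is 0.0 (or -0.0).
            if not any(struct.unpack(fmt, buf)):
                zeros += 1
    return total, zeros
```

For example, `count_zero_vectors("vectors.vec", 100)` (hypothetical path) returns a pair; if the second number is close to the first, the input is likely degenerate in the way described above.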

mikemccand commented 2 years ago

Try using jstack to see what the indexing and merging threads are doing?
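
As a rough sketch of that suggestion, the snippet below runs the JDK's `jstack <pid>` and keeps only the thread stanzas whose name matches a pattern. The helper names `capture_jstack` and `threads_matching` are made up for illustration. `Lucene Merge Thread` is the name prefix Lucene's `ConcurrentMergeScheduler` is expected to give its merge threads; the names of the indexing threads depend on the indexer, so that pattern is left to the caller.

```python
import re
import subprocess


def capture_jstack(pid):
    """Run `jstack <pid>` (requires a JDK on PATH) and return its output."""
    result = subprocess.run(
        ["jstack", str(pid)], capture_output=True, text=True, check=True
    )
    return result.stdout


def threads_matching(dump_text, name_pattern):
    """Return the thread stanzas whose quoted thread name matches name_pattern.

    A stanza is a blank-line-separated block that starts with the thread
    name in double quotes, as in jstack's output.
    """
    matcher = re.compile(name_pattern)
    stanzas = []
    for stanza in re.split(r"\n\s*\n", dump_text):
        stanza = stanza.strip()
        header = re.match(r'"([^"]*)"', stanza)
        if header and matcher.search(header.group(1)):
            stanzas.append(stanza)
    return stanzas
```

Calling `threads_matching(capture_jstack(pid), r"Lucene Merge Thread")` a few times, some seconds apart, shows whether the merge threads are making progress or sitting in the same frames each time.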

Also, rearrange was reverted, and I don't think it was committed again?

mocobeta commented 2 years ago

I'll also try with 100 dimensions and report if I find something noticeable.