Just to note: when I tried to index 1 million Wikipedia docs with vector values (200 dimensions), the indexer process suddenly hung after running for 30 minutes or so; it's reproducible for me.
It looks like this queue becomes empty when it hangs.
I can index 1 million docs without vector values.
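For reference, here is a minimal sketch of the kind of indexing loop I mean, assuming Lucene 9.x's `KnnVectorField` API; the field name, similarity function, and vector source below are placeholders, not the actual benchmark code:

```java
import java.nio.file.Paths;
import java.util.Random;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnVectorField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.store.FSDirectory;

public class VectorIndexSketch {
  private static final Random random = new Random(42);

  public static void main(String[] args) throws Exception {
    try (FSDirectory dir = FSDirectory.open(Paths.get("/tmp/vector-index"));
         IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
      for (int i = 0; i < 1_000_000; i++) {
        Document doc = new Document();
        // One 200-dimension vector per document, as in the run above.
        doc.add(new KnnVectorField("vector", nextVector(200), VectorSimilarityFunction.EUCLIDEAN));
        writer.addDocument(doc);
      }
    }
  }

  // Hypothetical stand-in for the Wikipedia-derived vector source.
  private static float[] nextVector(int dims) {
    float[] v = new float[dims];
    for (int i = 0; i < dims; i++) {
      v[i] = random.nextFloat();
    }
    return v;
  }
}
```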
I wonder if it was taking a long time to merge? Did you look at the log files? Did it complete the baseline index? It's also possible it now takes a long time "rearranging" the index since this will require rebuilding all the vector graphs. I think this rearranging idea was discussed, to enable concurrent indexing while maintaining the same index geometry across multiple indices, but I'm not sure if it was ever fully implemented.
I'm trying now with 1M docs, 100 dimensions. It seems to have completed the baseline index, but I have a bug in the candidate, so it's not a full run yet.
I was able to index 1M vectors and run the vector task benchmarks. I did see occasional pauses for merges, but nothing like half an hour, though I used 100d vectors. I did run into a pathological case where I had accidentally indexed all-zero vectors for every document; when I did that, indexing took over 5 hours instead of around 11 minutes. So there is something bad going on there; maybe you hit that, or something like it? We need to sort out what that is for sure.
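In case it helps anyone else avoid the same trap, a trivial guard like this on the input vectors would have caught my mistake; `isAllZero` is just an illustrative helper, not something in the benchmark code:

```java
// Illustrative check for the degenerate all-zero vectors that blew indexing
// time up from ~11 minutes to over 5 hours in my run.
static boolean isAllZero(float[] v) {
  for (float x : v) {
    if (x != 0f) {
      return false;
    }
  }
  return true;
}
```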
Try using jstack to see what the indexing and merging threads are doing?
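Something like this should capture them (assuming a single indexer JVM whose PID you can look up with jps):

```
jps -l                      # list running JVMs to find the indexer's PID
jstack <pid> > threads.txt  # dump all thread stacks for inspection
```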
Also, rearrange was reverted, and I don't think it was committed again?
I'll also try with 100 dimensions and report if I find something noticeable.
Mainly for convenience, though: wouldn't a data source of at least 1 million docs be needed for benchmarking?