opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[BUG] Page size calculation with Point-in-Time Pagination with Search Slicing #16272


nwoike commented 1 week ago

Describe the bug

I am seeing unexpected behavior when using Point-in-Time (PIT) pagination with search slicing and wanted to verify whether this is expected.

Search Slicing with PIT appears to behave something like the following:

Say we have 50,500 documents matching the PIT query. The maximum number of documents that OpenSearch can page at a time is 10,000. Currently we divide that 50,500 by 10,000 and round up to a maximum of 6 pages so that we also pick up those extra 500 documents.

I expected OpenSearch to break this up into pages of [10000, 10000, 10000, 10000, 10000, 500].

What I am actually seeing is OpenSearch breaking this up into random page sizes of [9950, 10100, 9905, 10008, 10007, 530]. Since OpenSearch can only return 10,000 documents per page, 115 documents are discarded ([0, 100, 0, 8, 7, 0]) in this example.

I would expect OpenSearch to not create pages larger than its maximum page size in this case.
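
For concreteness, here is a minimal sketch (plain Python, using only the numbers from this report) of the slice-count calculation and of how many documents get dropped when a slice holds more hits than `size`:

```python
import math

total_matches = 50_500   # documents matching the PIT query
max_page_size = 10_000   # default index.max_result_window

# Slice count derived by rounding up, as described above.
slice_count = math.ceil(total_matches / max_page_size)   # -> 6

# Per-slice hit counts observed in this example.
observed = [9950, 10100, 9905, 10008, 10007, 530]

# Documents silently dropped because a slice held more hits than `size`.
dropped = [max(hits - max_page_size, 0) for hits in observed]
print(slice_count, dropped, sum(dropped))   # 6 [0, 100, 0, 8, 7, 0] 115
```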

Related component

Search

To Reproduce

  1. Create a PIT.
  2. Use the PIT to count the number of documents matching your query.
  3. Use the PIT with search slicing to paginate over the matches, keeping a running total of documents.
  4. Verify that the original count does not match the number of documents that were paged through (a reproduction sketch follows this list).
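
A minimal reproduction sketch along these lines, assuming a local cluster; the host, index name, and query are placeholders rather than values from this report, and only the REST endpoints come from the OpenSearch API:

```python
import requests

HOST = "http://localhost:9200"   # assumed local cluster
INDEX = "my-index"               # hypothetical index name
QUERY = {"match_all": {}}        # stand-in for the actual query
SIZE = 10_000                    # maximum page size
SLICES = 6                       # e.g. ceil(matching docs / SIZE)

# 1. Create a PIT.
pit_id = requests.post(f"{HOST}/{INDEX}/_search/point_in_time",
                       params={"keep_alive": "10m"}).json()["pit_id"]

# 2. Use the PIT to count the documents matching the query.
count_body = {"size": 0, "track_total_hits": True, "query": QUERY,
              "pit": {"id": pit_id, "keep_alive": "10m"}}
expected = requests.post(f"{HOST}/_search",
                         json=count_body).json()["hits"]["total"]["value"]

# 3. Page over the matches with the PIT and search slicing, keeping a running total.
paged = 0
for slice_id in range(SLICES):
    body = {"size": SIZE, "query": QUERY,
            "pit": {"id": pit_id, "keep_alive": "10m"},
            "slice": {"id": slice_id, "max": SLICES}}
    hits = requests.post(f"{HOST}/_search", json=body).json()["hits"]["hits"]
    paged += len(hits)

# 4. These totals should match, but differ whenever a slice holds more than SIZE hits.
print(expected, paged)
```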

Expected behavior

Point-in-Time with Search Slicing should not create pages larger than the maximum page size for OpenSearch.

dblock commented 1 week ago

Do you think this is reproducible in a test? Want to try to write one?

nwoike commented 1 week ago

I am leaning towards this being a misunderstanding on my part but would like some clarification.

I do think it is reproducible; I have just tested this at much smaller page sizes. Here is an example with a page size of 200.

Total Index Documents: 1700
Total Index Documents matching Query: 1450
PIT w/ Search Slicing Actual Processed Documents: 1298

Query Size: 200
Max Slice Count: 8 (ceil(1450/200))

Total hits by page (slice) id: [161, 133, 140, 290, 262, 147, 170, 147] = 1450
Hits by page id (limited by size 200): [161, 133, 140, 200, 200, 147, 170, 147] = 1298

The +90 and +62 account for the 152 missing documents in this example. 8 pages should be enough to retrieve 1,600 documents (at a size of 200), and my query only matches 1,450, so why am I not getting them all?

I guess the question comes down to this:

Is there a mechanism to calculate the correct max slice count, or a safe size to use with Search Slicing, so that no single page ends up missing hits from its slice?

nwoike commented 1 week ago

For now I am just going to set the query size to the maximum of 10,000 rather than deriving it from the slice count calculation.
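
For reference, a sketch of that workaround, with placeholder values for the PIT id, query, and slice count; the size is pinned to the 10,000 maximum regardless of how many documents the slice is expected to hold:

```python
# Placeholder values; only the request-body shape is taken from the OpenSearch API.
pit_id = "<pit id returned when the PIT was created>"
slices = 6            # derived slice count is kept as-is
slice_id = 0          # current slice being paged

body = {
    "size": 10_000,                       # always request the maximum page size
    "query": {"match_all": {}},           # stand-in for the actual query
    "pit": {"id": pit_id, "keep_alive": "10m"},
    "slice": {"id": slice_id, "max": slices},
}
```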

dblock commented 5 days ago

I think you narrowed it down to the right question. Maybe @bharath-techie can help us out here?

bharath-techie commented 4 days ago

Since slicing is based on the doc ID, I doubt there is a predictable way of defining the size per slice. [Other search experts can correct me here.]

So your approach of increasing size already seems to be the right way to counter this.

But maybe use a 2x multiplier to be safe?

So in the original example, instead of dividing 50,500 into 6 pages with a size of 10,000, we can divide it into 10 or 11 pages/slices while keeping the size at 10,000 [50,500 / 10 ≈ 5,000 expected docs per slice => a 2x buffer against the 10,000 size].

In the updated example with the smaller number of docs, we can again use a size of 400 and keep the slice count the same.
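
A small sketch of that sizing rule (my own helper, not an OpenSearch API): pick the slice count so the expected number of documents per slice is only `page_size / buffer`, leaving headroom for slices that land above average.

```python
import math

def slice_count_with_buffer(total_matches: int, page_size: int, buffer: float = 2.0) -> int:
    """Slices chosen so the *expected* docs per slice is page_size / buffer."""
    return math.ceil(total_matches / (page_size / buffer))

# Original example: 50,500 matches, size 10,000 -> 11 slices (~4,600 expected docs each).
print(slice_count_with_buffer(50_500, 10_000))   # 11

# Smaller example: 1,450 matches with the size doubled to 400 -> the same 8 slices.
print(slice_count_with_buffer(1_450, 400))       # 8
```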

msfroh commented 4 days ago

> Since slicing is based on the doc ID, I doubt there is a predictable way of defining the size per slice. [Other search experts can correct me here.]

@bharath-techie is correct. Slicing is intended to "approximately" divide up the work into disjoint parts based on doc ID hashes, in order to process those parts in parallel (while guaranteeing that every doc is in a slice, and no docs are in multiple slices).

While doc ID hashes tend to be uncorrelated with any other query you might be running, there's still variance in the distribution of docs across slices. Given that the expected slice size with 1450/8 is roughly 181, I think 290 is a little higher than I would have expected. I implemented something similar ~11 years ago and found that the variance was the expected value, so we're ~8 standard deviations out. I think it should follow a Poisson distribution and not a normal distribution, so it's less concentrated around the mean. Unfortunately, I have forgotten enough stats that I don't remember how to calculate P(k >= 290) when λ = 181. (Using an online Poisson calculator, I can't plug in 290 and 181, but I found that P(k >= 29) when λ = 18 is about 1%. I think the distribution becomes even flatter for higher values of λ, so the probability may be even higher.)
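
For reference, a minimal sketch of that tail-probability calculation (using SciPy, and assuming slice sizes are roughly Poisson-distributed around the expected value above):

```python
from scipy.stats import poisson

lam = 1450 / 8                 # expected docs per slice (~181)
# P(K >= 290) = P(K > 289), i.e. the survival function evaluated at 289.
print(poisson.sf(289, lam))
```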

tl;dr: Using more slices would help (at least reduce the likelihood that a slice holds more docs than your page size). Alternatively, if you want to paginate with consistent page sizes, instead of slicing you can sort by _doc ascending and then use the search_after API to start after the _doc returned by the previous page.
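
A sketch of that search_after alternative, assuming the same placeholder host/index/query as in the earlier sketch; each page is full-sized, and the sort values of the last hit seed the next request:

```python
import requests

HOST = "http://localhost:9200"   # assumed local cluster
INDEX = "my-index"               # hypothetical index name
QUERY = {"match_all": {}}        # stand-in for the actual query
PAGE_SIZE = 10_000

pit_id = requests.post(f"{HOST}/{INDEX}/_search/point_in_time",
                       params={"keep_alive": "10m"}).json()["pit_id"]

search_after = None
total = 0
while True:
    body = {
        "size": PAGE_SIZE,
        "query": QUERY,
        "pit": {"id": pit_id, "keep_alive": "10m"},
        "sort": [{"_doc": "asc"}],           # cheap, stable sort for deep paging
    }
    if search_after is not None:
        body["search_after"] = search_after  # resume after the previous page

    hits = requests.post(f"{HOST}/_search", json=body).json()["hits"]["hits"]
    if not hits:
        break
    total += len(hits)
    search_after = hits[-1]["sort"]          # sort values of the last hit on this page

print(total)   # running total of documents paged through
```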