maurermj08 / efetch

Evidence Fetcher (efetch) is a web-based file explorer, viewer, and analyzer.
Apache License 2.0

Paging Past 10000 #2

Closed maurermj08 closed 8 years ago

maurermj08 commented 8 years ago

When paging past 10,000 items, Elasticsearch throws an exception: "QueryPhaseExecutionException[Result window is too large, from + size must be less than or equal to: [10000] but was [28950]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level parameter.]"

The following excerpt from the Elasticsearch guide explains why:

"Deep Paging in Distributed Systems

To understand why deep paging is problematic, let’s imagine that we are searching within a single index with five primary shards. When we request the first page of results (results 1 to 10), each shard produces its own top 10 results and returns them to the coordinating node, which then sorts all 50 results in order to select the overall top 10.

Now imagine that we ask for page 1,000—results 10,001 to 10,010. Everything works in the same way except that each shard has to produce its top 10,010 results. The coordinating node then sorts through all 50,050 results and discards 50,040 of them!

You can see that, in a distributed system, the cost of sorting results grows exponentially the deeper we page. There is a good reason that web search engines don’t return more than 1,000 results for any query."

https://www.elastic.co/guide/en/elasticsearch/guide/current/pagination.html
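
The arithmetic in the excerpt can be reproduced with a short sketch. The function names, the default of five shards, and the default window of 10000 are taken from the text above or are illustrative assumptions; none of this is efetch or Elasticsearch code.

```python
def deep_paging_cost(from_, size, shards=5):
    """Return (per_shard, gathered, discarded) for one paged search.

    Simplified model from the excerpt: every shard returns its top
    (from_ + size) hits and the coordinating node keeps only `size`.
    """
    per_shard = from_ + size          # hits each shard must produce
    gathered = per_shard * shards     # hits the coordinating node sorts
    discarded = gathered - size       # hits thrown away after sorting
    return per_shard, gathered, discarded


def exceeds_result_window(from_, size, max_result_window=10000):
    """True when from_ + size is past the window quoted in the exception."""
    return from_ + size > max_result_window


print(deep_paging_cost(0, 10))        # first page: results 1 to 10
print(deep_paging_cost(10000, 10))    # results 10,001 to 10,010
print(exceeds_result_window(10000, 10))
```

In this simplified model the number of hits gathered scales with from + size and with the shard count, which is why a request such as the one reported above (from + size of 28950) is rejected by default.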

The current workaround is to raise the limit on the efetch indices:

curl -XPUT "http://localhost:9200/efetch*/_settings" -d '{ "index" : { "max_result_window" : 500000 } }'
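
For anyone scripting this, the same settings update could be built from Python. This is a minimal sketch using only the standard library; the helper name is hypothetical, and the host, index pattern, and window value are copied from the curl command above. It only constructs the request; sending it requires a running Elasticsearch node.

```python
import json
import urllib.request


def build_max_result_window_request(window=500000,
                                    base_url="http://localhost:9200",
                                    index_pattern="efetch*"):
    """Build (but do not send) a PUT to the index _settings endpoint."""
    # Same JSON body as the curl workaround.
    body = json.dumps({"index": {"max_result_window": window}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index_pattern}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


req = build_max_result_window_request()
print(req.get_method(), req.full_url)
# To apply it against a running node (assumption: reachable at base_url):
# urllib.request.urlopen(req)
```

Raising the window only moves the limit; each deep page still costs from + size hits per shard, so the scroll API mentioned in the exception remains the better fit for very large result sets.
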

maurermj08 commented 8 years ago

The current version of efetch does not depend on Elasticsearch, so I am going to close this bug.