icgc-argo / song-search

Song Search - GQL microservice for searching maestro generated song indexes
GNU General Public License v3.0
0 stars 0 forks source link

🐛 Query info.totalHits caps at 10000 if index has more than 10000 documents #51

Closed jaserud closed 3 years ago

jaserud commented 3 years ago

Describe the bug

If index has more then 10000 documents, the info.totalHits only returns 10000 as value even if there are actaully more than that.

Example using it on an index where files_centric index has more than 10000 documents: image

This is only an issue if file_centric (or analysis_centric) has more than 10000 documents.

Steps To Reproduce

Steps to reproduce the behaviour:

  1. connect song-search to es with file and analysis indices each one with more than 10000 documents
  2. run query like in example above. So, query with page from > 10000 and query for info.totalHits
  3. See the incorrect response

Expected behaviour

Should return correct number of total hits. The bug is most likely due to elastic search and its 10000 query limit so the fix might be their.

Note: this could be an issue in workflow-search too, but need to check.

rosibaj commented 3 years ago

https://www.elastic.co/guide/en/elasticsearch/reference/7.5/search-request-body.html#request-body-search-track-total-hits

rosibaj commented 3 years ago
jaserud commented 3 years ago

For aggregations, track_total_hits is need if we want totalHits to be correct from search: image

However, we will still get an error if aggregation returns more buckets than max_buckets image

rosibaj commented 3 years ago

@jaserud thanks - i guess song-search is not using any buckets explicitly, but should we set this to be the same as totalhits anyways?

jaserud commented 3 years ago

@rosibaj The search.max_bucket is actually a cluster wide config, and I'm not sure how it affects the cluster so I think it might need a bit more thinking before setting it to a default based on song-search's totalHits. The default for this value is supposed to be 65,536 actually.

Currently song-search has no bucket aggs. Workflow search has one on the run.states (bucket runs by states) and there are only a dozen or so states.

So my two cents, I think the current rdpc-gateway shouldn't have any issues with the default max_buckets so we can leave it their :thinking:.

jaserud commented 3 years ago

The fix was released in 2.6.0 which has already been released to prod