mediacloud / news-search-api

Internal API server that offers search access to the Media Cloud Online News Archive (in Elasticsearch).
https://mediacloud.org
GNU Affero General Public License v3.0

ES caps total hits at 10K? #26

Closed philbudne closed 9 months ago

philbudne commented 9 months ago

It looks like https://github.com/mediacloud/story-indexer/issues/166 is an ES query API issue, so opening issue here.

philbudne commented 9 months ago

Summary: To get the real total hit count, you now need to set the (possibly expensive) track_total_hits parameter:

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-your-data.html#track-total-hits
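For illustration, here is a minimal sketch of what setting that parameter might look like in a search request body, and how the reported total can be read back. The function names and the query_string query are assumptions made for this example, not the actual news-search-api code. The only Elasticsearch details relied on are the track_total_hits request parameter and the hits.total object, whose relation field is "eq" for an exact count and "gte" for a lower bound.

```python
def build_overview_query(query_string: str, exact_total: bool = True) -> dict:
    """Build a search body (hypothetical helper, not the project's code)."""
    return {
        "query": {"query_string": {"query": query_string}},
        "size": 0,  # only the total is wanted here, not the documents
        # True asks Elasticsearch for an exact count instead of the
        # default behaviour of stopping at 10,000.
        "track_total_hits": exact_total,
    }


def describe_total(response: dict) -> str:
    """Report whether hits.total is exact ("eq") or a lower bound ("gte")."""
    total = response["hits"]["total"]
    if total["relation"] == "eq":
        return f"{total['value']} hits"
    return f"at least {total['value']} hits"


# Hand-written sample responses, not real archive data.
capped = {"hits": {"total": {"value": 10000, "relation": "gte"}}}
exact = {"hits": {"total": {"value": 48213, "relation": "eq"}}}
print(describe_total(capped))
print(describe_total(exact))
```

Checking the relation field lets a caller tell a genuine count of 10,000 apart from a capped one.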

@rahulbot wrote:

Ugh. That's ridiculous (from a database user perspective). This is critical data we need to show users on every search result. Could we:
a. sum the attention-over-time data and use that to provide an estimated total? (Django front-end server could do this, or mediacloud-news-search library used by mc-providers)
b. turn on the expensive solution in staging and measure true impact on our data?
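Option "a." could be sketched roughly as follows, assuming the attention-over-time data arrives as a list of per-day buckets shaped like an Elasticsearch date_histogram result (each with a doc_count). The function name and the sample buckets are hypothetical. The sum is only an estimate, since any story that falls outside the histogram (for example, one with no publication date) would not be counted.

```python
def estimate_total_from_histogram(buckets: list[dict]) -> int:
    """Sum per-bucket counts to estimate total matching stories.

    Hypothetical helper: assumes each bucket is a dict with a
    "doc_count" key, as in a date_histogram aggregation result.
    """
    return sum(bucket["doc_count"] for bucket in buckets)


# Made-up sample buckets for illustration only.
sample_buckets = [
    {"key_as_string": "2023-10-01", "doc_count": 6200},
    {"key_as_string": "2023-10-02", "doc_count": 7100},
    {"key_as_string": "2023-10-03", "doc_count": 5400},
]
estimated_total = estimate_total_from_histogram(sample_buckets)
print(estimated_total)  # 18700
```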

philbudne commented 9 months ago

The "b." alternative raises a question: how best to develop and test news-search-api? I've created a separate issue for that: https://github.com/mediacloud/news-search-api/issues/27

pgulley commented 9 months ago

Sorry, I missed this thread. I've just gone ahead and added the track_total_hits parameter to the overview query. I'll ping IA tomorrow to see how they addressed this issue in their system, but I figure we can play with this for now.

rahulbot commented 9 months ago

I'm getting real numbers back in the providers API now (i.e. bigger than 10,000) 🎉 Is this fix deployed?

pgulley commented 9 months ago

Yes! Closing