mediacloud / web-search

Code that drives the public web-based tools for the Media Cloud Online News Archive and Directory.
https://search.mediacloud.org
Apache License 2.0

suspicious systemic caching inefficiency related to user web searches? #645

Closed rahulbot closed 2 months ago

rahulbot commented 2 months ago

We need to start paying closer attention to the performance of our search system. A first item: I think (1) total attention, (2) attention over time, (3) language, (4) domains, (5) TLDs, and (6) sample stories are all currently being served by the same news-search-API endpoint under the hood. I think each of these ends up calling the overview query endpoint. Evidence: see the news-search-api source and the number of times _overview_query is called in the mediacloud-search-api client.

We are caching the results in Django, but when a user hits search I think the browser fires off ~6(!) requests in parallel (browser->Django->ES) that all ask for those overview results at the same time, and nothing has been cached yet the first time they search. I think this means that each user-generated query from the website is causing way more work than it needs to.
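To make the suspected first-search behavior concrete, here is a minimal simulation, not the actual Django or news-search-api code. The names `overview_query` and `cached_overview`, the widget list, and the returned payload are all hypothetical stand-ins; the barrier simply forces the six calls to overlap, the way a slow backend query would.

```python
import threading

cache = {}                      # stand-in for the Django-side cache
backend_calls = 0               # how many times the "backend" was actually hit
counter_lock = threading.Lock()
all_in_flight = threading.Barrier(6)  # forces the six requests to overlap


def overview_query(query):
    """Hypothetical stand-in for the shared overview call to the backend."""
    global backend_calls
    with counter_lock:
        backend_calls += 1
    # No request finishes until all six have started, so none of them
    # can populate the cache in time to help the others.
    all_in_flight.wait(timeout=5)
    return {"query": query, "total": 42}


def cached_overview(query):
    """Check-then-fill cache: fine for repeat searches, no help on the first."""
    if query in cache:
        return cache[query]
    result = overview_query(query)
    cache[query] = result
    return result


# Six page widgets all ask for the same overview at the same moment.
widgets = ["total", "over-time", "language", "domains", "tlds", "sample"]
threads = [
    threading.Thread(target=cached_overview, args=("climate",)) for _ in widgets
]
for t in threads:
    t.start()
for t in threads:
    t.join()
calls_on_first_search = backend_calls

# A later repeat of the same search is served from the cache.
cached_overview("climate")
calls_after_repeat = backend_calls
```

In this sketch every one of the six parallel requests misses the empty cache, so the backend is hit six times for a single search; only a later repeat of the same query benefits from caching.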

Potential fixes (if I'm right):

rahulbot commented 2 months ago

In short, this fix is necessary, but not sufficient.

More detail: I dug into the fix and understand why it isn't working. Right now the caching is done by mc-providers, keyed on function name, method args, and method kwargs. This is smart for cross-platform search. However, for both the Media Cloud and Wayback Machine providers the various methods call the same function under the hood, so the caching isn't speeding things up: the caching layer doesn't know that (for instance) count and sample are both calling the same thing. I'll consider alternatives and move this issue to mc-providers.
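A rough sketch of why a cache keyed on function name, args, and kwargs misses here. This is not the mc-providers code: the decorator, `overview_query`, and the `count`/`sample` bodies are hypothetical stand-ins. The second half shows one possible alternative (caching the shared call itself), offered only as an illustration, not as the change that was eventually made.

```python
import functools

CACHE = {}
BACKEND_CALLS = []  # records each simulated trip to the search backend


def cached_by_function_name(fn):
    """Cache keyed on (function name, args, kwargs), as described above."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        key = (fn.__name__, args, tuple(sorted(kwargs.items())))
        if key not in CACHE:
            CACHE[key] = fn(*args, **kwargs)
        return CACHE[key]
    return wrapper


def overview_query(query):
    """Hypothetical shared call that both public methods rely on."""
    BACKEND_CALLS.append(query)
    return {"total": 42, "matches": ["story-a", "story-b"]}


@cached_by_function_name
def count(query):
    return overview_query(query)["total"]


@cached_by_function_name
def sample(query):
    return overview_query(query)["matches"]


# Same query, different method names: two distinct cache keys, two backend hits.
count("climate")
sample("climate")
calls_with_method_level_cache = len(BACKEND_CALLS)

# One possible alternative (an assumption): cache the shared call instead.
CACHE.clear()
BACKEND_CALLS.clear()
cached_overview = cached_by_function_name(overview_query)


def count_v2(query):
    return cached_overview(query)["total"]


def sample_v2(query):
    return cached_overview(query)["matches"]


count_v2("climate")
sample_v2("climate")
calls_with_shared_cache = len(BACKEND_CALLS)
```

With the method-level key, `count` and `sample` each trigger their own backend call for the same query; with the cache placed on the shared call, the second method reuses the first method's result.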

rahulbot commented 2 months ago

Just-pushed changes (cache-related) make this way faster for most queries.