owid / owid-grapher

A platform for creating interactive data visualizations
https://ourworldindata.org

Develop a dev measure for search quality #3378

Closed: larsyencken closed this issue 5 months ago

larsyencken commented 6 months ago

What

Develop a dataset and a one-line command for benchmarking search in dev / staging / prod.

Why? Why now?

In this cycle we are trying to improve search, but search is a game of whack-a-mole: a change can easily make some searches better while making others worse.

To genuinely improve search, we need an objective measure of how much it has improved.

Technical notes

Todo

larsyencken commented 6 months ago

Where I got to last week:

The results are something like this:

╭─────────┬──────────────────────────────────────────────╮
│ id      │ <gdoc id>                                    │
│ slug    │ low-carbon-electricity                       │
│         │ ╭───┬───────────────────────────────╮        │
│ queries │ │ 0 │ decarbonize energy            │        │
│         │ │ 1 │ clean electricity             │        │
│         │ │ 2 │ energy system electrification │        │
│         │ │ 3 │ low-carbon energy             │        │
│         │ │ 4 │ electricity vs energy         │        │
│         │ │ 5 │ global energy mix             │        │
│         │ ╰───┴───────────────────────────────╯        │
╰─────────┴──────────────────────────────────────────────╯

As an example, if we agree that "decarbonize energy" is a valid query, then so is "decarbonize", since a user typing the full query passes through it on the way.

I intend to post-process these to get all the whole-word query prefixes, which will expand the query list considerably. Then I will reshape the data, grouping by query, to list which documents are valid results for each query. The whole thing should take an hour or two, and then I'll call the synthetic data done for now.
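
Roughly, the post-processing might look like this. It's only a sketch: the record fields (`slug`, `queries`) match the example above, but the function names and shapes are illustrative.

```python
from collections import defaultdict


def whole_word_prefixes(query: str) -> list[str]:
    # "decarbonize energy" -> ["decarbonize", "decarbonize energy"]
    words = query.split()
    return [" ".join(words[:i]) for i in range(1, len(words) + 1)]


def queries_to_documents(records: list[dict]) -> dict[str, set[str]]:
    # Reshape (document -> queries) into (query -> valid documents),
    # expanding each query into all of its whole-word prefixes first.
    valid: dict[str, set[str]] = defaultdict(set)
    for record in records:
        for query in record["queries"]:
            for prefix in whole_word_prefixes(query):
                valid[prefix].add(record["slug"])
    return dict(valid)
```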

larsyencken commented 6 months ago

I now have a synthetic query dataset, with ~4800 queries. That's probably going to be a bit much for everyday use. I also looked at just single-word queries, which is more like ~850. I might even subsample that to start with.
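
If I do subsample, it should be deterministic, so that benchmark runs stay comparable over time. Something like this sketch (the sample size is arbitrary):

```python
import random


def subsample_queries(queries: list[str], n: int = 200, seed: int = 1) -> list[str]:
    # Seeded sample so every benchmark run scores the same set of queries.
    rng = random.Random(seed)
    return sorted(rng.sample(queries, min(n, len(queries))))
```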

Today I'll look at replaying and scoring search. My thinking so far is:

Given two relevant documents, the scoring algorithm won't care which comes first, but my thinking is that we'll get a sensible ordering automatically just by ensuring that whatever ranks highly is relevant at all.
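
For illustration, one metric with that property is plain recall@k, which only asks whether the relevant documents show up near the top, not in what order (a sketch, not a committed choice):

```python
def recall_at_k(results: list[str], relevant: set[str], k: int = 10) -> float:
    # Fraction of relevant documents that appear in the top-k results.
    # Swapping the positions of two relevant documents leaves the score
    # unchanged, which is exactly the order-insensitivity described above.
    if not relevant:
        return 0.0
    hits = sum(1 for slug in results[:k] if slug in relevant)
    return hits / min(len(relevant), k)
```

Replaying would then mean running each benchmark query against dev / staging / prod and averaging this score across queries.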

larsyencken commented 5 months ago

Eventually we'd prefer to replace this measure with one based on real user data, but nonetheless I'm calling this done for now.
