owid / owid-grapher

A platform for creating interactive data visualizations
https://ourworldindata.org

Develop a dev measure for search quality #3378

Closed: larsyencken closed this issue 5 months ago

larsyencken commented 6 months ago

What

Develop a dataset and a one-line command for benchmarking search in dev / staging / prod.

Why? Why now?

In this cycle we are trying to improve search, but search is a game of whack-a-mole: a change can easily make some searches better while making others worse.

To genuinely improve search, we need an objective measure of how much it has improved.

Technical notes

Todo

larsyencken commented 6 months ago

Where I got to last week:

The results are something like this:

╭─────────┬──────────────────────────────────────────────╮
│ id      │ <gdoc id>                                    │
│ slug    │ low-carbon-electricity                       │
│         │ ╭───┬───────────────────────────────╮        │
│ queries │ │ 0 │ decarbonize energy            │        │
│         │ │ 1 │ clean electricity             │        │
│         │ │ 2 │ energy system electrification │        │
│         │ │ 3 │ low-carbon energy             │        │
│         │ │ 4 │ electricity vs energy         │        │
│         │ │ 5 │ global energy mix             │        │
│         │ ╰───┴───────────────────────────────╯        │
╰─────────┴──────────────────────────────────────────────╯

As an example, if we agree that "decarbonize energy" is a valid query, then so is "decarbonize", since a user typing the full query passes through it on the way.

I intend to post-process these to get all the whole-word query prefixes, which will expand the query list considerably. Then I will reshape the data, grouping by query, to list which documents are valid results for each query. The whole thing should take an hour or two, and then I'll call the synthetic data done for now.
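
Roughly, the post-processing might look like this. It's only a sketch: the record fields (`slug`, `queries`) match the example above, but the function names and shapes are illustrative.

```python
from collections import defaultdict


def whole_word_prefixes(query: str) -> list[str]:
    # "decarbonize energy" -> ["decarbonize", "decarbonize energy"]
    words = query.split()
    return [" ".join(words[:i]) for i in range(1, len(words) + 1)]


def queries_to_documents(records: list[dict]) -> dict[str, set[str]]:
    # Reshape (document -> queries) into (query -> valid documents),
    # expanding each query into all of its whole-word prefixes first.
    valid: dict[str, set[str]] = defaultdict(set)
    for record in records:
        for query in record["queries"]:
            for prefix in whole_word_prefixes(query):
                valid[prefix].add(record["slug"])
    return dict(valid)
```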

larsyencken commented 6 months ago

I now have a synthetic query dataset, with ~4800 queries. That's probably going to be a bit much for everyday use. I also looked at just single-word queries, which is more like ~850. I might even subsample that to start with.
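
If I do subsample, it should be deterministic, so that benchmark runs stay comparable over time. Something like this sketch (the sample size is arbitrary):

```python
import random


def subsample_queries(queries: list[str], n: int = 200, seed: int = 1) -> list[str]:
    # Seeded sample so every benchmark run scores the same set of queries.
    rng = random.Random(seed)
    return sorted(rng.sample(queries, min(n, len(queries))))
```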

Today I'll look at replaying and scoring search. My thinking so far is:

Given two relevant documents, the scoring algorithm won't care which comes first, but my thinking is that we'll get a sensible ordering automatically just by ensuring that whatever ranks highly is relevant at all.
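
For illustration, one metric with that property is plain recall@k, which only asks whether the relevant documents show up near the top, not in what order (a sketch, not a committed choice):

```python
def recall_at_k(results: list[str], relevant: set[str], k: int = 10) -> float:
    # Fraction of relevant documents that appear in the top-k results.
    # Swapping the positions of two relevant documents leaves the score
    # unchanged, which is exactly the order-insensitivity described above.
    if not relevant:
        return 0.0
    hits = sum(1 for slug in results[:k] if slug in relevant)
    return hits / min(len(relevant), k)
```

Replaying would then mean running each benchmark query against dev / staging / prod and averaging this score across queries.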

larsyencken commented 5 months ago

Eventually we'd prefer to replace this measure with one based on real user data, but nonetheless I'm calling this done for now.
