Closed by larsyencken 5 months ago
Where I got to last week:
posts and posts_gdocs
The results are something like this:
╭─────────┬──────────────────────────────────────────────╮
│ id │ <gdoc id> │
│ slug │ low-carbon-electricity │
│ │ ╭───┬───────────────────────────────╮ │
│ queries │ │ 0 │ decarbonize energy │ │
│ │ │ 1 │ clean electricity │ │
│ │ │ 2 │ energy system electrification │ │
│ │ │ 3 │ low-carbon energy │ │
│ │ │ 4 │ electricity vs energy │ │
│ │ │ 5 │ global energy mix │ │
│ │ ╰───┴───────────────────────────────╯ │
╰─────────┴──────────────────────────────────────────────╯
As an example, if we agree that "decarbonize energy" is a valid query, then so is "decarbonize", since it is a whole-word prefix that a user types on the way to the full query.
I intend to post-process these to get all the whole-word query prefixes, which will expand out the query list quite a lot. Then I will reshape the data, grouping by query, to list which documents are valid results for that query. The whole thing should take an hour or two, then I'll call the synthetic data done for now.
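A rough sketch of that post-processing, in case it helps anyone following along. It assumes each record has a slug and a list of queries, as in the output above; the function names and the use of slugs as document keys are illustrative choices, not what the final script necessarily does.

```python
from collections import defaultdict


def whole_word_prefixes(query: str) -> list[str]:
    """Return every whole-word prefix of a query, shortest first."""
    words = query.split()
    return [" ".join(words[:i]) for i in range(1, len(words) + 1)]


def group_by_query(docs: list[dict]) -> dict[str, list[str]]:
    """Expand each document's queries into prefixes, then invert the mapping
    so each query lists the documents that count as valid results."""
    by_query: dict[str, set[str]] = defaultdict(set)
    for doc in docs:
        for query in doc["queries"]:
            for prefix in whole_word_prefixes(query):
                by_query[prefix].add(doc["slug"])
    # Sort slugs so the output is stable between runs.
    return {query: sorted(slugs) for query, slugs in by_query.items()}


# Hypothetical input, shaped like the record shown above.
docs = [
    {
        "slug": "low-carbon-electricity",
        "queries": ["decarbonize energy", "clean electricity"],
    }
]
expanded = group_by_query(docs)
```

Here "decarbonize" and "clean" both end up pointing at low-carbon-electricity alongside the original two-word queries, which is where the growth in the query list comes from.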
I now have a synthetic query dataset, with ~4800 queries. That's probably going to be a bit much for everyday use. I also looked at just single-word queries, which is more like ~850. I might even subsample that to start with.
Today I'll look at replaying and scoring search. My thinking so far is:
precision@4
Of two relevant documents, the scoring algorithm won't care which is ranked first, but my thinking is that we will get a sensible ordering automatically just by ensuring that what's in the top results is more relevant.
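For reference, a minimal sketch of how precision@4 could be computed when replaying queries. The data shapes are assumptions: `run` maps each query to the ranked slugs that search returned, and `truth` maps each query to the set of slugs the synthetic dataset considers valid. Neither name comes from the actual benchmark code.

```python
def precision_at_k(results: list[str], relevant: set[str], k: int = 4) -> float:
    """Fraction of the top-k results that are relevant.

    Order within the top k does not matter, only membership.
    """
    top = results[:k]
    hits = sum(1 for slug in top if slug in relevant)
    return hits / k


def mean_precision_at_k(
    run: dict[str, list[str]], truth: dict[str, set[str]], k: int = 4
) -> float:
    """Average precision@k over every query in the ground-truth dataset.

    A query with no search results scores 0.
    """
    if not truth:
        return 0.0
    scores = [
        precision_at_k(run.get(query, []), relevant, k)
        for query, relevant in truth.items()
    ]
    return sum(scores) / len(scores)


# Hypothetical replay of two queries.
truth = {"decarbonize": {"a", "c"}, "clean": {"b"}}
run = {"decarbonize": ["a", "x", "c", "y"], "clean": ["x", "y", "z", "w"]}
score = mean_precision_at_k(run, truth)
```

Dividing by k rather than by the number of results returned means a search that returns fewer than four results is penalised for the empty slots, which seems like the behaviour we want here, though that is a design choice worth revisiting.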
We'd prefer to replace the existing measure with real user data, but nonetheless calling this done for now.
What
Develop a dataset and a one-line command for benchmarking search in dev / staging / prod.
Why? Why now?
In this cycle we are trying to improve search, but search is a game of whack-a-mole where changes can easily make some searches better but others worse.
To truly improve search we would like an objective measure of how much it has improved.
Technical notes
Todo
Generate a staff search dataset