sveltejs / svelte.dev

The Svelte omnisite
https://svelte.dev

Improve search results ranking #875

Open Rich-Harris opened 1 year ago

Rich-Harris commented 1 year ago

Describe the problem

To use @tcc-sejohnson's example: if you search for adapter-static in the docs, the page you're probably looking for — this one — is the fifth result:

[screenshot: search results for "adapter-static", with the adapter-static docs page appearing as the fifth result]

Describe the proposed solution

I think the easiest and most reliable solution would be to add keywords frontmatter to the relevant markdown files, so that if you match one of them (or a keyword starts with your search term) that document is treated as higher priority than all others.

We could indicate the keyword in the UI somehow but I don't think it's necessary.

Alternatives considered

No response

Importance

nice to have

Additional Information

No response

benmccann commented 1 year ago

3 of the 4 documents that rank above it don't contain adapter-static a single time, so it must be tokenizing it into "adapter" and "static". Perhaps we could either remove - as a delimiter character or special-case the adapter names so they're treated as a single word.
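
As an illustration of the second idea, here's a minimal sketch using flexsearch's option to pass a custom tokenize function (the regex and example are illustrative only; our real setup, a flexsearch index inside a web worker, is more involved):

  import FlexSearch from 'flexsearch';

  // keep hyphenated terms like "adapter-static" as a single token by not
  // treating "-" as a delimiter; everything else still splits on non-word chars
  const index = new FlexSearch.Index({
    tokenize: (str) => str.toLowerCase().split(/[^a-z0-9-]+/).filter(Boolean)
  });

  index.add(1, 'How to use adapter-static');
  // index.search('adapter-static') -> [1]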

benmccann commented 1 year ago

I think there's another bug as well:

https://github.com/sveltejs/kit/blob/f953c9d810be8b9211ce1fa456d9c96224ec55dc/sites/kit.svelte.dev/src/lib/search/search.js#L64

The problem is that sub-sections rank lower than main pages.

https://kit.svelte.dev/docs/adapter-static#usage is automatically pushed to the bottom because it has a #.
https://kit.svelte.dev/docs/configuration jumps to the top because it has no #, despite not even containing the text adapter-static.

It should probably group first and then rank: i.e. we group by page and then rank each page based on its highest-ranking sub-section, or something like that.
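
Roughly the following post-processing sketch, assuming each hit carries a URL and a score (hypothetical field names, not the actual shape of our search results):

  // group hits by their parent page, then rank pages by their best-scoring
  // sub-section, so /docs/adapter-static#usage isn't buried below pages that
  // never mention the query at all
  function groupAndRank(hits) {
    const pages = new Map();
    for (const hit of hits) {
      const page = hit.href.split('#')[0];
      const group = pages.get(page) ?? { page, best: -Infinity, hits: [] };
      group.hits.push(hit);
      group.best = Math.max(group.best, hit.score);
      pages.set(page, group);
    }
    return [...pages.values()].sort((a, b) => b.best - a.best);
  }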

oodavid commented 1 year ago

Might it be best to implement 3rd party search? Algolia is free for open-source, and does a great job of indexing and ranking...

https://www.algolia.com/for-open-source/ https://www.algolia.com/doc/tools/crawler/getting-started/overview/

Edit: Oh, not quite free: 200,000 search requests per month - still, maybe worth budgeting for.

Rich-Harris commented 1 year ago

I've been meaning to write a blog post about this, but there are a variety of reasons we don't want to use third-party search tools:

benmccann commented 1 year ago

The only bullet point I'd comment on before you write this blog post is:

We don't want to cede control over the UI or the search results. While it's arguably true that Algolia will have better out-of-the-box results than our homegrown setup (which uses flexsearch), we have the ability to improve it and tailor it as we see fit, which we'd lose if we had something generic

Flexsearch is incredibly hard to customize relative to Algolia, Elastic, or just about any index I've used in the past. I've spent the morning trying and simply can't understand how Flexsearch's scoring works. I've filed a few issues in the Flexsearch repo asking for details and hope to come back to this once I have a better idea of how to tweak it.

In the meantime, I've sent a PR which just does some housekeeping on our side: https://github.com/sveltejs/kit/pull/8727

Rich-Harris commented 1 year ago

Right, but we could swap out flexsearch for something else if we needed to. Hell, we could write our own!

oodavid commented 1 year ago

A very well-reasoned response. Personally I'd put result relevance above all of those points.

I've had some success in the past with Typesense; IIRC it has a rational approach to ranking and relevance. Might be worth a peek:

https://typesense.org/docs/guide/ranking-and-relevance.html

Flexsearch has a list of other libraries, benchmarked:

https://nextapps-de.github.io/flexsearch/bench/

enBonnet commented 1 year ago

Typesense looks really cool @oodavid.

Could we try implementing it? I'd like to participate.

Hetarth02 commented 1 year ago

Right, but we could swap out flexsearch for something else if we needed to. Hell, we could write our own!

LunrJs is also good and flexible enough, with good documentation. Other alternatives might be stork.js and fuse.js.

enBonnet commented 1 year ago

LunrJs is also good and flexible enough, with good documentation. Other alternatives might be stork.js and fuse.js.

There are a lot of options; we should focus on the problem we want to solve and look at which one is best for it.

The current problem seems to be the priorities.

benmccann commented 1 year ago

I'm open to alternatives as I don't particularly like flexsearch, but it'd be nice to find one that allows us to keep the functionality we have today. In particular, today you can see results as you type, and many of the tools mentioned above don't appear to support that. The search we use today also doesn't require any extra infrastructure. I'm not sure if any of the tools mentioned are great fits, but I'd love it if someone could find one that fits the bill.

Hetarth02 commented 1 year ago

I am not biased towards lunrjs, but I have been working with it recently and I think it checks off all your requirements.

@benmccann I didn't get the "search based off prefix" part. Can you please explain? (If possible with a small example.)

benmccann commented 1 year ago

What I mean by search off a prefix is this... Imagine that you're typing "adapter". When you start and you type "a" it will show all words beginning with "a", when you get to "ad" it will show all words starting with "ad", and so on. You can see how the search auto-completes in realtime on kit.svelte.dev as you do this.

Hetarth02 commented 1 year ago

What I mean by search off a prefix is this... Imagine that you're typing "adapter". When you start and you type "a" it will show all words beginning with "a", when you get to "ad" it will show all words starting with "ad", and so on. You can see how the search auto-completes in realtime on kit.svelte.dev as you do this.

Correct me if I am wrong but are you perhaps talking about auto-complete?

Would this be something we are looking for?

Autocomplete library by Algolia

benmccann commented 1 year ago

It's a bit different than autocomplete. It's not completing your queries. Rather it's doing searches based on partial query strings. E.g. to take the "adapter" example from earlier, the way it works is by indexing "a", "ad", "ada", "adap", "adapt", "adapte", "adapter". This takes a lot more memory, but provides the experience you see today on kit.svelte.dev.
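
To make the memory trade-off concrete, here's a library-agnostic sketch of that kind of prefix indexing (all names are illustrative; it's roughly the strategy flexsearch offers via its "forward" tokenization mode):

  // map every prefix of every word to the documents containing that word,
  // so partial queries like "ad" or "adapt" match immediately as you type
  function buildPrefixIndex(docs) {
    const index = new Map(); // prefix -> Set of document ids
    for (const { id, text } of docs) {
      for (const word of text.toLowerCase().split(/\W+/)) {
        for (let i = 1; i <= word.length; i++) {
          const prefix = word.slice(0, i);
          if (!index.has(prefix)) index.set(prefix, new Set());
          index.get(prefix).add(id);
        }
      }
    }
    return index;
  }

  // buildPrefixIndex([{ id: 1, text: 'adapter-static' }]).get('adap') -> Set { 1 }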

Hetarth02 commented 1 year ago

It's a bit different than autocomplete. It's not completing your queries. Rather it's doing searches based on partial query strings. E.g. to take the "adapter" example from earlier, the way it works is by indexing "a", "ad", "ada", "adap", "adapt", "adapte", "adapter". This takes a lot more memory, but provides the experience you see today on kit.svelte.dev.

I see. Then perhaps this is what we are looking for:

Wildcards Lunrjs

I think this can reproduce the same functionality you are talking about.

benmccann commented 1 year ago

Ah, yes! Thanks for the pointer. Lunrjs may indeed work then!

I'd be happy to review any attempt to switch out flexsearch for lunrjs if anyone wants to take a stab at it.

Hetarth02 commented 1 year ago

Ah, yes! Thanks for the pointer. Lunrjs may indeed work then!

I'd be happy to review any attempt to switch out flexsearch for lunrjs if anyone wants to take a stab at it.

I can try to make a prototype. Can anyone guide me through some of the steps to set up the code for the docs locally?

@benmccann @enBonnet

Rich-Harris commented 1 year ago

Can anyone guide me through some of the steps to set up the code for the docs locally?

You'll need to have pnpm installed, then...

git clone git@github.com:sveltejs/kit
cd kit
pnpm install
cd sites/kit.svelte.dev
pnpm dev

...and you should be off to the races!

Rich-Harris commented 1 year ago

One thing I'll note is that the web worker that powers our current search — which includes all of flexsearch plus our logic that sits around it — is 18kb of unminified code (though it probably should be minified, not sure why it isn't).

By contrast, lunr by itself weighs 99kb. Probably not a dealbreaker but something to be conscious of.

kevmodrome commented 1 year ago

I suspect you want to keep the search locally on the client, but if you're looking for an alternative to algolia there's meilisearch: https://docs.meilisearch.com - though 11kb minified+zipped

benmccann commented 1 year ago

lunr is only 29k minified, so it's not too bad. The thing I just noticed that gives me more hesitation is that it appears to basically be abandoned: it hasn't been updated since 2020, it still uses Travis CI, there are a number of unreviewed PRs, etc. It'd be nice if we could find something that's a bit better maintained.

benmccann commented 1 year ago

https://github.com/lucaong/minisearch looks like a promising option. It'd probably be better to try it than lunr

Hetarth02 commented 1 year ago

https://github.com/lucaong/minisearch looks like a promising option. It'd probably be better to try it than lunr

Thanks for your suggestion, I will try to use this.

gtm-nayan commented 1 year ago

I was trying out minisearch and elasticlunr yesterday; @Hetarth02, you can continue from those branches if it saves you some setup time.

You'll need to be using Chrome for this btw since Firefox doesn't yet support module workers.

Hetarth02 commented 1 year ago

@gtm-nayan Thanks for your help. By the way, did you get any noticeable results from using minisearch? Also, if you want, we can coordinate and work on this together.

gtm-nayan commented 1 year ago

Minisearch gives a lot more results than our current setup, but I think that's due to the combineWith setting; changing it to "AND" reduces the number of results, but there's no way to do that on a per-field basis. Minisearch did improve the query originally mentioned in this issue, i.e. searching for adapter-static leads to the static site generation page. I haven't seen any glaring problems yet but still have to test other common queries.
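
For reference, a minimal sketch of the setting in question using MiniSearch's documented API (the field names are assumptions about how the docs would be chunked, not the actual branch code):

  import MiniSearch from 'minisearch';

  const miniSearch = new MiniSearch({
    fields: ['breadcrumbs', 'content'], // fields to index for search
    storeFields: ['href'] // fields returned with each result
  });

  // `blocks` would be the chunked docs, each with a unique `id` field
  miniSearch.addAll(blocks);

  // combineWith applies to the whole query; there's no per-field equivalent
  const results = miniSearch.search('adapter-static', { combineWith: 'AND' });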

boian-ivanov commented 1 year ago

There's an up-and-coming in-memory search engine, called Lyra, built from the ground up to be performant for full-text search. The project seems quite intuitive and the people behind it are constantly improving it. It might be worth giving it a shot for the docs 🤔

gtm-nayan commented 1 year ago

Here's a playground of sorts for lyra, now called orama, https://stackblitz.com/edit/stackblitz-starters-eraanr?file=index.mjs

run node index.mjs "query goes here"

It would be great if folks could help with the evaluation, i.e. compare the results it gives for something you searched recently against the current setup on kit.svelte.dev, and share the findings here.

karimfromjordan commented 1 year ago

I just tried it out:

❯ node index.mjs "ssr"
ssr
[
  '/docs/single-page-apps#prerendering-individual-pages',
  '/docs/types#public-types-server',
  '/docs/page-options#prerender-prerender-and-ssr',
  '/docs/routing#layout-layout-server-js',
  '/docs/page-options#csr',
  '/docs/routing#page-page-svelte',
  '/docs/types#public-types-ssrmanifest',
  '/docs/state-management#using-stores-with-context',
  '/docs/routing#layout-layout-js',
  '/docs/load#universal-vs-server-when-does-which-load-function-run'
]

In the current docs the first result is /docs/page-options#ssr which doesn't seem to be included here in the search results.

benmccann commented 1 year ago

Here's a playground of sorts for lyra, now called orama, https://stackblitz.com/edit/stackblitz-starters-eraanr?file=index.mjs

Amazing! Thank you @gtm-nayan!

I just tried it out: node index.mjs "ssr" In the current docs the first result is /docs/page-options#ssr which doesn't seem to be included here in the search results.

Hmm. That's funny. I just tried the command you shared and that page was the second result. Perhaps @gtm-nayan made some improvements.

I also tested against the string "adapter-static", which was the original one filed here and it returned the "Static Site Generation" section first as expected.

One query that could be better is "assets". I was hoping to see the asset handling page returned higher. Turning on stemming helped quite a bit, and boosting the breadcrumbs helped some as well, as shown below. I think we may be able to do even better by splitting the breadcrumb into fields like h1, h2, h3, so that we can give a higher boost to larger headings. Right now we can't do that, which makes it really hard to get the asset handling page to come back first, since other chunks have the term "assets" in their lower headings.

  // assuming Orama's standard entry point; `blocks` and `query` are defined elsewhere
  import { create, insertMultiple, search } from '@orama/orama';

  const index = await create({
    schema: {
      breadcrumbs: 'string[]',
      content: 'string',
    },
    components: {
      tokenizer: { language: 'english', stemming: true },
    },
  });

  await insertMultiple(index, blocks);

  const results = await search(index, {
    term: query,
    boost: {
      breadcrumbs: 2,
    },
  });
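
For illustration only, a hypothetical variant of the schema above with the breadcrumb split into per-level heading fields, reusing the same Orama calls (the field names and boost values are made up):

  // hypothetical sketch: split the breadcrumb into per-level heading fields
  // so larger headings can receive a larger boost (values are illustrative)
  const headingIndex = await create({
    schema: {
      h1: 'string',
      h2: 'string',
      h3: 'string',
      content: 'string',
    },
    components: {
      tokenizer: { language: 'english', stemming: true },
    },
  });

  await insertMultiple(headingIndex, blocks); // blocks would carry h1/h2/h3 fields

  const headingResults = await search(headingIndex, {
    term: query,
    boost: {
      h1: 4,
      h2: 3,
      h3: 2,
    },
  });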

Orama is vastly better than flexsearch from an API perspective. I love how easy it is to boost a field, which I couldn't figure out how to do with flexsearch, if I recall correctly.

Another thing I noticed is that we may want to do something to put the migration guide and possibly the types towards the end of the search results. I know we at least discussed that with the current doc search, but can't remember if we implemented it or not.

A small thing I noticed independent of which library we use is that we divide most of the articles into very small chunks, but then leave the config page as a single chunk. That could be worth tweaking.

Finally, something I just noticed in their docs is that they have a grouping functionality. I know we do some grouping on the results after they're returned, so it might be interesting to see if this feature would be useful to us: https://docs.oramasearch.com/usage/search/grouping

Overall, I'd love to switch to Orama. It seems way easier to use, so if we need to make any tweaks I'm a lot more confident we'll be able to do that. Also, my questions in the flexsearch repo have gone unanswered and flexsearch has no commits this year whereas orama seems much more actively developed.

PuruVJ commented 1 year ago

I'm a bit skeptical about Orama based on this https://github.com/oramasearch/orama/issues/76#issuecomment-1352410224

Unless this has been resolved somehow?

benmccann commented 1 year ago

That doesn't seem like a deal breaker to me. While it'd be nice if it took into account whether words are found consecutively, I'm not sure how often that would result in different search rankings, and there are other ways in which Orama's search is better. I would expect that having the scoring take into account whether a term is found in the first heading, second heading, or content would have a larger effect on search quality, and Orama beats flexsearch there.

I actually think the most interesting part of that post is that it lists several other options that I'm not sure we've investigated yet. The ones it lists as potentially performing better than Orama in that one particular benchmark (which is not terribly representative of actual usage) are bulksearch, jsii, wade, and js-search, so it might be worth checking those out as well.

gtm-nayan commented 1 year ago

Size would be a factor as well: the current search implementation on the kit site, including flexsearch and the components, is about 24kB minified; just the playground I linked above is about 55.7kB after minification.

benmccann commented 1 year ago

jsii says it's not maintained
js-search can't do per-field boosting
wade doesn't appear to do stemming, substring matching, or per-field boosting
bulksearch doesn't appear to do stemming or per-field boosting

I still think Orama is going to give us the best results, especially since the test in https://github.com/oramasearch/orama/issues/76#issuecomment-1352410224 is so unlike our use case. It was searching over all of Harry Potter, so there will be tons more matches for any query, and it doesn't have headings that can be used for per-field boosting.

Orama is larger, though it also uses a lot less memory. I don't think we're likely to find a single library that wins across all metrics.

benmccann commented 1 year ago

I filed an issue with a suggestion for making Orama a bit smaller: https://github.com/oramasearch/orama/issues/418

I think we could also mitigate it on our end by loading the search functionality in onMount so that we don't block the page load.
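
A rough sketch of that idea in a Svelte component's script block (the module path is hypothetical):

  import { onMount } from 'svelte';

  let search;

  onMount(async () => {
    // import the search module only after the page has mounted, so the
    // larger search bundle doesn't block the initial page load
    search = await import('$lib/search/search.js'); // hypothetical path
  });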

micheleriva commented 1 year ago

Hi @benmccann, Orama author here. Thanks a lot for considering Orama! Other frameworks and libraries are migrating to Orama for their documentation (think of Fastify, Platformatic, and more).

The benchmark posted above is not representative of the current status of Orama performance-wise, and it's based on an older version (when it was pre-1.0.0 and it was called Lyra). We're now over v1.0.0 with stable APIs and significant performance improvements.

It can work 100% client-side, so you own and manage your data.

I hear your concerns about the bundle size, and I'd love to take this as an opportunity to optimize it, starting from your use case.

You have all of my and my company's support for this. I'll continue the conversation in the Orama repo on your issue https://github.com/oramasearch/orama/issues/418 🙂

leeoniya commented 1 year ago

A bit late to the party, but I'll drop this here:

https://github.com/leeoniya/uFuzzy#a-biased-appraisal-of-similar-work

I'd be interested in benchmarking Orama vs Lyra. Lyra didn't come out very fast in my tests, though they're biased towards partial substring matches. Flexsearch is for sure the performance king if you can spare the RAM for a giant index.

benmccann commented 1 year ago

Sharing an update here from https://github.com/oramasearch/orama/issues/418:

We just published Orama v1.0.7 and went from 20kb to 13kb gzipped. If you import the search function only, it will cost around 4.88kb gzipped (it was around 11kb yesterday).

benmccann commented 1 year ago

@gtm-nayan your stackblitz isn't working for me anymore. Is there a way to go back to a working version of it?

gtm-nayan commented 1 year ago

Whoops, forgot to revert the changes after checking the bundle size. Fixed now.

PuruVJ commented 1 year ago

https://github.com/sveltejs/site-kit/pull/162

benmccann commented 3 days ago

I've worked quite a bit on search ranking problems professionally. The root of all our issues in coming up with a better solution here is that we have no feedback mechanism. The way all search engines work is by watching to see what the user clicked. Then you can test each search model you develop to see if it's putting the clicked result towards the top.

I don't think that training a search model has to be privacy invasive. We can do it without tracking any information related to who the user might be.
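
As a strawman, the feedback signal could be as small as this, recording only the query and the clicked result with no user identifier (the endpoint and payload shape are hypothetical):

  // log which result was clicked for a given query, anonymously;
  // sendBeacon avoids blocking the navigation to the clicked page
  function logSearchClick(query, clickedHref, rank) {
    navigator.sendBeacon(
      '/api/search-feedback', // hypothetical endpoint
      JSON.stringify({ query, clickedHref, rank })
    );
  }

  // a ranking change can then be evaluated by how often the clicked result
  // appears near the top for the same query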