typesense / typesense-docsearch-scraper

A fork of Algolia's awesome DocSearch Scraper, customized to index data in Typesense (an open source alternative to Algolia)
https://typesense.org/docs/guide/docsearch.html

How to rank "word Foo" before foobar, fooz, fooquux, etc. #40

Closed Krinkle closed 1 year ago

Krinkle commented 1 year ago

Description

This is in preparation for jquery.com and other jQuery-adjacent sites. I've left the title vague, as I'm less interested in a specific solution and more in understanding the "right" way to do it, or what the closest thing is that we could improve to make it possible.

Some of the most popular parts of the jQuery API are its static methods, such as jQuery.ajax and jQuery.get, and classes like jQuery.Deferred. However, I'm struggling to get these into the first 5 results.

Expected Behavior

For jQuery.ajax to rank in the top 5 when searching for "ajax". For jQuery.get to rank high when searching for "get". For jQuery.Deferred to rank high when searching for "Deferred".

Actual Behavior

My guess is that Typesense places a lot of value on prefix matching. The queries "aja" and "ajax" lead to essentially the same results, presumably because both can complete to "ajaxComplete". This is great, but when a whole-word match exists (a word delimited by spaces or other word boundaries such as dots), I would love for that match to be considered more valuable. E.g. if there was a page called "My Aja", then that is perhaps more relevant to "aja" than "ajafoobar".

But maybe there is another way I should approach this? I noticed there is remnant configuration in docsearch-scraper relating to synonyms, but I couldn't find a way to use it. Is that currently supported? If so, would that work? E.g. could I special-case these dozen or so pages so that "ajax" is treated as a synonym in search queries and (also) matches "jQuery.ajax"? Or perhaps the other way around: could I ignore "jQuery" like a stopword in content, and thus index the page as if it were called "ajax" (which would presumably rank higher)?

Alternatively, if I gave these dozen pages a very high hardcoded ranking, I worry that that would make them appear even when searching for random words that happen to be on the page which seems likely to regress result quality.

[Screenshots comparing results in typesense-docsearch 3, typesense-minibar, and Algolia]

https://jquery.github.io/typesense-minibar/demo/compare--docsearch-3.html

With the way Typesense DocSearch v3 queries the Typesense API, the results for "ajax" are:

  1. jQuery Core 3.0 Upgrade Guide > paragraph "does not have impact on the ajax callbacks"
  2. jQuery Core 3.0 Upgrade Guide > heading "Ajax"
  3. jQuery Core 3.0 Upgrade Guide > heading "Breaking change: Special-case Deferred methods removed from jQuery.ajax"
  4. ajaxComplete event
  5. ajaxSuccess event

Page jQuery.ajax is absent.

I also tried with my typesense-minibar project (work in progress, demo at https://jquery.github.io/typesense-minibar/demo/), which differs in its use of group_by=url_without_anchor and sort_by=item_priority:desc. That fixes the issue of reporting the same page three times, but the results are otherwise very similar:

  1. ajaxComplete event
  2. ajaxSuccess event
  3. ajaxError event
  4. ajaxSend event
  5. ajaxStart event

Page jQuery.ajax is absent.
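
For readers comparing the two query styles above, here is a minimal sketch of how the extra group_by and sort_by parameters could be attached to a Typesense search request. The host, the collection name, and the query_by field list are assumptions for illustration and are not taken from this issue; only the two minibar parameters come from the description above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: these query_by field names are illustrative, not from the issue.
QUERY_BY = "hierarchy.lvl0,hierarchy.lvl1,content"


def build_search_url(host, collection, query, grouped=False):
    """Build a Typesense search URL; `grouped` adds the two minibar parameters."""
    params = {"q": query, "query_by": QUERY_BY}
    if grouped:
        # One result group per page, instead of one hit per heading/paragraph.
        params["group_by"] = "url_without_anchor"
        params["sort_by"] = "item_priority:desc"
    return f"{host}/collections/{collection}/documents/search?{urlencode(params)}"


# Hypothetical host and collection name.
plain_url = build_search_url("http://localhost:8108", "jquery_docs", "ajax")
grouped_url = build_search_url("http://localhost:8108", "jquery_docs", "ajax", grouped=True)

grouped_params = parse_qs(urlsplit(grouped_url).query)
print(grouped_params["group_by"], grouped_params["sort_by"])
```

Grouping only changes how hits are collapsed per page; it does not change which tokens match, which is why both variants still miss jQuery.ajax.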

Metadata

Typesense Version: 0.24.0

OS: Debian 11 Bullseye

jasonbosco commented 1 year ago

@Krinkle By default, Typesense removes all special characters when indexing content. So jQuery.ajax gets indexed as the single token jQueryajax, and since Typesense does a prefix search by default, searching for ajax doesn't return jQueryajax (ajax is not a prefix of that token).

Now, in the scraper, we've set - and _ as token separators. If you add . to that config, then jQuery.ajax will get indexed as jQuery and ajax and then searching for ajax will return jQuery.ajax.
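
The effect described above can be illustrated with a small, self-contained model. This is only a simplified sketch of the behaviour (separators split tokens, other special characters are dropped, queries match token prefixes), not Typesense's actual tokenizer; the function names are hypothetical.

```python
import re


def tokenize(text, token_separators=()):
    """Simplified model: separators split tokens, other special characters are dropped."""
    for sep in token_separators:
        text = text.replace(sep, " ")
    tokens = []
    for word in text.split():
        cleaned = re.sub(r"[^0-9A-Za-z]", "", word).lower()
        if cleaned:
            tokens.append(cleaned)
    return tokens


def prefix_match(query, tokens):
    """True if the query is a prefix of at least one indexed token."""
    q = query.lower()
    return any(token.startswith(q) for token in tokens)


# Scraper defaults mentioned above: "-" and "_" are separators, "." is not.
default_tokens = tokenize("jQuery.ajax", token_separators=("-", "_"))
# With "." added as a separator.
dotted_tokens = tokenize("jQuery.ajax", token_separators=("-", "_", "."))

print(default_tokens, prefix_match("ajax", default_tokens))
print(dotted_tokens, prefix_match("ajax", dotted_tokens))
```

In this model the default settings yield one token, jqueryajax, which "ajax" cannot prefix-match, while adding "." yields jquery and ajax as separate tokens.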

To customize this in the scraper, you want to add the following to the scraper config and re-run the scraper:

{
  "index_name": "abc",
  ...
  "custom_settings": {
    "token_separators": ["_", "-", "."] // <=== add . here
  }
}

Finally, to prioritize the page with all the jQuery.* terms, you want to set the page_rank field for those URLs appropriately as described here.
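
As a rough sketch of how the two suggestions could be combined, the snippet below generates a config fragment that sets token_separators and attaches a page_rank to selected start URLs. It assumes the scraper accepts page_rank on start_urls entries the way the upstream DocSearch scraper does; the URLs and rank values are hypothetical, and a real config needs further keys (selectors and so on) that are omitted here.

```python
import json


def build_scraper_config(index_name, ranked_pages):
    """Return a partial scraper config; `ranked_pages` maps URL -> page_rank."""
    return {
        "index_name": index_name,
        # Assumption: page_rank is set per start URL, as in upstream DocSearch.
        "start_urls": [
            {"url": url, "page_rank": rank} for url, rank in ranked_pages.items()
        ],
        "custom_settings": {
            # "." added so that jQuery.ajax is split into jQuery and ajax.
            "token_separators": ["_", "-", "."],
        },
    }


# Hypothetical URLs and rank values, for illustration only.
config = build_scraper_config(
    "abc",
    {
        "https://example.org/jQuery.ajax/": 5,
        "https://example.org/jQuery.get/": 5,
    },
)
print(json.dumps(config, indent=2))
```

The scraper has to be re-run after changing the config, since both settings only take effect at indexing time.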

Krinkle commented 1 year ago

@jasonbosco Thank you. That worked really well. I'd say the results now carry objectively higher quality in this regard than with Algolia!


jasonbosco commented 1 year ago

That’s great to hear!