Closed Krinkle closed 1 year ago
@Krinkle By default, Typesense removes all special characters when indexing content. So `jQuery.ajax` gets indexed as `jQueryajax`, and since Typesense does a prefix search by default, searching for `ajax` doesn't return `jQueryajax`.
Now, in the scraper, we've set `-` and `_` as token separators. If you add `.` to that config, then `jQuery.ajax` will get indexed as `jQuery` and `ajax`, and then searching for `ajax` will return `jQuery.ajax`.
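The effect of adding `.` as a token separator can be illustrated with a small sketch. This is a simplified model of the behavior described above, not Typesense's actual tokenizer: the `tokenize` and `prefix_match` helpers are hypothetical, and lowercasing tokens is an assumption made here for case-insensitive matching.

```python
import re

def tokenize(text, token_separators=()):
    """Toy tokenizer: split on whitespace and configured separators,
    then drop any remaining special characters (assumed behavior)."""
    # Treat every configured separator as a token boundary.
    for sep in token_separators:
        text = text.replace(sep, " ")
    tokens = []
    for word in text.split():
        # Strip characters that are not letters or digits, then lowercase.
        cleaned = re.sub(r"[^A-Za-z0-9]", "", word).lower()
        if cleaned:
            tokens.append(cleaned)
    return tokens

def prefix_match(query, tokens):
    """A query matches if it is a prefix of at least one indexed token."""
    q = query.lower()
    return any(tok.startswith(q) for tok in tokens)

# Separators as set in the scraper today: "-" and "_".
default_tokens = tokenize("jQuery.ajax", token_separators=["-", "_"])
# Same input, with "." added as a separator.
dotted_tokens = tokenize("jQuery.ajax", token_separators=["-", "_", "."])

print(default_tokens, prefix_match("ajax", default_tokens))
print(dotted_tokens, prefix_match("ajax", dotted_tokens))
```

In this model, the first case yields a single token that starts with "jquery", so the query "ajax" is not a prefix of anything. In the second case "ajax" is its own token and matches.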
To customize this in the scraper, you want to add the following to the scraper config and re-run the scraper:
{
  "index_name": "abc",
  ...
  "custom_settings": {
    "token_separators": ["_", "-", "."] // <=== add . here
  }
}
Finally, to prioritize the page with all the `jQuery.*` terms, you want to set the `page_rank` field for those URLs appropriately as described here.
@jasonbosco Thank you. That worked really well. I'd say the results now carry objectively higher quality in this regard than with Algolia!
That’s great to hear!
Description
This is in preparation for jquery.com and other jQuery-adjacent sites. I've left the title vague as I'm less interested in a specific solution and more in understanding the "right" way to do it, or what the closest thing is that we could improve to make it possible.
Some of the most popular parts of the jQuery API are its static methods, such as `jQuery.ajax` and `jQuery.get`, or classes like `jQuery.Deferred`. However, I'm struggling to get these into the first 5 results.
Expected Behavior
For `jQuery.ajax` to rank in the top 5 when searching for "ajax". For `jQuery.get` to rank high when searching for "get". For `jQuery.Deferred` to rank high when searching for "Deferred".
Actual Behavior
My guess is that Typesense places a lot of value on prefix matching. Searching for "aja" and "ajax" returns basically the same results, presumably because both can complete to "ajaxComplete". This is great, but when a whole-word match exists (a word between spaces or other word boundaries like dots), I would love for that match to be considered more valuable. E.g. if there was a page called "My Aja", then that is perhaps more relevant to "aja" than "ajafoobar".
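To make the request concrete, here is a minimal sketch of the ordering I have in mind. It is only an illustration of the desired behavior, not how Typesense ranks results; the `rank` function and the sample titles are hypothetical.

```python
def rank(query, titles):
    """Order matching titles so that whole-word matches come before
    titles that only match the query as a prefix of a longer word."""
    q = query.lower()

    def score(title):
        words = title.lower().split()
        if q in words:
            return 2  # whole-word match
        if any(word.startswith(q) for word in words):
            return 1  # prefix-only match
        return 0      # no match

    matches = [title for title in titles if score(title) > 0]
    # Higher score first; ties keep their original order.
    return sorted(matches, key=score, reverse=True)

ranked = rank("aja", ["ajafoobar", "My Aja", "Unrelated"])
print(ranked)
```

Here "My Aja" is placed ahead of "ajafoobar" for the query "aja", and "Unrelated" is dropped because it matches neither way.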
But maybe there is another way I should approach this? I noticed there is remnant configuration in docsearch-scraper relating to synonyms, but I couldn't find a way to use it. Is that currently supported? If so, would that work? E.g. could I special-case these dozen or so pages so that "ajax" is considered a synonym in search queries for (also) matching "jQuery.ajax"? Or perhaps the other way around: could I ignore "jQuery" like a stopword in content and thus index it as if the page was called "ajax" (which would presumably rank higher)?
Alternatively, if I gave these dozen pages a very high hardcoded ranking, I worry that that would make them appear even when searching for random words that happen to be on the page which seems likely to regress result quality.
https://jquery.github.io/typesense-minibar/demo/compare--docsearch-3.html
With the way Typesense DocSearch v3 queries the Typesense API, the page `jQuery.ajax` is absent from the results for "ajax".
I also tried with my typesense-minibar project (work in progress), which differs in its use of `group_by=url_without_anchor` and `sort_by=item_priority:desc`: https://jquery.github.io/typesense-minibar/demo/. That fixes the issue of reporting the same page three times, but is otherwise very similar: the page `jQuery.ajax` is absent.
Metadata
Typesense Version: 0.24.0
OS: Debian 11 Bullseye
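For reference, the typesense-minibar query described under Actual Behavior can be sketched roughly as follows. The host, the collection name `jquery_docs`, and the `query_by` field list are assumptions for illustration; only `group_by=url_without_anchor` and `sort_by=item_priority:desc` come from the description above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_search_url(base_url, collection, query):
    """Build a Typesense search URL with the grouping and sorting
    parameters mentioned in this issue."""
    params = {
        "q": query,
        # Assumed DocSearch-style fields; adjust to the real schema.
        "query_by": "hierarchy.lvl0,hierarchy.lvl1,content",
        # Collapse multiple hits on the same page into one group.
        "group_by": "url_without_anchor",
        "sort_by": "item_priority:desc",
    }
    return f"{base_url}/collections/{collection}/documents/search?{urlencode(params)}"

# Hypothetical host and collection name.
url = build_search_url("http://localhost:8108", "jquery_docs", "ajax")
print(url)
```

An actual request would also need a search-only API key sent in the `X-TYPESENSE-API-KEY` header.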