typesense / typesense-docsearch-scraper

A fork of Algolia's awesome DocSearch Scraper, customized to index data in Typesense (an open source alternative to Algolia)
https://typesense.org/docs/guide/docsearch.html

Early position should have positive instead of negative effect on priority #37

Closed Krinkle closed 1 year ago

Krinkle commented 1 year ago

Description

When using typesense/docsearch-scraper in combination with Search API requests such as those made by typesense-docsearch.js, I found that the lead paragraph was surprisingly downranked, and thus rarely returned as the snippet in results.

As part of evaluating Typesense for use on https://api.jquery.com and https://qunitjs.com, I found that my local branch with Typesense often returned suboptimal results compared to the live site (still using Algolia).

The legacy Algolia scraper gave priority to headings and content earlier on the page using the position attribute. Anecdotally, this appears to be useful. I imagine it is common in documentation to use similar phrasing multiple times, and I feel the first mention is likely the more useful place to start reading the page.

My assumption is that Typesense has not (intentionally) inverted this logic. As such, I'm tentatively reporting this as a bug. I'm happy to hear otherwise!

Steps to reproduce

Two examples:

https://api.qunitjs.com/assert/propEqual/ contains:

https://qunitjs.com/intro/ contains:

Actual Behavior

| Query   | Typesense    | Algolia      |
|---------|--------------|--------------|
| compare | [Screenshot] | [Screenshot] |
| firefox | [Screenshot] | [Screenshot] |

Other information

Looking at the implementation, I believe these two chunks of code seem related to this problem.

https://github.com/typesense/typesense-docsearch-scraper/blob/0.6.0.rc1/scraper/src/strategies/algolia_settings.py#L61

            'customRanking': [
                'desc(weight.page_rank)',
                'desc(weight.level)',
                'asc(weight.position)'
            ],

I believe customRanking is no longer used, but historically this gave earlier positions (ascending) preference when rank/level are equal.
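To illustrate the tie-breaking order described by that setting, here is a minimal Python sketch. It is not code from the scraper or from Algolia; the records and the `custom_ranking_key` helper are hypothetical, and only the `weight` field names are taken from the snippet above.

```python
# Hypothetical records; only the 'weight' keys mirror the scraper's field names.
records = [
    {"name": "later paragraph", "weight": {"page_rank": 0, "level": 0, "position": 7}},
    {"name": "lead paragraph", "weight": {"page_rank": 0, "level": 0, "position": 1}},
    {"name": "heading", "weight": {"page_rank": 0, "level": 90, "position": 3}},
]


def custom_ranking_key(record):
    """Sort key emulating desc(page_rank), desc(level), asc(position)."""
    w = record["weight"]
    # Negate the descending criteria so a plain ascending sort orders them correctly.
    return (-w["page_rank"], -w["level"], w["position"])


ranked = sorted(records, key=custom_ranking_key)
names = [r["name"] for r in ranked]
print(names)
```

With equal page_rank and level, the record with the smaller position sorts first, so the lead paragraph comes ahead of the later paragraph.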

https://github.com/typesense/typesense-docsearch-scraper/blob/0.6.0.rc1/scraper/src/typesense_helper.py

        transformed_record['item_priority'] = transformed_record['weight']['page_rank'] * 1000000000 + \
                                              transformed_record['weight']['level'] * 1000 + \
                                              transformed_record['weight']['position']

I believe item_priority is what is used now. However, because the position is added to the total, content further down the page gets a higher priority, which effectively downranks the lead paragraph and other content earlier on the page.
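A small worked example may make the arithmetic concrete. This is a sketch under stated assumptions: the weight values are made up, `item_priority_added` copies the formula quoted above, and `item_priority_subtracted` is a hypothetical alternative shown only for comparison, not necessarily what the scraper does or will do.

```python
def item_priority_added(weight):
    # Same arithmetic as the 0.6.0.rc1 snippet quoted above.
    return (weight["page_rank"] * 1000000000
            + weight["level"] * 1000
            + weight["position"])


def item_priority_subtracted(weight):
    # Hypothetical alternative: subtract position so earlier content scores higher.
    return (weight["page_rank"] * 1000000000
            + weight["level"] * 1000
            - weight["position"])


# Made-up weights: same page_rank and level, different position on the page.
lead = {"page_rank": 0, "level": 30, "position": 1}
later = {"page_rank": 0, "level": 30, "position": 7}

# With sort_by item_priority:desc, the larger value wins.
print(item_priority_added(lead), item_priority_added(later))
print(item_priority_subtracted(lead), item_priority_subtracted(later))
```

With the added form, the later paragraph gets 30007 against 30001 for the lead paragraph, so a descending sort prefers the later one. With the subtracted form the values are 29999 and 29993, so the lead paragraph wins.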

jasonbosco commented 1 year ago

In your search request to Typesense, could you try adding sort_by: item_priority:desc as an additional parameter and see if that helps?

Krinkle commented 1 year ago

@jasonbosco I am using this:

    // https://typesense.org/docs/0.24.1/api/search.html
    const resp = await fetch(
      `${this.origin}/collections/${this.collection}/documents/search?` + new URLSearchParams({
        q: query,
        per_page: '5',
        // based on https://github.com/typesense/typesense-docsearch.js/blob/3.4.0/packages/docsearch-react/src/DocSearchModal.tsx#L193
        query_by: 'hierarchy.lvl0,hierarchy.lvl1,hierarchy.lvl2,hierarchy.lvl3,hierarchy.lvl4,hierarchy.lvl5,hierarchy.lvl6,content',
        include_fields: 'hierarchy.lvl0,hierarchy.lvl1,hierarchy.lvl2,hierarchy.lvl3,hierarchy.lvl4,hierarchy.lvl5,hierarchy.lvl6,content,url_without_anchor,url,id',
        highlight_full_fields: 'hierarchy.lvl0,hierarchy.lvl1,hierarchy.lvl2,hierarchy.lvl3,hierarchy.lvl4,hierarchy.lvl5,hierarchy.lvl6,content',
        // "group_by=url_without_anchor" requires typesense/docsearch-scraper:0.6.0.rc1
        group_by: 'url_without_anchor',
        group_limit: '1',
        sort_by: 'item_priority:desc',
        snippet_threshold: '8',
        highlight_affix_num_tokens: '12',
        'x-typesense-api-key': this.key,
      }),

https://github.com/Krinkle/typesense-minibar/blob/1beacff99b0fe609a91834b398c44b23a21ae5a5/typesense-minibar.js#L113

jasonbosco commented 1 year ago

I re-read your original report in detail and just realized what was happening. I've pushed out a potential fix in typesense/docsearch-scraper:0.6.0.rc2.

Could you give it a shot now?

Krinkle commented 1 year ago

@jasonbosco Worked perfectly. Thanks again!