typesense / typesense-docsearch-scraper

A fork of Algolia's awesome DocSearch Scraper, customized to index data in Typesense (an open source alternative to Algolia)
https://typesense.org/docs/guide/docsearch.html

Avoid multiple results from the same webpage #36

Closed by Krinkle 1 year ago

Krinkle commented 1 year ago

Description

I'm evaluating Typesense for use on api.jquery.com, qunitjs.com and other OpenJS sites. The setup uses typesense/docsearch-scraper together with a very minimal vanilla JavaScript client, which calls the Typesense API's /collections/:collection/documents/search endpoint directly via the browser's fetch() method and passes the result to an HTML template.

I'd like to avoid returning multiple results from the same web page, i.e. under different headings. Instead, only the highest-ranking match on a given page should be returned. I initially used group_by: 'url', group_limit: '1', like the official typesense-docsearch.js does. However, the URL includes the anchor. This is great in many ways, as it keeps the client simple, but it also means that grouping by URL does not prevent multiple results from the same page, since https://example.org/foo#bar and https://example.org/foo#quux are technically different values.
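To illustrate the problem, here is a minimal local simulation of grouping on an exact field value. It does not call Typesense. The helper name `groupByField` and the sample documents are made up for this sketch, and it only assumes that grouping compares the stored value as-is.

```javascript
// Hypothetical helper: bucket documents by the exact value of one field,
// roughly what grouping on `url` amounts to.
function groupByField(docs, field) {
  const groups = new Map();
  for (const doc of docs) {
    const key = doc[field];
    if (!groups.has(key)) {
      groups.set(key, []);
    }
    groups.get(key).push(doc);
  }
  return groups;
}

// Two scraped headings from the same page (sample data, not real output).
const sampleDocs = [
  { url: 'https://example.org/foo#bar' },
  { url: 'https://example.org/foo#quux' },
];

// The anchors make the two values distinct, so each document
// ends up in its own group and both would be returned.
const groupsByUrl = groupByField(sampleDocs, 'url');
console.log(groupsByUrl.size); // 2
```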

Actual Behavior

document.id

Looking at the returned data.grouped_hits[#].hits[#].document object, document.id seemed to be the only available property that is both publicly supported and unique to a single document within a collection. By "publicly supported", I mean that it can be returned by asking for it via include_fields: '…, content,url,id' (as the official typesense-docsearch.js client does).

However, grouping by id fails with HTTP 400 Bad Request:

{"message": "Cannot use `id` as a group by field."}

Git blame indicates this check was introduced a few months ago as part of https://github.com/typesense/typesense/commit/cf908fb357ecd2e7b4933e3266d46011365637a3, although the context suggests that grouping by id was not supported before that either.

Upon closer inspection, I realize that document.id would not have helped anyway: it is not shared across portions of the same webpage, because each scraped element is stored as a separate "document".

document.url_without_anchor

This looks much more promising. However, specifying this in include_fields and group_by leads to HTTP 404 Not Found:

{"message": "Could not find a field named `url_without_anchor` in the schema."}

Metadata

Typesense Version: 0.24.1.

OS: Debian 11 Bullseye.

jasonbosco commented 1 year ago

@Krinkle I've added the ability to group on url_without_anchor in typesense/docsearch-scraper:0.6.0.rc1. Could you give it a shot now?
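For reference, a minimal sketch of what such a client request could look like with fetch(). The function names, host, collection name and API key are placeholders. `query_by: 'content'` is an assumption about the scraped schema, and the sketch assumes the collection was indexed with a scraper version that supports grouping on url_without_anchor.

```javascript
// Build the search URL for GET /collections/:collection/documents/search.
// `host` and `collection` are placeholders supplied by the caller.
function buildSearchUrl(host, collection, query) {
  const params = new URLSearchParams({
    q: query,
    query_by: 'content',            // assumption: searchable field in the scraped schema
    include_fields: 'content,url,id',
    group_by: 'url_without_anchor', // all headings of one page share this value
    group_limit: '1',               // keep only the top hit per page
  });
  return `${host}/collections/${encodeURIComponent(collection)}/documents/search?${params}`;
}

// Reduce a grouped search response to one document per page.
function topHitPerPage(data) {
  return (data.grouped_hits || []).map((group) => group.hits[0].document);
}

async function search(host, collection, apiKey, query) {
  const response = await fetch(buildSearchUrl(host, collection, query), {
    headers: { 'X-TYPESENSE-API-KEY': apiKey },
  });
  const data = await response.json();
  if (!response.ok) {
    // Errors arrive as {"message": "..."}, as in the 400 and 404 responses above.
    throw new Error(`HTTP ${response.status}: ${data.message}`);
  }
  return topHitPerPage(data);
}
```

The `url` field of each returned document still carries the anchor, so the client can link straight to the best-matching heading on each page.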

Krinkle commented 1 year ago

@jasonbosco Works perfectly. Thank you!

jasonbosco commented 1 year ago

Awesome, thank you for confirming!