I'm evaluating Typesense for use on api.jquery.com, qunitjs.com, and other OpenJS sites. The setup uses typesense/docsearch-scraper together with a very minimal vanilla JavaScript client that calls the Typesense API's /collections/:collection/documents/search endpoint directly via the browser's fetch() method and passes the result to an HTML template.
I'd like to avoid returning multiple results from the same web page, i.e. under different headings. Rather, only the highest-ranking match on a given page should be returned. I initially used group_by: 'url', group_limit: '1', like the official typesense-docsearch.js does. However, the URL includes the anchor. This is great in many ways, as it keeps the client simple, but it also means that grouping by URL does not prevent multiple results from the same page, since https://example.org/foo#bar and https://example.org/foo#quux are technically different values.
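For reference, here is a minimal sketch of the kind of request described above. The function names, the `origin` and `collection` arguments, and the choice of `content` as the `query_by` field are illustrative assumptions, not taken from the actual client; the query parameters and the `X-TYPESENSE-API-KEY` header are the ones the Typesense search endpoint accepts.

```javascript
// Build the URL for Typesense's documents/search endpoint.
// `origin` and `collection` are hypothetical placeholders supplied by the caller.
function buildSearchUrl(origin, collection, query) {
  const params = new URLSearchParams({
    q: query,
    query_by: 'content',            // assumption: a searchable field in the scraper's schema
    include_fields: 'content,url',
    group_by: 'url',                // groups on the full URL, anchor included
    group_limit: '1',               // at most one hit per group
  });
  return `${origin}/collections/${encodeURIComponent(collection)}/documents/search?${params}`;
}

// Send the request and return one document per group.
async function search(origin, collection, apiKey, query) {
  const resp = await fetch(buildSearchUrl(origin, collection, query), {
    headers: { 'X-TYPESENSE-API-KEY': apiKey },
  });
  if (!resp.ok) {
    throw new Error(`Search failed: HTTP ${resp.status}`);
  }
  const data = await resp.json();
  // Each group corresponds to one distinct `url` value.
  return data.grouped_hits.map((group) => group.hits[0].document);
}
```

Because the group key is the full URL, two headings on the same page still end up in two separate groups.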
Actual Behavior
document.id
Looking at the returned data.grouped_hits[#].hits[#].document object, it seemed that document.id is the only available property that is both publicly supported and unique to each document within a collection. By "publicly supported", I mean that it can be (and is, by the official typesense-docsearch.js client) returned by asking for it via include_fields: '…, content,url,id'.
However, passing it as group_by results in HTTP 400 Bad Request:
{"message": "Cannot use `id` as a group by field."}
Upon closer inspection, I realize that id would not help here anyway: each scraped element is stored as a separate "document" with its own id, so the value is not shared across portions of the same web page.
document.url_without_anchor
This looks much more promising. However, specifying this in include_fields and group_by leads to HTTP 404 Not Found:
{"message": "Could not find a field named `url_without_anchor` in the schema."}
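Until grouping by page works server-side, the desired behavior can be approximated in the client. The following is only a sketch of such a fallback, not something the scraper or typesense-docsearch.js provides: `stripAnchor` and `firstHitPerPage` are hypothetical helper names, and the code assumes the response lists groups in ranking order and that every document carries a `url` field.

```javascript
// Drop the fragment ("#bar") so that all anchors on one page share the same key.
function stripAnchor(url) {
  const u = new URL(url);
  u.hash = '';
  return u.href;
}

// Keep only the first document seen for each page.
// Assumes `groupedHits` is ordered from best to worst match.
function firstHitPerPage(groupedHits) {
  const seen = new Set();
  const results = [];
  for (const group of groupedHits) {
    for (const hit of group.hits) {
      const page = stripAnchor(hit.document.url);
      if (!seen.has(page)) {
        seen.add(page);
        results.push(hit.document);
      }
    }
  }
  return results;
}
```

Filtering after the fact means a response may yield fewer results than the requested page size, which is one reason grouping on the server would be preferable.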
Regarding the "Cannot use `id` as a group by field." error: git-blame indicates this check was introduced a few months ago as part of https://github.com/typesense/typesense/commit/cf908fb357ecd2e7b4933e3266d46011365637a3, although the context suggests that grouping by id was already unsupported before that.
Metadata
Typesense Version: 0.24.1.
OS: Debian 11 Bullseye.