typesense / typesense-docsearch-scraper

A fork of Algolia's awesome DocSearch Scraper, customized to index data in Typesense (an open source alternative to Algolia)
https://typesense.org/docs/guide/docsearch.html

Unexpected spaces in snippet around every character #35

Open Krinkle opened 1 year ago

Description

A web page containing QUnit.test('add', assert shows up in search result snippets as QUnit . test ( ' add ' , assert. Note the unexpected spaces around virtually every symbol. I believe this is most likely a side effect of each of these characters being wrapped in its own <span> in the HTML source. However, the source has no spaces around (most of) these characters.

Steps to reproduce

<code><span class="nx">QUnit</span><span class="p">.</span><span class="nx">test</span><span class="p">(</span><span class="dl">'</span><span class="s1">add</span><span class="dl">'</span><span class="p">,</span> <span class="nx">assert</span> <span class="o">=&gt;</span> <span class="p">{</span></code>
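The symptom can be reproduced outside the scraper with a minimal sketch. It assumes, without having confirmed this in the scraper's code, that the text of each matched element is built by joining its individual text nodes with a space. The names TextCollector, text_nodes, and SNIPPET are hypothetical and exist only for this illustration; only the Python standard library is used.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects every text node of an HTML fragment, in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def text_nodes(html):
    """Return the list of raw text nodes found in the fragment."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Shortened version of the markup from this report.
SNIPPET = (
    '<code><span class="nx">QUnit</span><span class="p">.</span>'
    '<span class="nx">test</span><span class="p">(</span>'
    '<span class="dl">\'</span><span class="s1">add</span>'
    '<span class="dl">\'</span><span class="p">,</span> '
    '<span class="nx">assert</span></code>'
)

nodes = text_nodes(SNIPPET)

# Assumed faulty strategy: strip each text node and join with a space.
spaced = " ".join(n.strip() for n in nodes if n.strip())

# Concatenating the nodes unchanged keeps the text as written.
joined = "".join(nodes)

print(spaced)
print(joined)
```

If the assumption holds, the first strategy yields the spaced-out snippet described above, while plain concatenation preserves the original code.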

I'm evaluating Typesense for use on https://api.jquery.com, https://qunitjs.com and other OpenJS sites. I run typesense/docsearch-scraper via GitHub Actions, and docsearch is configured with "text": "p,li,tr,pre" among the selectors. The code above is part of a regular paragraph or PRE tag.

(source: typense.yaml, source: /docsearch.config.json)

Expected Behavior

Inline elements such as <span>, <em>, <code>, and <strong> should not cause additional spaces to be injected into the indexed text. It is not uncommon for prose to emphasize, underline, strike through, superscript, or otherwise wrap only part of a word in markup. It is probably most common in content with syntax-highlighted source code.
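One way to get this behavior is to insert a separator only at block-level boundaries and to concatenate text across inline tags. The following is a rough sketch of that idea and not the scraper's actual implementation. BLOCK_TAGS, BlockAwareText, and extract_text are hypothetical names, and the tag list is illustrative rather than exhaustive.

```python
from html.parser import HTMLParser

# Illustrative set of tags treated as word boundaries (an assumption).
BLOCK_TAGS = {"p", "li", "tr", "td", "th", "pre", "div", "br"}


class BlockAwareText(HTMLParser):
    """Adds a space at block-level boundaries and none at inline tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def _boundary(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_starttag(self, tag, attrs):
        self._boundary(tag)

    def handle_endtag(self, tag):
        self._boundary(tag)

    def handle_data(self, data):
        self.parts.append(data)


def extract_text(html):
    """Return indexable text with runs of whitespace collapsed to one space."""
    parser = BlockAwareText()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Inline markup inside a word leaves the word intact; paragraphs stay separate.
print(extract_text("<p>un<em>believ</em>able</p><p>next</p>"))
```

Collapsing whitespace is a deliberate simplification for search indexing. It would flatten line breaks inside PRE content, which may or may not be acceptable for snippets.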

Metadata

Typesense Version: 0.24.1

OS: Debian 11 Bullseye