typesense / typesense-docsearch-scraper

A fork of Algolia's awesome DocSearch Scraper, customized to index data in Typesense (an open source alternative to Algolia)

Unexpected spaces in snippet around every character #35

Open Krinkle opened 1 year ago

Krinkle commented 1 year ago


A web page containing QUnit.test('add', shows up in search result snippets as QUnit . test ( ' add ' , assert, with unexpected spaces around virtually every symbol. I believe this is most likely a side effect of each of these characters being wrapped in its own <span> in the source HTML. However, the source has no spaces around most of these characters.

Steps to reproduce

<code><span class="nx">QUnit</span><span class="p">.</span><span class="nx">test</span><span class="p">(</span><span class="dl">'</span><span class="s1">add</span><span class="dl">'</span><span class="p">,</span> <span class="nx">assert</span> <span class="o">=&gt;</span> <span class="p">{</span></code>

I'm evaluating Typesense for use on https://api.jquery.com, https://qunitjs.com and other OpenJS sites. I've used typesense/docsearch-scraper via GitHub Actions, and docsearch is configured with "text": "p,li,tr,pre" among the selectors. The above code is part of a regular paragraph or PRE tag.

(source: typense.yaml, source: /docsearch.config.json)

Expected Behavior

Inline elements like <span>, <em>, <code>, and <strong> should not cause additional spaces to be injected into the indexed text. It is not uncommon for prose to emphasize, underline, strike, superscript, or otherwise wrap only part of a word in markup. This is probably most common in content with syntax-highlighted source code.
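
To illustrate the difference, here is a minimal Python sketch using only the standard library. It is not the scraper's actual code, and the helper names (TextCollector, extract_text) are made up for this example. It assumes the extra spaces come from joining an element's text nodes with a space separator; joining them with an empty string instead preserves the original spacing.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects text nodes in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called once per text node between tags (entities already decoded).
        self.chunks.append(data)


def extract_text(html, separator):
    """Join all text nodes of an HTML fragment with the given separator."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return separator.join(parser.chunks)


# The markup from the steps to reproduce.
SNIPPET = (
    '<code><span class="nx">QUnit</span><span class="p">.</span>'
    '<span class="nx">test</span><span class="p">(</span>'
    '<span class="dl">\'</span><span class="s1">add</span>'
    '<span class="dl">\'</span><span class="p">,</span> '
    '<span class="nx">assert</span> <span class="o">=&gt;</span> '
    '<span class="p">{</span></code>'
)

# Assumed cause: text nodes joined with a space, then whitespace collapsed.
spaced = " ".join(extract_text(SNIPPET, " ").split())

# Expected behavior: inline elements contribute no separator of their own.
joined = extract_text(SNIPPET, "")

print(spaced)  # QUnit . test ( ' add ' , assert => {
print(joined)  # QUnit.test('add', assert => {
```

Under that assumption, a fix would add a separator only at block-level element boundaries (such as p, li, tr, pre) and concatenate the text of inline elements directly.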


Typesense Version: 0.24.1

OS: Debian 11 Bullseye