trilbymedia / grav-plugin-tntsearch

Powerful indexed-based full text search engine powered by the TNTSearch library
https://trilby.media
MIT License

Search can't find text having soft hyphens and/or ligature control characters #133

Open Moonbase59 opened 10 months ago

Moonbase59 commented 10 months ago

Today I was scratching my head because TNT Search didn’t find pages that definitely contain my search words, until I realized that most of my text contains soft hyphens, which give the renderer hyphenation hints for our looong German words:

[Screenshot, 2023-11-21: Nite Radio blog post "Läuft" (/blog/nite-radio-laeuft)]

Now of course the search won’t find something like Text&shy;datei, or a "Textdatei" typed with an invisible U+00AD after "Text", and a user cannot know how I hyphenated my text.

It’s even worse with ligatures, which are heavily used in German Fraktursatz as well as in Arabic/Persian/Indian languages, to control how a word actually looks. This is mostly done using Unicode U+200C (zero-width non-joiner) and U+200D (zero-width joiner).
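To make the mismatch concrete, here is a minimal standalone sketch (plain PHP 7+, independent of Grav and TNTSearch; the variable names are only for illustration). It shows that a word containing one of these invisible characters is a different byte sequence from the word a visitor would type, so a plain comparison or substring lookup does not match:

```php
<?php
// "\u{...}" is PHP 7+ syntax for a Unicode code point inside a double-quoted string.
$indexed = "Text\u{00AD}datei"; // as written by the author, with a soft hyphen
$query   = "Textdatei";         // as typed by a visitor

var_dump($indexed === $query);                // bool(false)
var_dump(strlen($indexed));                   // int(11): U+00AD takes two bytes in UTF-8
var_dump(strlen($query));                     // int(9)
var_dump(strpos($indexed, $query) === false); // bool(true): no substring match either

// The zero-width non-joiner behaves the same way (three bytes in UTF-8).
$ligature = "Auf\u{200C}lage";
var_dump(strlen($ligature));                  // int(10), versus 7 for "Auflage"
```

How TNTSearch itself tokenizes such words is not shown here; the point is only that the stored text and the typed query differ.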

Here’s my proposal for better search:

Since we’re already "cleaning" the searched pages in getCleanContent() (in file user/plugins/tntsearch/classes/GravTNTSearch.php), we might as well remove these in-word Unicode control characters before looking for a match.

I have tried this here, using Grav v1.7.43, Admin v1.10.43, TNT Search v3.4.0, and it works well, just by adding:

// 2023-11-21 MCH - Remove some in-word Unicode that regularly breaks searches
$problematic = [
    '/&shy;/i', '/&#173;/', '/&#xAD;/i', '/\x{00AD}/u', // soft hyphen
    '/&zwj;/i', '/&#8205;/', '/&#x200D;/i', '/\x{200D}/u', // zero-width joiner
    '/&zwnj;/i', '/&#8204;/', '/&#x200C;/i', '/\x{200C}/u', // zero-width non-joiner
];
$content = preg_replace($problematic, '', $content) ?? $content;

in getCleanContent(). As you can see, we have to check the four most common variants of each character (named entity, decimal reference, hexadecimal reference, and the literal character), since article editors could use any of them in their Markdown text. Some lucky ones even have keyboards with these characters on them.
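As a variation on the snippet above, the twelve patterns could be folded into one expression. The following is only a sketch under assumptions: `stripInWordControls()` is a hypothetical standalone helper, not an existing plugin method, and unlike the original list it also accepts leading zeros in hexadecimal references (e.g. `&#x00AD;`).

```php
<?php
/**
 * Hypothetical helper (not part of the TNTSearch plugin): removes soft hyphens,
 * zero-width joiners and zero-width non-joiners, whether written as named
 * entities, numeric character references, or literal characters.
 */
function stripInWordControls(string $content): string
{
    $pattern = '/&(?:shy|zwj|zwnj);'             // named entities
             . '|&#(?:173|8204|8205);'           // decimal references
             . '|&#x0*(?:AD|200C|200D);'         // hex references, optional leading zeros
             . '|[\x{00AD}\x{200C}\x{200D}]/iu'; // literal characters

    // In /u mode preg_replace() returns null for invalid UTF-8 input,
    // so fall back to the untouched content in that case.
    return preg_replace($pattern, '', $content) ?? $content;
}

var_dump(stripInWordControls("Text&shy;datei"));    // string(9) "Textdatei"
var_dump(stripInWordControls("Brot\u{200C}zeit"));  // string(8) "Brotzeit"
var_dump(stripInWordControls("Auf&#x200c;lage"));   // string(7) "Auflage"
```

A single pattern means one pass over the content instead of twelve, at the cost of being a little harder to read than the explicit list.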

I guess this change will improve the TNT Search Plugin a lot, since it can then find text even if it has been typographically enhanced on the web site. Of course one couldn’t search for the replaced entities anymore (like &shy;), but that shouldn’t be a problem, I think.

Strictly speaking, user input from the search box should also have these removed, but I assume a website user would probably never enter a soft hyphen or ligature control in the search box. At least I wouldn’t enter Text&shy;datei, Brot&zwnj;zeit or Auf&zwnj;lage (or use the invisible keys) but would instead search for a simple textdatei, brotzeit or auflage:
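If the query side were to be cleaned as well, a sketch could look like the following. `cleanSearchQuery()` is a hypothetical name, and where the plugin would call it is not covered in this issue. Only the literal characters are handled, on the assumption that nobody types HTML entities into a search box.

```php
<?php
// Hypothetical helper: strip literal soft hyphens and zero-width (non-)joiners
// from a search query before it is passed on to the search engine.
function cleanSearchQuery(string $query): string
{
    // Fall back to the original query if preg_replace() fails (e.g. invalid UTF-8).
    return preg_replace('/[\x{00AD}\x{200C}\x{200D}]/u', '', $query) ?? $query;
}

var_dump(cleanSearchQuery("Brot\u{200C}zeit") === 'Brotzeit'); // bool(true)
var_dump(cleanSearchQuery('auflage') === 'auflage');           // bool(true)
```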

[Screenshot, 2023-11-21: search ("Suche") page on Nite Radio]

If there are no objections, I could prepare a pull request.

rhukster commented 10 months ago

Pull requests are always welcome. Cheers.

Moonbase59 commented 10 months ago

Done for your testing. Didn’t touch any input handlers or version numbers. Let me know!