ic-labs / django-icekit

GLAMkit is a next-generation Python CMS by the Interaction Consortium, designed especially for the cultural sector.
http://glamkit.com
MIT License

Ensure Text content paragraphs remain separate after search indexing #294

Open jmurty opened 6 years ago

jmurty commented 6 years ago

I have seen a case (with AGSA/Tarnanthi) where text entered on a page as a Text content item (stored as HTML behind the scenes) becomes unsearchable, because separate paragraphs are concatenated in the text document created during search indexing.

For example, a Text content item with the HTML content "<p>This is a tricky</p><p>test</p>" can be converted to "This is a trickytest", with no whitespace between "tricky" and "test", by the default ICEkit search document template icekit/templates/search/indexes/icekit/default.txt. Subsequent searches for the words "tricky" and "test" may then fail to find the page containing this Text content, depending on the word-stemming rules used on a site.

I think this is caused by the striptags filter used in that template, combined with HTML content generated by the Text widget without any newlines between HTML markup tags.
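To illustrate the suspected cause, here is a minimal sketch using only the Python standard library. The helper strip_tags_sketch is a hypothetical stand-in that approximates what a tag-stripping filter does (drop markup, keep text); it is not Django's actual striptags implementation, and the real filter may differ in details.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML fragment, dropping all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags_sketch(html):
    # Hypothetical stand-in for a tag-stripping filter: text nodes are
    # joined with nothing between them, as happens when tags are removed.
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


html = "<p>This is a tricky</p><p>test</p>"
text = strip_tags_sketch(html)

print(text)                    # This is a trickytest
print("test" in text.split())  # False: "test" is no longer a separate word
```

Because no whitespace separates the two paragraphs in the markup, removing the tags fuses the last word of one paragraph with the first word of the next.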

This is probably best fixed by ensuring that </p> paragraph end tags generated by the Text component include a trailing newline character.
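A rough sketch of that idea follows, again using only the standard library. Both add_paragraph_newlines and strip_tags_sketch are hypothetical names invented for this illustration; where such a step would actually live in ICEkit (for example when the Text content is saved or when it is rendered for indexing) is not settled here.

```python
import re
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML fragment, dropping all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags_sketch(html):
    # Hypothetical stand-in for a tag-stripping filter.
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


def add_paragraph_newlines(html):
    # Append a newline after every </p> that is not already followed by one.
    return re.sub(r"(</p\s*>)(?!\n)", r"\1\n", html, flags=re.IGNORECASE)


html = "<p>This is a tricky</p><p>test</p>"
fixed = add_paragraph_newlines(html)
text = strip_tags_sketch(fixed)

print(repr(fixed))  # '<p>This is a tricky</p>\n<p>test</p>\n'
print(text.split())  # ['This', 'is', 'a', 'tricky', 'test']
```

The newline survives tag stripping as whitespace, so "tricky" and "test" remain separate tokens. The negative lookahead keeps the function idempotent, so applying it to already-fixed content does not add extra blank lines.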