Missing index.st (search plugin breaks parsing titles with `"` or `\`)

maphew commented 1 year ago

[x] I have searched the issues (including closed ones) and believe that this is not a duplicate.

Issue

When and how does the index get built? On two systems, Windows and Ubuntu, I've followed the instructions and verified that stork is installed and in PATH.

$ stork --version
Stork 1.6.0

...then launched pelican with pelican --autoreload --listen, and in the resultant preview browser window typed text in the search box.

The web page says "Error! Check the browser console.". Browser console says "Uncaught (in promise) undefined ". Shell console says:

Done: Processed 69 articles, 0 drafts, 0 hidden articles, 1 page, 0 hidden pages and 0 draft pages in 8.83 seconds.
Unable to find `/search-index.st` or variations:
/search-index.st.html
/search-index.st/index.html
/search-index.st

I've searched the file system for those files and they don't exist, so it looks like the index is not being built. How do I test and/or ensure the index is being built?
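One way to check for the index by script is a minimal sketch like the one below. It assumes the default `output` directory and the two file names that appear in this thread (`search.toml` and `search-index.st`); the helper name `find_search_artifacts` is hypothetical and not part of Pelican or the plugin.

```python
from pathlib import Path


def find_search_artifacts(output_dir):
    """Report which search-related files exist in a Pelican output directory."""
    out = Path(output_dir)
    # File names taken from this thread; adjust if your configuration differs.
    names = ("search.toml", "search-index.st")
    return {name: (out / name).is_file() for name in names}


if __name__ == "__main__":
    # "output" is an assumed default; pass your own OUTPUT_PATH if different.
    for name, present in find_search_artifacts("output").items():
        print(f"{name}: {'found' if present else 'missing'}")
```

If `search.toml` is present but `search-index.st` is not, that would suggest the Stork build step failed rather than the plugin being skipped.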

maphew commented 1 year ago

After reading #12 I added search to pelicanconf.py PLUGINS:

PLUGINS = ['m.htmlsanity', 'search']

Now Pelican exits immediately with the error:

 CRITICAL Exception: Search plugin reported Error: Couldn't read the          __init__.py:566
                    configuration file: Cannot parse config as TOML. Stork recieved
                    error: `expected newline, found an identifier at line 288 column
                    11`

The only toml file in the file system for my site is output/search.toml. Line 288 is:

title = ""dir \*1" returns unexpected files"

I could see \* being parsed improperly, but it's not at position 11. Likewise the double quotes, but they're not at that position either. d is at 11 if zero based, i if 1 based.

Lines 285 to 293 are:

[[input.files]]
path = "Other/dir_1_returns_unexpected_files.html"
url = "/Other/dir_1_returns_unexpected_files.html"
title = ""dir \*1" returns unexpected files"

[[input.files]]
path = "Linux/Fix_for_broken_wireless_after_suspend_resume.html"
url = "/Linux/Fix_for_broken_wireless_after_suspend_resume.html"
title = "Fix for broken wireless after suspend/resume"

maphew commented 1 year ago

Ahhh, I realised the d at position 11 is correct. The parser thinks there should be no more chars on the line as it's just processed what it thinks is a closing quote at position 10.
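The reasoning above can be illustrated with a deliberately simplified scanner. This is only a sketch of how a double-quoted string is delimited, not Stork's actual parser; the function name `read_basic_string` is hypothetical.

```python
def read_basic_string(line):
    """Return (value, leftover) for the first double-quoted string in line.

    Simplified on purpose: a backslash protects the next character,
    whereas a real TOML parser would also validate the escape itself.
    """
    start = line.index('"')
    i = start + 1
    while i < len(line):
        if line[i] == "\\":
            i += 2  # skip the backslash and the character it protects
            continue
        if line[i] == '"':
            return line[start + 1:i], line[i + 1:]
        i += 1
    raise ValueError("unterminated string")


# Line 288 of output/search.toml as quoted in this thread
line = r'title = ""dir \*1" returns unexpected files"'
value, leftover = read_basic_string(line)
print(repr(value))     # the string closes immediately, so the value is empty
print(repr(leftover))  # everything from "dir" onward is left over
```

Because the second quote closes the string right away, the rest of the line is unexpected input, which matches the "expected newline, found an identifier" message.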

Below is the source markdown that's tripping it up. If I remove this post then pelican-search works: it generates the index.st file I was missing, and the search box in the output web pages returns results.

---
title: "dir \*1" returns unexpected files
date:  17.10.2009
category: Other
tags:  other, cmd
summary: an unexpected quirk of `dir` in Windows CMD
---

These source lines crash pelican-search:

title: "dir \*1" returns unexpected files
title: `dir \*1` returns unexpected files
title: "'dir *1' returns unexpected files"

These source lines are okay:

title: 'dir *1' returns unexpected files
title: dir \\*1 returns unexpected files
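The pattern in these two lists can be reproduced with a small check. It is a sketch under the assumption that the title is copied verbatim between double quotes in `search.toml`, as the generated file quoted earlier suggests; the function `fits_basic_string` is hypothetical and only covers quotes and the backslash escapes defined in TOML v1.0.0.

```python
import string

SIMPLE_ESCAPES = set('btnfr"\\')  # \b \t \n \f \r \" \\
HEX_DIGITS = set(string.hexdigits)


def fits_basic_string(title):
    """True if title can sit unchanged between double quotes in TOML."""
    i = 0
    while i < len(title):
        ch = title[i]
        if ch == '"':
            return False  # an unescaped quote would close the string early
        if ch == "\\":
            nxt = title[i + 1:i + 2]
            if nxt in SIMPLE_ESCAPES:
                i += 2
                continue
            width = {"u": 4, "U": 8}.get(nxt)  # \uXXXX or \UXXXXXXXX
            digits = title[i + 2:i + 2 + width] if width else ""
            if width and len(digits) == width and set(digits) <= HEX_DIGITS:
                i += 2 + width
                continue
            return False  # e.g. \* is not a recognised escape
        i += 1
    return True


titles = [
    r'"dir \*1" returns unexpected files',
    r'`dir \*1` returns unexpected files',
    '''"'dir *1' returns unexpected files"''',
    "'dir *1' returns unexpected files",
    r'dir \\*1 returns unexpected files',
]
for t in titles:
    print("ok   " if fits_basic_string(t) else "crash", t)
```

Under that assumption, the first three titles fail (an unescaped `"` or the unrecognised escape `\*`) and the last two pass, which agrees with the behaviour reported above.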

justinmayer commented 1 year ago

I’m away from my desk at the moment, but I suspect that error is being returned by Stork itself and not the plugin.

maphew commented 1 year ago

Yes it does seem to be stork itself:

» stork build -i output\search.toml -o x
Error: Couldn't read the configuration file: Cannot parse config as TOML. Stork recieved error: `expected newline, found an identifier at line 288 column 11`

maphew commented 1 year ago

Referring to https://github.com/pelican-plugins/search/blob/8e0d75d9a805552121b187299633dd0155ecd32b/pelican/plugins/search/search.py#L76

title = {dumps(striptags(page.title))} might do the trick?

It seems to work in this standalone snippet anyway:

from jinja2.filters import do_striptags as striptags
from json import dumps

X1 = r'''""dir *1" returns unexpected files"'''
X2 = r'''`dir \*1` returns unexpected files'''
X3 = r'''"'dir *1' returns unexpected files"'''

def test(page_title):
    input_file = f"""
        [[input.files]]
        title = {dumps(striptags(page_title))}
    """
    return input_file

for x in (X1, X2, X3):
    print(test(x))

I got that json idea from https://stackoverflow.com/questions/17941109/escaping-quotes-in-jinja2, which was then reinforced by ChatGPT.
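One possible refinement of the `dumps` idea, offered as a sketch rather than the plugin's actual code: by default `json.dumps` escapes non-ASCII characters, and characters outside the Basic Multilingual Plane come out as surrogate pairs such as `\ud83d\ude00`. To my understanding, TOML requires `\u` escapes to be Unicode scalar values, so it may reject those pairs. Passing `ensure_ascii=False` sidesteps this by leaving such characters unescaped. The helper name `toml_basic_string` is hypothetical.

```python
import json


def toml_basic_string(text):
    """Quote text as a double-quoted string, escaping quotes and backslashes.

    ensure_ascii=False keeps non-ASCII characters literal instead of
    emitting \\uXXXX escapes (including surrogate pairs).
    """
    return json.dumps(text, ensure_ascii=False)


for title in ('"dir \\*1" returns unexpected files',
              "snowman \u2603 and \U0001F600"):
    print(f"title = {toml_basic_string(title)}")
```

The quoting of `"` and `\` is the same as with plain `dumps`; only the handling of non-ASCII text changes.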

justinmayer commented 1 year ago

Can you try your change locally and see whether your fix works? That would entail temporarily uninstalling the current release package, cloning the repo, making your changes to it, and using Pip's "editable install" function to install the plugin:

python -m pip uninstall pelican-search
git clone https://github.com/pelican-plugins/search.git ~/pelican-plugins/search
# [Make the code changes]
python -m pip install -e ~/pelican-plugins/search/

maphew commented 1 year ago

Yes, this does fix the problem on my machine! PR coming.

maphew commented 1 year ago

Closing, since either the accepted PR #15 or the alternative proposal in #23 will address the issue.

Missing index.st (search plugin breaks parsing titles with `"` or `\`) #20
