pelican-plugins / search

Pelican plugin that adds site search capability

Write `search.toml` with UTF-8 encoding #34

Closed Tseing closed 1 year ago

Tseing commented 1 year ago

Pull Request Checklist

Resolves: #33

I added a unit test to check the encoding of search.toml.

def test_output_options_encoding(self, mocker: MockerFixture):
    mocker.patch(
        "pelican.plugins.search.SearchSettingsGenerator.get_input_files",
        return_value=[
            {
                "path": "content/utf-8.md",
                "url": "https://blog.example.com/utf-8",
                "title": "†öζ好Üس⚡⚽",
            }
        ],
    )
    generator = SearchSettingsGenerator(
        context={},
        settings={},
        path=None,
        theme=None,
        output_path="output",
    )
    try:
        generator.generate_stork_settings(Path("utf-8-foo"))
    except Exception:
        os.remove("utf-8-foo")
        raise UnicodeError
    else:
        with open(Path("utf-8-foo"), "rb") as f:
            detect = chardet.detect(f.read())
        os.remove("utf-8-foo")
        assert detect["encoding"] == "utf-8"

The search settings themselves are safe, because they consist only of Latin letters. Non-English characters can occur only in the title (non-English URLs are uncommon), so I constructed a UTF-8 test string for it. The string is composed of a Greek letter, a Chinese character, an Arabic letter, emoji, and so on. The unit test creates a file named utf-8-foo and checks its encoding.
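As a side note on the encoding check itself: the test above uses chardet to guess the encoding, but a strict decode is another way to confirm that a file is valid UTF-8. The sketch below is only an illustration under that assumption. The helper name `is_valid_utf8` and the file names are hypothetical and not part of the plugin or this PR.

```python
import tempfile
from pathlib import Path


def is_valid_utf8(path: Path) -> bool:
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        path.read_bytes().decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# The same mixed-script title used in the test above.
TITLE = "†öζ好Üس⚡⚽"

with tempfile.TemporaryDirectory() as tmp:
    # A file written explicitly as UTF-8 passes the check.
    good = Path(tmp) / "good.toml"
    good.write_text(f'title = "{TITLE}"\n', encoding="utf-8")

    # GBK bytes for the Chinese character are not a valid UTF-8 sequence.
    bad = Path(tmp) / "bad.toml"
    bad.write_bytes("好".encode("gbk"))

    utf8_ok = is_valid_utf8(good)
    gbk_ok = is_valid_utf8(bad)
```

A strict decode gives a yes/no answer without a detection heuristic, at the cost of not reporting which other encoding the file might be in.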

A non-UTF-8 file causes many problems, because Pelican uses UTF-8 to organize text and improve language compatibility. Pelican cannot process a non-UTF-8 file and, with pelican-search 1.1.0, raises the following error:

CRITICAL Exception: Search plugin reported Error: Couldn't read file                        __init__.py:552
                    `output\search.toml`. Got error `stream did not contain valid UTF-8`

Running this unit test against pelican-search 1.1.0 gives the following result:

============================= test session starts =============================
collecting ... collected 1 item

test_search_settings_generator.py::TestSearchSettingsGenerator::TestGenerateStorkSettings::test_output_options_encoding FAILED [100%]
test_search_settings_generator.py:143 (TestSearchSettingsGenerator.TestGenerateStorkSettings.test_output_options_encoding)
self = <tests.test_search_settings_generator.TestSearchSettingsGenerator.TestGenerateStorkSettings object at 0x000001D30FE793D0>
mocker = <pytest_mock.plugin.MockerFixture object at 0x000001D30FE94FA0>

    def test_output_options_encoding(self, mocker: MockerFixture):
        mocker.patch(
            "pelican.plugins.search.SearchSettingsGenerator.get_input_files",
            return_value=[{
                "path": "content/utf-8.md",
                "url": "https://blog.example.com/utf-8",
                "title": "†öζ好Üس⚡⚽",
            }],
        )
        generator = SearchSettingsGenerator(
            context={},
            settings={},
            path=None,
            theme=None,
            output_path="output",
        )
        try:
>           generator.generate_stork_settings(Path("utf-8-foo"))

test_search_settings_generator.py:161: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pelican.plugins.search.search.SearchSettingsGenerator object at 0x000001D30FE9C340>
search_settings_path = WindowsPath('utf-8-foo')

    def generate_stork_settings(self, search_settings_path: Path):
        self.input_options["files"] = self.get_input_files()

        search_settings = {"input": self.input_options}

        if self.output_options:
            search_settings["output"] = self.output_options

        # Write the search settings file to disk
        with search_settings_path.open("w") as fd:
>           rtoml.dump(obj=search_settings, file=fd)

..\pelican\plugins\search\search.py:98: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

obj = {'input': {'base_directory': 'output', 'files': [{'path': 'content/utf-8.md', 'title': '†öζ好Üس⚡⚽', 'url': 'https://blog.example.com/utf-8'}], 'html_selector': 'main'}}
file = <_io.TextIOWrapper name='utf-8-foo' mode='w' encoding='cp936'>

    def dump(obj: Any, file: Union[Path, TextIO], *, pretty: bool = False) -> int:
        """
        Serialize a python object to TOML and write it to a file. `file` may be a `Path` or file object from `open()`.

        If `pretty` is true, output has a more "pretty" format.
        """
        s = dumps(obj, pretty=pretty)
        if isinstance(file, Path):
            return file.write_text(s, encoding='UTF-8')
        else:
>           return file.write(s)
E           UnicodeEncodeError: 'gbk' codec can't encode character '\u2020' in position 155: illegal multibyte sequence

..\..\venv\lib\site-packages\rtoml\__init__.py:62: UnicodeEncodeError

During handling of the above exception, another exception occurred:

self = <tests.test_search_settings_generator.TestSearchSettingsGenerator.TestGenerateStorkSettings object at 0x000001D30FE793D0>
mocker = <pytest_mock.plugin.MockerFixture object at 0x000001D30FE94FA0>

    def test_output_options_encoding(self, mocker: MockerFixture):
        mocker.patch(
            "pelican.plugins.search.SearchSettingsGenerator.get_input_files",
            return_value=[{
                "path": "content/utf-8.md",
                "url": "https://blog.example.com/utf-8",
                "title": "†öζ好Üس⚡⚽",
            }],
        )
        generator = SearchSettingsGenerator(
            context={},
            settings={},
            path=None,
            theme=None,
            output_path="output",
        )
        try:
            generator.generate_stork_settings(Path("utf-8-foo"))
        except Exception:
            os.remove("utf-8-foo")
>           raise UnicodeError
E           UnicodeError

test_search_settings_generator.py:164: UnicodeError
============================== 1 failed in 0.67s ==============================
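The traceback shows the settings file opened as `mode='w' encoding='cp936'`, so the write used the Windows locale encoding and failed on `'\u2020'`. The failure can be reproduced without Pelican or rtoml by encoding the test title directly. This is a minimal sketch; the helper `can_encode` is hypothetical and only illustrates the codec behaviour reported above.

```python
# The mixed-script title from the unit test.
TITLE = "†öζ好Üس⚡⚽"


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# GBK (the codec behind cp936 in the traceback) rejects the title,
# while UTF-8 can represent every character in it.
gbk_result = can_encode(TITLE, "gbk")
utf8_result = can_encode(TITLE, "utf-8")
```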

After I modified search.py as I described in #33, the unit test passed.
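The exact change from #33 is not reproduced in this thread. Judging from the PR title and the traceback, one plausible shape of the fix is to open the settings file with an explicit encoding instead of relying on the locale default. The sketch below assumes that approach. `write_text_utf8` is a hypothetical helper, and a plain string stands in for the TOML that rtoml would produce.

```python
import tempfile
from pathlib import Path


def write_text_utf8(path: Path, text: str) -> None:
    """Write text to path, always encoding it as UTF-8."""
    # An explicit encoding avoids the locale default (cp936 in the traceback).
    with path.open("w", encoding="utf-8") as fd:
        fd.write(text)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "search.toml"
    write_text_utf8(target, 'title = "†öζ好Üس⚡⚽"\n')
    # Decoding the raw bytes strictly confirms the file is valid UTF-8.
    round_trip = target.read_bytes().decode("utf-8")
```

The rtoml source quoted in the traceback also writes with `encoding='UTF-8'` when `dump` receives a `Path` rather than a file object, so passing the path directly may be an alternative.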

Tseing commented 1 year ago

Thank you for the reminder. I am not yet very familiar with the project. I have pushed a new commit that adds chardet as a dependency. I am sorry for the late reply; it is largely due to the time difference.