Most of the search settings are safe because they consist only of Latin letters; non-English characters occur only in `title` (a non-English `url` is uncommon). So I constructed a UTF-8 test string composed of Greek letters, Chinese characters, Arabic letters, emoji, and so on. This unit test creates a file named `utf-8-foo` and checks its encoding.
A non-UTF-8 file causes many problems, because Pelican uses UTF-8 to organize text and improve language compatibility. Pelican cannot process a non-UTF-8 file, and with pelican-search 1.1.0 it raises this error:
CRITICAL Exception: Search plugin reported Error: Couldn't read file `output\search.toml`. Got error `stream did not contain valid UTF-8`    __init__.py:552
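The failure mode can be reproduced without Pelican. The following is a minimal sketch, not code from the plugin: it assumes the platform default encoding is GBK (cp936), as in the traceback below, and `is_valid_utf8` is a hypothetical helper name.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A single Chinese character, as might appear in a post title.
title = "好"

# Bytes a cp936 (GBK) text stream would write for this title.
gbk_bytes = title.encode("gbk")
# Bytes a reader that expects UTF-8 needs to see.
utf8_bytes = title.encode("utf-8")

print(is_valid_utf8(gbk_bytes))
print(is_valid_utf8(utf8_bytes))
```

The GBK bytes are rejected by a strict UTF-8 decoder, which matches the `stream did not contain valid UTF-8` message above.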
Running this unit test against pelican-search 1.1.0 gives the following result:
============================= test session starts =============================
collecting ... collected 1 item
test_search_settings_generator.py::TestSearchSettingsGenerator::TestGenerateStorkSettings::test_output_options_encoding FAILED [100%]
test_search_settings_generator.py:143 (TestSearchSettingsGenerator.TestGenerateStorkSettings.test_output_options_encoding)
self = <tests.test_search_settings_generator.TestSearchSettingsGenerator.TestGenerateStorkSettings object at 0x000001D30FE793D0>
mocker = <pytest_mock.plugin.MockerFixture object at 0x000001D30FE94FA0>
def test_output_options_encoding(self, mocker: MockerFixture):
mocker.patch(
"pelican.plugins.search.SearchSettingsGenerator.get_input_files",
return_value=[{
"path": "content/utf-8.md",
"url": "https://blog.example.com/utf-8",
"title": "†öζ好Üس⚡⚽",
}],
)
generator = SearchSettingsGenerator(
context={},
settings={},
path=None,
theme=None,
output_path="output",
)
try:
> generator.generate_stork_settings(Path("utf-8-foo"))
test_search_settings_generator.py:161:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <pelican.plugins.search.search.SearchSettingsGenerator object at 0x000001D30FE9C340>
search_settings_path = WindowsPath('utf-8-foo')
def generate_stork_settings(self, search_settings_path: Path):
self.input_options["files"] = self.get_input_files()
search_settings = {"input": self.input_options}
if self.output_options:
search_settings["output"] = self.output_options
# Write the search settings file to disk
with search_settings_path.open("w") as fd:
> rtoml.dump(obj=search_settings, file=fd)
..\pelican\plugins\search\search.py:98:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
obj = {'input': {'base_directory': 'output', 'files': [{'path': 'content/utf-8.md', 'title': '†öζ好Üس⚡⚽', 'url': 'https://blog.example.com/utf-8'}], 'html_selector': 'main'}}
file = <_io.TextIOWrapper name='utf-8-foo' mode='w' encoding='cp936'>
def dump(obj: Any, file: Union[Path, TextIO], *, pretty: bool = False) -> int:
"""
Serialize a python object to TOML and write it to a file. `file` may be a `Path` or file object from `open()`.
If `pretty` is true, output has a more "pretty" format.
"""
s = dumps(obj, pretty=pretty)
if isinstance(file, Path):
return file.write_text(s, encoding='UTF-8')
else:
> return file.write(s)
E UnicodeEncodeError: 'gbk' codec can't encode character '\u2020' in position 155: illegal multibyte sequence
..\..\venv\lib\site-packages\rtoml\__init__.py:62: UnicodeEncodeError
During handling of the above exception, another exception occurred:
self = <tests.test_search_settings_generator.TestSearchSettingsGenerator.TestGenerateStorkSettings object at 0x000001D30FE793D0>
mocker = <pytest_mock.plugin.MockerFixture object at 0x000001D30FE94FA0>
def test_output_options_encoding(self, mocker: MockerFixture):
mocker.patch(
"pelican.plugins.search.SearchSettingsGenerator.get_input_files",
return_value=[{
"path": "content/utf-8.md",
"url": "https://blog.example.com/utf-8",
"title": "†öζ好Üس⚡⚽",
}],
)
generator = SearchSettingsGenerator(
context={},
settings={},
path=None,
theme=None,
output_path="output",
)
try:
generator.generate_stork_settings(Path("utf-8-foo"))
except Exception:
os.remove("utf-8-foo")
> raise UnicodeError
E UnicodeError
test_search_settings_generator.py:164: UnicodeError
============================== 1 failed in 0.67s ==============================
I modified `search.py` as I described in #33, and the unit test passed.
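For readers who have not seen #33, the general shape of such a fix is to stop relying on the platform default encoding when the settings file is opened. The sketch below is an assumption about the approach, not the actual change to `search.py`: `write_settings` is a hypothetical helper, and a plain string stands in for the TOML text that `rtoml` would produce.

```python
import tempfile
from pathlib import Path


def write_settings(path: Path, text: str) -> None:
    """Write text with an explicit encoding instead of the locale default."""
    # Without encoding="utf-8", open() falls back to the platform default,
    # which was cp936 in the traceback above.
    with path.open("w", encoding="utf-8") as fd:
        fd.write(text)


# Stand-in for the serialized settings; same test string as the unit test.
sample = 'title = "†öζ好Üس⚡⚽"'

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "utf-8-foo"
    write_settings(target, sample)
    raw = target.read_bytes()

# A strict decode raises UnicodeDecodeError if the file is not valid UTF-8.
round_trip = raw.decode("utf-8")
print(round_trip == sample)
```

The traceback also shows that `rtoml.dump` writes UTF-8 itself when it is given a `Path` rather than an open file object, so passing the path directly may be another way to get the same result.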
Thank you for reminding me. I do not know the project well yet. I have submitted a new commit that adds chardet as a dependency. I am sorry for my late reply; it is largely due to the time difference.
Pull Request Checklist
Resolves: #33
I added a unit test to check the encoding of `search.toml`.