UTF-8 error in generated search.toml

Tseing commented 1 year ago

[x] I have read the Filing Issues and subsequent “How to Get Help” sections of the documentation.
[x] I have searched the issues (including closed ones) and believe that this is not a duplicate.

OS version and name: Windows 10
Python version: Python 3.9.10
Pelican version: Pelican 4.8.0
Version of this plugin: pelican-search 1.1.0

Issue

Hi,

I got a Unicode error when trying pelican-search 1.1.0.

CRITICAL Exception: Search plugin reported Error: Couldn't read file                        __init__.py:552
                    `output\search.toml`. Got error `stream did not contain valid UTF-8`

I checked output\search.toml and it is encoded as GB18030. Evidently, the pelican-search plugin generated output\search.toml using the system's default encoding. I suspect this error occurs on Windows in non-English environments. I checked the code, line 97:

with search_settings_path.open("w") as fd:
    rtoml.dump(obj=search_settings, file=fd)

The open call should be changed to:

with search_settings_path.open("w", encoding="utf-8") as fd:
    rtoml.dump(obj=search_settings, file=fd)

With this change, the error is resolved. It is a simple fix. Should I create a PR?
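For readers who want to reproduce the failure mode without a GB18030 system, here is a minimal sketch. It assumes the legacy default encoding can be simulated by passing encoding="gb18030" explicitly; the file names, the sample text, and the is_valid_utf8 helper are hypothetical and not part of the plugin.

```python
import tempfile
from pathlib import Path

# Hypothetical sample content with non-ASCII characters.
text = 'title = "中文"\n'

with tempfile.TemporaryDirectory() as tmp:
    legacy = Path(tmp) / "legacy.toml"
    fixed = Path(tmp) / "fixed.toml"

    # Assumption: stands in for open("w") on a system whose default is GB18030.
    with legacy.open("w", encoding="gb18030") as fd:
        fd.write(text)

    # The proposed fix: always write UTF-8, regardless of the system default.
    with fixed.open("w", encoding="utf-8") as fd:
        fd.write(text)

    legacy_bytes = legacy.read_bytes()
    fixed_bytes = fixed.read_bytes()


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print("legacy file is valid UTF-8:", is_valid_utf8(legacy_bytes))
print("fixed file is valid UTF-8:", is_valid_utf8(fixed_bytes))
```

The GB18030 bytes fail a strict UTF-8 decode, which matches the "stream did not contain valid UTF-8" message in the report.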

justinmayer commented 1 year ago

Hi Leo. Thank you for reporting the error. Could you perhaps start by submitting a PR that contains a failing test for this problem?

@lioman: What do you think about the error and the solution proposed here?

lioman commented 1 year ago

Sounds like the correct solution. We should definitely add a test for that.

Tseing commented 1 year ago

Hi @justinmayer, I am glad to help fix this bug. But it seems a little tricky to detect the file encoding in a unit test.

justinmayer commented 1 year ago

@Tseing: You tried with chardet and it did not work as you expected?

Tseing commented 1 year ago

@justinmayer Yes, I found the problem. How Python encodes a file object opened without encoding="utf-8" depends on the language and OS environment. Many character sets encode basic characters identically, so chardet cannot detect the right encoding of a file that contains only digits, Latin letters, and other basic characters like these. I manually made a UTF-8 test string and chardet worked well.
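The ambiguity described above can be illustrated with the standard library alone. This is a sketch with hypothetical sample strings, not chardet itself: ASCII-only text produces the same bytes under several encodings, so no detector can tell them apart, while text with non-ASCII characters does not.

```python
# Hypothetical ASCII-only content, similar to a settings file with basic characters.
ascii_text = 'input_dir = "output"\nversion = 1\n'
# Hypothetical content that includes non-ASCII characters.
mixed_text = 'title = "中文"\n'

# ASCII characters map to the same single bytes in all of these encodings.
ascii_variants = {
    ascii_text.encode(enc) for enc in ("utf-8", "gb18030", "cp1252", "latin-1")
}

# Non-ASCII characters are encoded differently by UTF-8 and GB18030.
mixed_variants = {mixed_text.encode(enc) for enc in ("utf-8", "gb18030")}

print("distinct byte sequences for ASCII-only text:", len(ascii_variants))
print("distinct byte sequences for non-ASCII text:", len(mixed_variants))
```

This suggests a test fixture needs at least one non-ASCII character for the encoding to be observable at all.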

Testing the file encoding requires creating a temporary file, which is deleted right afterward. Is that OK?

justinmayer commented 1 year ago

@Tseing: Yes, I think it's okay to temporarily create and delete a file in the context of a unit test.
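One possible shape for such a test is sketched below. It assumes a hypothetical write_settings helper standing in for the plugin's file-writing code (the real plugin uses rtoml.dump, which is not reproduced here), and it checks the bytes with a strict UTF-8 decode instead of relying on chardet.

```python
import tempfile
import unittest
from pathlib import Path


def write_settings(path: Path, text: str) -> None:
    """Hypothetical stand-in for the plugin's writer, with the proposed fix applied."""
    with path.open("w", encoding="utf-8") as fd:
        fd.write(text)


class SearchSettingsEncodingTest(unittest.TestCase):
    def test_settings_file_is_utf8(self):
        # Non-ASCII content makes the encoding observable in the written bytes.
        text = 'title = "中文"\n'

        # The temporary directory and its file are removed when the block exits.
        with tempfile.TemporaryDirectory() as tmp:
            path = Path(tmp) / "search.toml"
            write_settings(path, text)
            raw = path.read_bytes()

        # A strict decode raises UnicodeDecodeError if the file is not UTF-8.
        decoded = raw.decode("utf-8")
        self.assertIn("中文", decoded)
```

Saved as a test module, this can be run with python -m unittest.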

pelican-plugins / search

UTF-8 error in generated search.toml #33
