ohare93 / brain-brew

Automated Anki flashcard creation and extraction to/from Csv
The Unlicense

html_file.py cannot read utf-8 file in chinese windows #39

Closed ruin1990 closed 2 years ago

ruin1990 commented 2 years ago

Hi, I'm using brain-brew to build ultimate-geography for my translation.

If an HTML file contains non-English characters, brain_brew\representation\generic\html_file.py cannot read it. I can read the HTML properly if I save it as GBK, but I don't think that works for other contributors whose OS uses English or a similar language: they would have to convert the file from GBK to their own OS's local encoding (UTF-8?). My test files are attached: headers.zip

Error message when building ultimate-geography:

INFO:root:Builder file recipes/source_to_anki.yaml is ✔ good
INFO:root:Attempting to generate Guids
INFO:root:Generate guids complete
Traceback (most recent call last):
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\Scripts\brain_brew-script.py",
>
    load_entry_point('Brain-Brew==0.3.2', 'console_scripts', 'brain_brew')()
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\brain_brew-0
_brew\main.py", line 25, in main
    recipe = TopLevelBuilder.parse_and_read(recipe_file_name, verify_only)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\brain_brew-0
_brew\configuration\build_config\top_level_builder.py", line 58, in parse_and_read
    return cls.from_list(recipe_data)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\brain_brew-0
_brew\configuration\build_config\recipe_builder.py", line 17, in from_list
    tasks = cls.read_tasks(data)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\brain_brew-0
_brew\configuration\build_config\recipe_builder.py", line 65, in read_tasks
    task_or_tasks = [matching_task.from_repr(task_arguments)]
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\brain_brew-0
_brew\configuration\build_config\parts_builder.py", line 40, in from_repr
    return cls.from_list(data)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\brain_brew-0
_brew\configuration\build_config\recipe_builder.py", line 17, in from_list
    tasks = cls.read_tasks(data)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\brain_brew-0
_brew\configuration\build_config\recipe_builder.py", line 63, in read_tasks
    task_or_tasks = [matching_task.from_repr(t_arg) for t_arg in task_arguments]
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\brain_brew-0
_brew\configuration\build_config\recipe_builder.py", line 63, in <listcomp>
    task_or_tasks = [matching_task.from_repr(t_arg) for t_arg in task_arguments]
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\brain_brew-0
_brew\build_tasks\deck_parts\headers_from_yaml_part.py", line 50, in from_repr
    override=HeadersOverride.from_repr(rep.override)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\brain_brew-0
_brew\build_tasks\overrides\headers_override.py", line 30, in from_repr
    deck_desc_html_file=HTMLFile.create_or_get(rep.deck_description_html_file)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\brain_brew-0
_brew\representation\generic\source_file.py", line 35, in create_or_get
    file = cls.from_file_loc(location)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\brain_brew-0
_brew\representation\generic\html_file.py", line 18, in from_file_loc
    return cls(file_loc)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\brain_brew-0
_brew\representation\generic\html_file.py", line 14, in __init__
    self.read_file()
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\brain_brew-0
_brew\representation\generic\html_file.py", line 22, in read_file
    self._data = r.read()
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa1 in position 157: illegal multibyte sequence

I found that csv_file.py supports UTF-8.

ruin1990 commented 2 years ago

You can also test with my branch, ultimate-geography#translation_zh

aplaice commented 2 years ago

I assume that your default encoding is GBK? (What does import sys; print(sys.getdefaultencoding()) return?)

Is the above crash when you're using desc_zh_UTF8.html?


The issue might be that html_file.py doesn't specify the file encoding, so on your system where GBK is default (?), it tries to read the UTF-8 encoded desc_zh.html as GBK and fails.

With desc_zh.html encoded as GBK, systems with default UTF-8 crash: https://github.com/ruin1990/ultimate-geography/runs/4647875652?check_suite_focus=true (Also, reproduced locally.)
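
The mismatch described above can be reproduced without brain-brew. The following is a minimal sketch, not code from the project: the sample character and the helper name try_decode are chosen only for illustration. It encodes the same text as UTF-8 and as GBK, then decodes each byte string with both codecs.

```python
# Sample text chosen only for illustration; any non-ASCII text would do.
text = "中"

utf8_bytes = text.encode("utf-8")  # three bytes: e4 b8 ad
gbk_bytes = text.encode("gbk")     # two bytes: d6 d0


def try_decode(data: bytes, encoding: str) -> str:
    """Return the decoded text, or a short error summary if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"failed: {exc.reason}"


# Matching codec: round-trips cleanly.
print("utf-8 bytes as utf-8:", try_decode(utf8_bytes, "utf-8"))
print("gbk bytes as gbk:    ", try_decode(gbk_bytes, "gbk"))

# Mismatched codec: both directions raise UnicodeDecodeError for this sample.
print("utf-8 bytes as gbk:  ", try_decode(utf8_bytes, "gbk"))
print("gbk bytes as utf-8:  ", try_decode(gbk_bytes, "utf-8"))
```

Not every mismatch fails this loudly: some byte sequences happen to be valid in the wrong codec and decode silently into the wrong characters, which is another reason to pin a single encoding.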


@ohare93

AFAICT the two solutions are:

  1. Enforce utf-8 for HTML just like for JSON and CSV (and only use UTF-8-encoded files). (Easy!)
  2. Allow specifying encoding for files in the YAML recipes. (Messy!)

ruin1990 commented 2 years ago

I assume that your default encoding is GBK? (What does import sys; print(sys.getdefaultencoding()) return?)

Actually, I got UTF-8:

C:\Users\Administrator\Desktop\github\ultimate-geography>python
Python 3.6.2 (v3.6.2:5fd33b5, Jul  8 2017, 04:57:36) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys; print(sys.getdefaultencoding())
utf-8

Is the above crash when you're using desc_zh_UTF8.html?

Yes, I don't know why it uses GBK/Unicode by default.
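
A likely explanation, based on documented Python behaviour rather than anything confirmed in this thread: sys.getdefaultencoding() reports the encoding used for str/bytes conversions, which is always utf-8 on Python 3. When no encoding argument is given, open() instead uses locale.getpreferredencoding(False). On a Chinese-locale Windows install that is typically cp936, which Python treats as an alias of GBK. A small sketch to compare the two values (the variable names are illustrative):

```python
import locale
import sys

# Encoding used for implicit str <-> bytes conversions; always "utf-8" on Python 3.
default_str_encoding = sys.getdefaultencoding()

# Encoding that open() falls back to when no encoding= argument is passed.
# It depends on the OS locale, so the value differs between machines.
default_file_encoding = locale.getpreferredencoding(False)

print("sys.getdefaultencoding():          ", default_str_encoding)
print("locale.getpreferredencoding(False):", default_file_encoding)
```

If the second value is not UTF-8, any open() call without an explicit encoding will misread UTF-8 files on that machine.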

aplaice commented 2 years ago

Yes, I don't know why it uses GBK/Unicode by default.

Yeah, it's weird.

You could try applying the following patch to html_file.py:

diff --git a/brain_brew/representation/generic/html_file.py b/brain_brew/representation/generic/html_file.py
index 283ca41..f9db114 100644
--- a/brain_brew/representation/generic/html_file.py
+++ b/brain_brew/representation/generic/html_file.py
@@ -3,6 +3,7 @@

 from brain_brew.representation.generic.source_file import SourceFile

+_encoding = "utf-8"

 @dataclass
 class HTMLFile(SourceFile):
@@ -18,7 +19,7 @@ def from_file_loc(cls, file_loc) -> 'HTMLFile':
         return cls(file_loc)

     def read_file(self):
-        r = codecs.open(self.file_location, 'r')
+        r = codecs.open(self.file_location, 'r', encoding=_encoding)
         self._data = r.read()

     def get_data(self, deep_copy=False) -> str:
@@ -26,7 +27,7 @@ def get_data(self, deep_copy=False) -> str:

     @staticmethod
     def write_file(file_location, data):
-        with open(file_location, "w+") as file:
+        with open(file_location, "w+", encoding=_encoding) as file:
             file.write(data)

     @staticmethod

(https://github.com/ohare93/brain-brew/pull/40/commits/7e9c9aaa07722d21645cbf9ff4f00a86291cd5e7)

(Obviously, BrainBrew from that branch won't work directly, since it's based on the latest BrainBrew version, not v0.3.2, like we use in AUG.)

(Note that the patch should allow building with desc_zh_UTF8.html and will likely prevent building with desc_zh_GBK.html.)

ruin1990 commented 2 years ago

Yeah, it's weird.

I agree

You could try applying the following patch to html_file.py

I tried this patch and got the following exception traceback:

INFO:root:Builder file recipes\source_to_anki.yaml is ✔ good
INFO:root:Attempting to generate Guids
INFO:root:Generate guids complete
Traceback (most recent call last):
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\Scripts\brain_brew-script.py", line 11, in <module>
    load_entry_point('Brain-Brew==0.3.2', 'console_scripts', 'brain_brew')()
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\brain_brew-0.3.2-py3.6.egg\brain_brew\main.py", line 25, in main
    recipe = TopLevelBuilder.parse_and_read(recipe_file_name, verify_only)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\brain_brew-0.3.2-py3.6.egg\brain_brew\configuration\build_config\top_level_builder.py", line 58, in parse_and_read
    return cls.from_list(recipe_data)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\brain_brew-0.3.2-py3.6.egg\brain_brew\configuration\build_config\recipe_builder.py", line 17, in from_list
    tasks = cls.read_tasks(data)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\brain_brew-0.3.2-py3.6.egg\brain_brew\configuration\build_config\recipe_builder.py", line 65, in read_tasks
    task_or_tasks = [matching_task.from_repr(task_arguments)]
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\brain_brew-0.3.2-py3.6.egg\brain_brew\configuration\build_config\parts_builder.py", line 40, in from_repr
    return cls.from_list(data)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\brain_brew-0.3.2-py3.6.egg\brain_brew\configuration\build_config\recipe_builder.py", line 17, in from_list
    tasks = cls.read_tasks(data)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\brain_brew-0.3.2-py3.6.egg\brain_brew\configuration\build_config\recipe_builder.py", line 70, in read_tasks
    inner_task.execute()
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\brain_brew-0.3.2-py3.6.egg\brain_brew\build_tasks\deck_parts\note_model_template_from_html_files.py", line 67, in execute
    raise ValueError(f"Cannot find separator {html_separator} in html file '{self.html_file.file_location}'")
ValueError: Cannot find separator

I think this is a bug in codecs.open (Python 3.6.2), because the following modification runs properly:

brain_brew/representation/generic/html_file.py
@@ -3,6 +3,7 @@

 from brain_brew.representation.generic.source_file import SourceFile

+_encoding = "utf-8"

 @dataclass
 class HTMLFile(SourceFile):
@@ -18,7 +19,7 @@ def from_file_loc(cls, file_loc) -> 'HTMLFile':
         return cls(file_loc)

     def read_file(self):
-        r = codecs.open(self.file_location, 'r')
+        r = open(self.file_location, 'r', encoding=_encoding)
         self._data = r.read()

     def get_data(self, deep_copy=False) -> str:
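
One possible explanation for the difference, offered as an assumption rather than a confirmed diagnosis: when given an encoding, codecs.open is documented to open the underlying file in binary mode, so it does not translate "\r\n" line endings. The built-in open in text mode converts them to "\n" on reading. If the HTML file was saved with Windows line endings, a separator made of "\n" characters would then be missing from the codecs.open result. The sketch below uses a hypothetical separator value and file contents purely for illustration; they are not taken from brain-brew.

```python
import codecs
import os
import tempfile

# Hypothetical separator, for illustration only.
SEPARATOR = "\n\n--\n\n"

# Write a small UTF-8 file with Windows (CRLF) line endings.
raw = "front 正面\r\n\r\n--\r\n\r\nback 背面".encode("utf-8")
with tempfile.NamedTemporaryFile(delete=False, suffix=".html") as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # codecs.open with an encoding reads the file in binary mode underneath,
    # so "\r\n" is passed through unchanged.
    with codecs.open(path, "r", encoding="utf-8") as f:
        via_codecs = f.read()

    # The built-in open in text mode translates "\r\n" to "\n" when reading.
    with open(path, "r", encoding="utf-8") as f:
        via_builtin = f.read()
finally:
    os.remove(path)

print("separator found via codecs.open: ", SEPARATOR in via_codecs)
print("separator found via built-in open:", SEPARATOR in via_builtin)
```

Under this assumption, switching to the built-in open (as in the modification above) fixes both the encoding and the line-ending handling in one step.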