Closed ruin1990 closed 2 years ago
You can also test in my branch ultimate-geography#translation_zh
I assume that your default encoding is GBK? (What does import sys; print(sys.getdefaultencoding()) return?)
Is the above crash when you're using desc_zh_UTF8.html?
The issue might be that html_file.py doesn't specify the file encoding, so on your system where GBK is default (?), it tries to read the UTF-8 encoded desc_zh.html as GBK and fails.
With desc_zh.html encoded as GBK, systems with default UTF-8 crash: https://github.com/ruin1990/ultimate-geography/runs/4647875652?check_suite_focus=true (Also, reproduced locally.)
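To illustrate the mismatch in both directions, here is a minimal sketch. The sample text and the try_decode helper are made up for illustration and are not part of BrainBrew; the sketch only shows what happens when bytes written in one encoding are decoded with the other.

```python
# Sample non-ASCII content standing in for desc_zh.html (hypothetical).
text = "\u4e2d\u6587"  # the two characters "中文"


def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")

# GBK bytes read as UTF-8: the byte sequence is invalid UTF-8, so decoding fails.
gbk_as_utf8 = try_decode(gbk_bytes, "utf-8")

# UTF-8 bytes read as GBK: decoding either fails or yields mojibake,
# but it never reproduces the original text.
utf8_as_gbk = try_decode(utf8_bytes, "gbk")

print(gbk_as_utf8)
print(utf8_as_gbk == text)
```

Either way, a file only round-trips when the reader and the writer agree on the encoding, which is why pinning one encoding in html_file.py looks like the more portable option.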
@ohare93
AFAICT the two solutions are:
I assume that your default encoding is GBK? (What does import sys; print(sys.getdefaultencoding()) return?)
Actually, I got UTF-8
C:\Users\Administrator\Desktop\github\ultimate-geography>python
Python 3.6.2 (v3.6.2:5fd33b5, Jul 8 2017, 04:57:36) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys; print(sys.getdefaultencoding())
utf-8
Is the above crash when you're using desc_zh_UTF8.html?
Yes, I don't know why it uses GBK/Unicode by default.
Yes, I don't know why it uses GBK/Unicode by default.
Yeah, it's weird.
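A possible explanation: in Python 3, sys.getdefaultencoding() always reports utf-8, because it describes str/bytes conversion rather than file I/O. When no encoding argument is given, open() (and codecs.open(), which then just delegates to it) uses the locale's preferred encoding instead. On a Chinese-language Windows install that is typically cp936 (GBK), although I'm assuming that about this particular machine. A small sketch to check both values (report_encodings is just an illustrative helper name):

```python
import locale
import sys


def report_encodings() -> dict:
    """Collect the two encodings that are easy to confuse."""
    return {
        # Default for str <-> bytes conversion; always 'utf-8' on Python 3.
        "sys.getdefaultencoding()": sys.getdefaultencoding(),
        # What open() falls back to when no encoding argument is passed
        # (unless Python's UTF-8 mode is enabled).
        "locale.getpreferredencoding(False)": locale.getpreferredencoding(False),
    }


for name, value in report_encodings().items():
    print(f"{name}: {value}")
```

If the second value is cp936 or gbk, that would account for the UTF-8 file being read as GBK.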
You could try applying the following patch to html_file.py:
diff --git a/brain_brew/representation/generic/html_file.py b/brain_brew/representation/generic/html_file.py
index 283ca41..f9db114 100644
--- a/brain_brew/representation/generic/html_file.py
+++ b/brain_brew/representation/generic/html_file.py
@@ -3,6 +3,7 @@
 from brain_brew.representation.generic.source_file import SourceFile
+_encoding = "utf-8"
 @dataclass
 class HTMLFile(SourceFile):
@@ -18,7 +19,7 @@ def from_file_loc(cls, file_loc) -> 'HTMLFile':
         return cls(file_loc)
     def read_file(self):
-        r = codecs.open(self.file_location, 'r')
+        r = codecs.open(self.file_location, 'r', encoding=_encoding)
         self._data = r.read()
     def get_data(self, deep_copy=False) -> str:
@@ -26,7 +27,7 @@ def get_data(self, deep_copy=False) -> str:
     @staticmethod
     def write_file(file_location, data):
-        with open(file_location, "w+") as file:
+        with open(file_location, "w+", encoding=_encoding) as file:
             file.write(data)
     @staticmethod
(https://github.com/ohare93/brain-brew/pull/40/commits/7e9c9aaa07722d21645cbf9ff4f00a86291cd5e7)
(Obviously, BrainBrew from that branch won't work directly, since it's based on the latest BrainBrew version, not v0.3.2, like we use in AUG.)
(Note that the patch should allow building with desc_zh_UTF8.html and will likely prevent building with desc_zh_GBK.html.)
Yeah, it's weird.
I agree
You could try applying the following patch to html_file.py
I tried this patch and got the following exception traceback:
INFO:root:Builder file recipes\source_to_anki.yaml is ✔ good
INFO:root:Attempting to generate Guids
INFO:root:Generate guids complete
Traceback (most recent call last):
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\Scripts\brain_brew-script.py", line 11, in <module>
load_entry_point('Brain-Brew==0.3.2', 'console_scripts', 'brain_brew')()
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\brain_brew-0.3.2-py3.6.egg\brain_brew\main.py", line 25, in main
recipe = TopLevelBuilder.parse_and_read(recipe_file_name, verify_only)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\brain_brew-0.3.2-py3.6.egg\brain_brew\configuration\build_config\top_level_builder.py", line 58, in parse_and_read
return cls.from_list(recipe_data)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\brain_brew-0.3.2-py3.6.egg\brain_brew\configuration\build_config\recipe_builder.py", line 17, in from_list
tasks = cls.read_tasks(data)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\brain_brew-0.3.2-py3.6.egg\brain_brew\configuration\build_config\recipe_builder.py", line 65, in read_tasks
task_or_tasks = [matching_task.from_repr(task_arguments)]
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\brain_brew-0.3.2-py3.6.egg\brain_brew\configuration\build_config\parts_builder.py", line 40, in from_repr
return cls.from_list(data)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\brain_brew-0.3.2-py3.6.egg\brain_brew\configuration\build_config\recipe_builder.py", line 17, in from_list
tasks = cls.read_tasks(data)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\brain_brew-0.3.2-py3.6.egg\brain_brew\configuration\build_config\recipe_builder.py", line 70, in read_tasks
inner_task.execute()
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\brain_brew-0.3.2-py3.6.egg\brain_brew\build_tasks\deck_parts\note_model_template_from_html_files.py", line 67, in execute
raise ValueError(f"Cannot find separator {html_separator} in html file '{self.html_file.file_location}'")
ValueError: Cannot find separator
I think this is a bug in codecs.open (Python 3.6.2), because the following modification actually runs properly:
brain_brew/representation/generic/html_file.py
@@ -3,6 +3,7 @@
 from brain_brew.representation.generic.source_file import SourceFile
+_encoding = "utf-8"
 @dataclass
 class HTMLFile(SourceFile):
@@ -18,7 +19,7 @@ def from_file_loc(cls, file_loc) -> 'HTMLFile':
         return cls(file_loc)
     def read_file(self):
-        r = codecs.open(self.file_location, 'r')
+        r = open(self.file_location, 'r', encoding=_encoding)
         self._data = r.read()
     def get_data(self, deep_copy=False) -> str:
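This may be documented behaviour rather than a bug: when an encoding is passed, codecs.open() always opens the underlying file in binary mode and does no newline conversion, whereas the built-in open() translates \r\n to \n on reading by default. If the HTML files on the Windows checkout have CRLF line endings (an assumption on my part) and the separator contains \n, the separator would not be found in the codecs.open() result. A minimal sketch of the difference, where SEPARATOR and the file content are hypothetical stand-ins, not BrainBrew's actual values:

```python
import codecs
import os
import tempfile

# Hypothetical separator containing newlines; not taken from BrainBrew.
SEPARATOR = "\n\n--\n\n"


def separator_found(reader, path):
    """Read the file with the given opener and check for the separator."""
    with reader(path, "r", encoding="utf-8") as f:
        return SEPARATOR in f.read()


# Write a small file with Windows (CRLF) line endings, byte for byte.
fd, path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "wb") as f:
    f.write("front\r\n\r\n--\r\n\r\nback".encode("utf-8"))

# codecs.open keeps the \r\n pairs, so the \n-only separator is missed.
found_with_codecs = separator_found(codecs.open, path)
# Built-in open translates \r\n to \n, so the separator is found.
found_with_open = separator_found(open, path)

os.remove(path)
print(found_with_codecs, found_with_open)
```

That would match what was observed above: the codecs.open() patch fails with "Cannot find separator", while the built-in open() with the same encoding works.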
Hi, I'm using brain-brew to build ultimate-geography for my translation.
In brain_brew\representation\generic\html_file.py, if the HTML file contains non-English characters, it cannot be read. However, I can read the HTML file properly when it is encoded as GBK. I don't think that is a universal solution for other contributors whose OS uses English or a similar language: they would have to convert the file from GBK to their OS's local encoding (UTF-8?). My test files are attached: headers.zip
Error message when building ultimate-geography
I found that csv_file.py supports the UTF-8 format.