ohare93 / brain-brew

Automated Anki flashcard creation and extraction to/from Csv
The Unlicense

UnicodeDecodeError with source_to_anki #42

Closed: Astilimos closed this issue 2 years ago

Astilimos commented 2 years ago

Hi, I'm using brain-brew to build anki-ultimate-geography for my translation. I have Python 3.7 installed on Windows. When I run brain_brew recipes/source_to_anki.yaml, I always get the following error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 71: character maps to <undefined>

And here's the rest of what the cmd says:


INFO:root:Builder file recipes/source_to_anki.yaml is ✔ good
INFO:root:Attempting to generate Guids
INFO:root:Generate guids complete
Traceback (most recent call last):
  File "c:\users\redacted\appdata\local\programs\python\python37\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\redacted\appdata\local\programs\python\python37\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\redacted\.virtualenvs\anki-ultimate-geography-dldUsQk7\Scripts\brain_brew.exe\__main__.py", line 7, in <module>
  File "C:\Users\redacted\.virtualenvs\anki-ultimate-geography-dldUsQk7\lib\site-packages\brain_brew\main.py", line 25, in main
    recipe = TopLevelBuilder.parse_and_read(recipe_file_name, verify_only)
  File "C:\Users\redacted\.virtualenvs\anki-ultimate-geography-dldUsQk7\lib\site-packages\brain_brew\configuration\build_config\top_level_builder.py", line 58, in parse_and_read
    return cls.from_list(recipe_data)
  File "C:\Users\redacted\.virtualenvs\anki-ultimate-geography-dldUsQk7\lib\site-packages\brain_brew\configuration\build_config\recipe_builder.py", line 17, in from_list
    tasks = cls.read_tasks(data)
  File "C:\Users\redacted\.virtualenvs\anki-ultimate-geography-dldUsQk7\lib\site-packages\brain_brew\configuration\build_config\recipe_builder.py", line 65, in read_tasks
    task_or_tasks = [matching_task.from_repr(task_arguments)]
  File "C:\Users\redacted\.virtualenvs\anki-ultimate-geography-dldUsQk7\lib\site-packages\brain_brew\configuration\build_config\parts_builder.py", line 40, in from_repr
    return cls.from_list(data)
  File "C:\Users\redacted\.virtualenvs\anki-ultimate-geography-dldUsQk7\lib\site-packages\brain_brew\configuration\build_config\recipe_builder.py", line 17, in from_list
    tasks = cls.read_tasks(data)
  File "C:\Users\redacted\.virtualenvs\anki-ultimate-geography-dldUsQk7\lib\site-packages\brain_brew\configuration\build_config\recipe_builder.py", line 63, in read_tasks
    task_or_tasks = [matching_task.from_repr(t_arg) for t_arg in task_arguments]
  File "C:\Users\redacted\.virtualenvs\anki-ultimate-geography-dldUsQk7\lib\site-packages\brain_brew\configuration\build_config\recipe_builder.py", line 63, in <listcomp>
    task_or_tasks = [matching_task.from_repr(t_arg) for t_arg in task_arguments]
  File "C:\Users\redacted\.virtualenvs\anki-ultimate-geography-dldUsQk7\lib\site-packages\brain_brew\build_tasks\deck_parts\headers_from_yaml_part.py", line 50, in from_repr
    override=HeadersOverride.from_repr(rep.override)
  File "C:\Users\redacted\.virtualenvs\anki-ultimate-geography-dldUsQk7\lib\site-packages\brain_brew\build_tasks\overrides\headers_override.py", line 30, in from_repr
    deck_desc_html_file=HTMLFile.create_or_get(rep.deck_description_html_file)
  File "C:\Users\redacted\.virtualenvs\anki-ultimate-geography-dldUsQk7\lib\site-packages\brain_brew\representation\generic\source_file.py", line 35, in create_or_get
    file = cls.from_file_loc(location)
  File "C:\Users\redacted\.virtualenvs\anki-ultimate-geography-dldUsQk7\lib\site-packages\brain_brew\representation\generic\html_file.py", line 18, in from_file_loc
    return cls(file_loc)
  File "C:\Users\redacted\.virtualenvs\anki-ultimate-geography-dldUsQk7\lib\site-packages\brain_brew\representation\generic\html_file.py", line 14, in __init__
    self.read_file()
  File "C:\Users\redacted\.virtualenvs\anki-ultimate-geography-dldUsQk7\lib\site-packages\brain_brew\representation\generic\html_file.py", line 22, in read_file
    self._data = r.read()
  File "c:\users\redacted\appdata\local\programs\python\python37\lib\encodings\cp1250.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 71: character maps to <undefined>

If it helps, the thing I'm trying to build is on my fork

aplaice commented 2 years ago

Thanks for the bug report!

This seems to be an encoding issue: 明 from src/headers/desc_zh.html is represented in UTF-8 as the three bytes 0xe6 0x98 0x8e, followed here by a newline, 0x0a (e698 8e0a in a hex dump). However, the byte 0x98 is undefined in CP1250, which is probably your default encoding*, so Brain Brew crashes.

* On my computer, which uses UTF-8, I don't get that crash. (Obviously, BrainBrew should work for any default/system encoding!)
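
A minimal sketch of the failure described above, for anyone who wants to check it locally. It only uses the standard library; the helper name try_decode is made up for illustration and is not part of Brain Brew. It prints the encoding that open() falls back to when none is given, then decodes the bytes of 明 plus a newline as CP1250 and as UTF-8.

```python
import locale

# The encoding open() uses when no encoding= argument is passed.
# On the reporter's machine this was apparently cp1250; on many others it is UTF-8.
default_encoding = locale.getpreferredencoding(False)
print("default encoding:", default_encoding)

# The character from desc_zh.html followed by a newline, as stored on disk in UTF-8.
data = "明\n".encode("utf-8")
print(data.hex())  # e6988e0a


def try_decode(raw: bytes, encoding: str):
    """Hypothetical helper: return (text, None) on success or (None, error) on failure."""
    try:
        return raw.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, exc


# Decoding as CP1250 fails on the second byte, 0x98, which has no mapping there.
text, error = try_decode(data, "cp1250")
print(error)

# Decoding as UTF-8 succeeds.
utf8_text, utf8_error = try_decode(data, "utf-8")
print(repr(utf8_text))
```

If the first line printed is something other than UTF-8, any open() call without an explicit encoding will read UTF-8 files with that other codec.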

I think that this bug is fixed with #40. Unfortunately, AFAICT the fix is not yet released (and also Anki Ultimate Geography still hasn't upgraded to the latest released version of Brain Brew).


As a workaround in the meantime, you can probably patch C:\Users\redacted\.virtualenvs\anki-ultimate-geography-dldUsQk7\lib\site-packages\brain_brew\representation\generic\html_file.py yourself. This is obviously a horrible approach, since modifying the internals of a Python virtualenv is generally not a good idea. The only thing it can mess up is Brain Brew itself, though, so it's not dangerous, just ugly and bad for reproducibility:

diff --git a/brain_brew/representation/generic/html_file.py b/brain_brew/representation/generic/html_file.py
index 283ca41..f9db114 100644
--- a/brain_brew/representation/generic/html_file.py
+++ b/brain_brew/representation/generic/html_file.py
@@ -3,6 +3,7 @@

 from brain_brew.representation.generic.source_file import SourceFile

+_encoding = "utf-8"

 @dataclass
 class HTMLFile(SourceFile):
@@ -18,7 +19,7 @@ def from_file_loc(cls, file_loc) -> 'HTMLFile':
         return cls(file_loc)

     def read_file(self):
-        r = codecs.open(self.file_location, 'r')
+        r = codecs.open(self.file_location, 'r', encoding=_encoding)
         self._data = r.read()

     def get_data(self, deep_copy=False) -> str:
@@ -26,7 +27,7 @@ def get_data(self, deep_copy=False) -> str:

     @staticmethod
     def write_file(file_location, data):
-        with open(file_location, "w+") as file:
+        with open(file_location, "w+", encoding=_encoding) as file:
             file.write(data)

     @staticmethod

Aside: your recipes/source_to_anki.yaml has two extraneous headers: default header:pl.

Sorry and thanks again!

Astilimos commented 2 years ago

Thank you for the reply. When I tried the patch I got a different error, ValueError: Cannot find separator... Looking at the tracebacks, it's the same one ruin1990 had in #39 (sorry for kind of ignoring that issue but the initial errors weren't the same and it was closed with a merged pull and everything so I assumed that in particular was fixed). Their fix of changing r = codecs.open(self.file_location, 'r') to r = open(self.file_location, 'r', encoding=_encoding) seems to have worked. And thank you for the heads up about the error in the recipe!

aplaice commented 2 years ago

Thanks very much for checking!

I had been hoping that the ValueError: ... ruin1990 had reported was due to their using Python 3.6, and that whatever weird bug in codecs was causing the issue had since been fixed. :(

I have no idea what triggers it, as I can't reproduce either the original or the new error (I don't particularly want to play around with Windows encodings etc...)

Edit: My best guess is that the ValueError is due to the way codecs.open(..., 'r') handles newlines (maybe silently converting them to \r\n on Windows?). (But then wouldn't everybody using Windows have been affected? We must have had Windows users before you and ruin1990.)
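
On the newline guess: as far as the Python documentation goes, codecs.open with an explicit encoding opens the underlying file in binary mode and does no newline translation, whereas the built-in open in text mode converts \r\n to \n when reading. The sketch below shows that difference on a file with Windows line endings. The function read_both_ways, the SEPARATOR value, and the sample content are assumptions made for illustration; the separator Brain Brew actually searches for is not shown in this thread.

```python
import codecs
import os
import tempfile

# Assumed separator, purely for illustration.
SEPARATOR = "\n\n--\n\n"


def read_both_ways(raw: bytes, encoding: str = "utf-8"):
    """Write raw bytes to a temp file, then read them back with codecs.open and with open."""
    fd, path = tempfile.mkstemp(suffix=".html")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(raw)
        # With an encoding given, codecs.open reads in binary mode: \r\n is kept as-is.
        with codecs.open(path, "r", encoding=encoding) as f:
            via_codecs = f.read()
        # Built-in open in text mode applies universal newlines: \r\n becomes \n.
        with open(path, "r", encoding=encoding) as f:
            via_open = f.read()
    finally:
        os.remove(path)
    return via_codecs, via_open


# A file saved with CRLF line endings, as is common for Windows checkouts.
via_codecs, via_open = read_both_ways("front\r\n\r\n--\r\n\r\nback".encode("utf-8"))

print(SEPARATOR in via_codecs)  # False: the \r characters are still in the text
print(SEPARATOR in via_open)    # True: line endings were normalised to \n
```

If that is what happened here, it would also explain why the ValueError only appeared after adding encoding= to the codecs.open call: without an encoding, codecs.open simply delegates to the built-in open, so files were read in text mode with the system default codec.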

However, I don't think that there's any real reason to use codecs.open instead of open in Python 3, so we can just switch to open here. (I'll open a PR, just in case: we no longer have note_model_template_from_html_files.py in the latest BrainBrew, but a similar issue might still be present.)


> sorry for kind of ignoring that issue but the initial errors weren't the same and it was closed with a merged pull and everything so I assumed that in particular was fixed

That (ignoring the closed issue) was very much the correct course of action, so no need to apologise. I had also thought that that issue was fixed. :)

ohare93 commented 2 years ago

Sorry for the inconvenience, and thank you @aplaice for the help :clap: God damn encoding issues, the bane of all fun!

I never released that small yet vital fix?! :scream: I will get it done on Monday morning!

aplaice commented 2 years ago

> I never released that small yet vital fix?!

To be fair, I'm not sure if my fix actually worked. :) (Applying the same patch to BB 0.3.2 apparently didn't help; applying it to master/BB 0.3.6 might have (note_model_template_from_html_files.py is no longer present), but I have no way of knowing...)

> God damn encoding issues, the bane of all fun!

Yeah, I also hate them. :)

Astilimos commented 2 years ago

> To be fair, I'm not sure if my fix actually worked. :) (Applying the same patch to BB 0.3.2 apparently didn't help; applying it to master/BB 0.3.6 might have (note_model_template_from_html_files.py is no longer present), but I have no way of knowing...)

Is 0.3.2 the Ultimate Geography version? One detail I didn't mention: when I first pasted the patch, it failed with a really odd error. I then removed the @staticmethod def write_file... portion after noticing it wasn't present in the original, and after that it ran and gave ValueError:...

aplaice commented 2 years ago

Yeah, 0.3.2 is the UG version!

ohare93 commented 2 years ago

New version released :+1: Thanks for reporting, let me know if something is still wrong