UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 443: character maps to <undefined>

1kastner commented 2 years ago

Thank you very much for the project and your effort! I ran .\docs\make.bat simplepdf on one of my projects and just got this decode error here:

Traceback (most recent call last):
  File "C:\Users\...\lib\site-packages\sphinx\cmd\build.py", line 277, in build_main
    app.build(args.force_all, filenames)
  File "C:\Users\...\lib\site-packages\sphinx\application.py", line 349, in build
    self.builder.build_update()
  File "C:\Users\...\lib\site-packages\sphinx\builders\__init__.py", line 298, in build_update
    self.build(['__all__'], to_build)
  File "C:\Users\...\lib\site-packages\sphinx\builders\__init__.py", line 370, in build
    self.finish()
  File "C:\Users\...\lib\site-packages\sphinx_simplepdf\builders\simplepdf.py", line 62, in finish
    index_html = "".join(index_file.readlines())
  File "C:\Python\Python310\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 443: character maps to <undefined>

I thought I just used English characters in Unicode, but obviously I am mistaken. However, the conversion happens so late in the build that I have no idea which file is at fault. Furthermore, everything looks fine in the HTML output, so I would be happiest if the PDF were just built without any change.
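One way to narrow down the offending file is to scan the project for non-ASCII characters. The following is only a minimal sketch: `find_non_ascii` is a hypothetical helper (not part of Sphinx or sphinx-simplepdf), it assumes the files are UTF-8 encoded, and the list of suffixes is a guess. Since the traceback fails while reading generated HTML, the character could also come from a theme or template, so pointing the scan at the build output directory may be just as useful.

```python
from pathlib import Path


def find_non_ascii(root, suffixes=(".rst", ".md", ".py", ".html")):
    """Return (path, line, column, character) for every non-ASCII character.

    Hypothetical helper for illustration; assumes the files are UTF-8 encoded.
    """
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        # errors="replace" keeps the scan going even if a file is not UTF-8.
        text = path.read_text(encoding="utf-8", errors="replace")
        for line_no, line in enumerate(text.split("\n"), start=1):
            for col_no, char in enumerate(line, start=1):
                if ord(char) > 127:
                    hits.append((str(path), line_no, col_no, char))
    return hits
```

Calling `find_non_ascii("docs")` and printing each hit with its code point (for example `f"U+{ord(char):04X}"`) usually makes typographic quotes and dashes easy to spot.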

1kastner commented 2 years ago

I have two proposals:

1. Show in which line of which text file the encoding error occurs. In https://github.com/useblocks/sphinx-simplepdf/blob/main/sphinx_simplepdf/builders/simplepdf.py#L62, maybe this code could be more defensive or give more hints about what went wrong?
2. Allow setting the encoding for all files that are opened when configuring the module.

That would be great! Thanks a lot for your time and effort!
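To illustrate what a more defensive read could look like, here is a rough sketch. `read_text_with_context` is a hypothetical helper and not the actual sphinx-simplepdf code; its `encoding` parameter merely stands in for whatever configurable setting the project might offer. It decodes the raw bytes itself so that a failure can be reported with the file name, line, and byte position.

```python
from pathlib import Path


def read_text_with_context(path, encoding="utf-8"):
    """Read a text file; on a decode failure, report file, line, and byte.

    Hypothetical helper for illustration only. Unlike open() in text mode,
    it does not translate "\r\n" line endings.
    """
    raw = Path(path).read_bytes()
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start is the byte offset of the first undecodable byte.
        line_no = raw.count(b"\n", 0, err.start) + 1
        line_start = raw.rfind(b"\n", 0, err.start) + 1
        column = err.start - line_start + 1  # counted in bytes, 1-based
        raise ValueError(
            f"{path}: cannot decode byte 0x{raw[err.start]:02x} as {encoding} "
            f"(line {line_no}, byte column {column})"
        ) from err
```

Raising a new exception `from err` keeps the original `UnicodeDecodeError` in the traceback while putting the file name up front.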

danwos commented 2 years ago

Thanks for reporting. Looks like my code for manipulating some internal HTML data is not well written and does not really handle encoding.

I will check this and add some tests.

1kastner commented 2 years ago

Encodings are always tricky! I try to always use UTF-8, but Windows defaults to the CP1252 encoding, which also appears in the trace.
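A small sketch of how this mismatch can produce exactly the error in the trace. The thread never identifies the actual character, so U+201D (a typographic closing quote) is only one plausible candidate whose UTF-8 encoding ends in byte 0x9d:

```python
# U+201D (right double quotation mark) is a common "smart quote" in English
# text. It is an assumed example; the real character in the issue is unknown.
encoded = "\u201d".encode("utf-8")
print(encoded)  # b'\xe2\x80\x9d'

# Decoding the same bytes as UTF-8 round-trips cleanly.
print(encoded.decode("utf-8") == "\u201d")  # True

# cp1252 has no character assigned to byte 0x9d, so decoding fails there.
try:
    encoded.decode("cp1252")
except UnicodeDecodeError as err:
    print(err.reason, hex(encoded[err.start]))
```

The first two bytes decode under CP1252 without complaint, which is why such files often show garbled characters instead of failing outright; only the unassigned 0x9d raises.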

danwos commented 2 years ago

OK, I added a small fix to the main branch. It now always sets the encoding to utf-8 for the file operations.
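The general shape of such a fix, as a hedged sketch rather than the actual commit: `read_html` is a hypothetical name, and the real change in sphinx-simplepdf may look different.

```python
def read_html(path):
    """Read an HTML file as UTF-8 regardless of the platform default.

    Hypothetical helper; the actual change in sphinx-simplepdf may differ.
    """
    # Without encoding=..., open() falls back to the locale's preferred
    # encoding, which is cp1252 on many Windows installations.
    with open(path, encoding="utf-8") as index_file:
        return index_file.read()
```

Passing `encoding` explicitly makes the behavior the same on every platform. For code that cannot be changed, Python's UTF-8 mode (`python -X utf8` or the `PYTHONUTF8=1` environment variable) is a possible workaround.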

Thanks for the hint about the default encoding and how it differs on Windows.

I need to find some time to set up the CI for this project and test it on Windows as well.

If you like, you could check out the main branch and test it again :)

1kastner commented 2 years ago

Thanks, the encoding issue is solved and at the end of the process, a PDF shows up! To be honest, there are some issues with the PDF. However, this specific issue is solved. Thanks for the quick fix!

sachin-suresh-rapyuta commented 1 year ago

@1kastner - are you facing any issues while translating the PDFs, such as some Unicode characters not being printed in the PDF? Is this issue along similar lines?

1kastner commented 1 year ago

Nope, this closed issue is about file-level I/O.

useblocks / sphinx-simplepdf

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 443: character maps to <undefined> #7