pqzx / html2docx

Convert html to docx
MIT License

UnicodeDecodeError: how to convert html documents with accented characters? #27

Open abubelinha opened 2 years ago

abubelinha commented 2 years ago

Error message:

Traceback (most recent call last):
  File "thesis.py", line 194, in <module>
    htmldocx("index-utf8.html")
  File "thesis.py", line 58, in htmldocx
    new_parser.parse_html_file("index-utf8.html", "outputfile.docx")
  File "C:\Python38\lib\site-packages\htmldocx\h2d.py", line 655, in parse_html_file
    html = infile.read()
  File "C:\Python38\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 48: character maps to <undefined>

This is my htmldocx() function:

# https://github.com/pqzx/html2docx , https://pypi.org/project/htmldocx/
from htmldocx import HtmlToDocx

def htmldocx(html_file, docx_file):
    """Convert html_file to docx_file with htmldocx."""
    new_parser = HtmlToDocx()
    new_parser.parse_html_file(html_file, docx_file)

This is index-utf8.html content:

<html>
<head>
<meta charset='UTF-8'>
<title>Índice</title>
</head>
<body>
<h1>Índice</h1>
<h2>Capítulo 1: Introducción</h2>
<h2>Capítulo 2: Material y métodos</h2>
<h2>Capítulo 3: Exposición</h2>
<h2>Capítulo 4: Conclusión</h2>
<h2>Apéndices</h2>
<h3>Tablas e imágenes</h3>
<h3>Bibliografía</h3>
</html>

abubelinha commented 2 years ago

I think this change may solve it: https://github.com/abubelinha/html2docx/commit/a9438edd47fa7dbc04fb54836874f1ab8ec19b21
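
For context on why the error appears: the traceback shows the file being decoded through `encodings\cp1252.py`, so the UTF-8 bytes of `Í` (`0xC3 0x8D`) are read as cp1252, where `0x8D` has no mapping. The sketch below reproduces that with the standard library only. It assumes the file was saved with Windows (CRLF) line endings, which is a guess that happens to put the offending byte at position 48, as in the traceback.

```python
# First lines of index-utf8.html as bytes on disk.
# Assumption: UTF-8 encoding with CRLF line endings.
raw = (
    "<html>\r\n"
    "<head>\r\n"
    "<meta charset='UTF-8'>\r\n"
    "<title>Índice</title>\r\n"
).encode("utf-8")

# Decoding with cp1252 (the default seen in the traceback) fails.
try:
    raw.decode("cp1252")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error)

# Decoding with the encoding the file was actually written in works.
text = raw.decode("utf-8")
print(text)
```

Any fix therefore has to make sure the HTML is decoded as UTF-8 rather than with the platform default.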

lmanstl commented 1 year ago

> I think this change may solve it: abubelinha@a9438ed

I was having this issue with an HTML file that contained emojis, and this change resolved every problem I had with that set of files. Thanks.
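
For anyone who cannot patch the installed package, a possible workaround is to read the file yourself with an explicit encoding and hand the text to the parser. This is a minimal sketch, independent of the linked commit. `read_html` and `html_file_to_docx` are hypothetical helper names. It assumes `HtmlToDocx.parse_html_string()` returns a python-docx `Document`, as described in the htmldocx README.

```python
from pathlib import Path


def read_html(path, encoding="utf-8"):
    """Read an HTML file with an explicit encoding, not the platform default."""
    return Path(path).read_text(encoding=encoding)


def html_file_to_docx(html_file, docx_file):
    """Convert a UTF-8 HTML file to .docx without using parse_html_file()."""
    # Imported here so read_html() stays usable without htmldocx installed.
    from htmldocx import HtmlToDocx

    parser = HtmlToDocx()
    # Assumption: parse_html_string() returns a python-docx Document.
    document = parser.parse_html_string(read_html(html_file))
    document.save(docx_file)
```

Usage would be `html_file_to_docx("index-utf8.html", "outputfile.docx")`. Because the text is already decoded before it reaches the parser, accented characters and emojis should no longer depend on the Windows default code page.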