pqzx / html2docx

Convert html to docx
MIT License
69 stars 49 forks

Getting IndexError: list index out of range when parsing wikipedia page #38

Open alek-tech opened 2 years ago

alek-tech commented 2 years ago

wiki page: "https://en.wikipedia.org/wiki/List_of_public_corporations_by_market_capitalization"

code:

    from docx import Document
    from htmldocx import HtmlToDocx
    import codecs

    file = codecs.open("wiki.html", "r", "utf-8")
    html = file.read()
    new_parser = HtmlToDocx()
    document = Document()
    new_parser.add_html_to_document(html, document)
    document.save('your_file_name.docx')

whole traceback:

      File "c:\Users\alek\Desktop\htmldocx wikipedia.py", line 46, in <module>
        new_parser.add_html_to_document(html, document)
      File "C:\Users\alek\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\htmldocx\h2d.py", line 591, in add_html_to_document
        self.run_process(html)
      File "C:\Users\alek\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\htmldocx\h2d.py", line 583, in run_process
        self.feed(html)
      File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1264.0_x64__qbz5n2kfra8p0\lib\html\parser.py", line 110, in feed
        self.goahead(0)
      File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1264.0_x64__qbz5n2kfra8p0\lib\html\parser.py", line 170, in goahead
        k = self.parse_starttag(i)
      File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.1264.0_x64__qbz5n2kfra8p0\lib\html\parser.py", line 344, in parse_starttag
        self.handle_starttag(tag, attrs)
      File "C:\Users\alek\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\htmldocx\h2d.py", line 453, in handle_starttag
        self.handle_table()
      File "C:\Users\alek\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\htmldocx\h2d.py", line 337, in handle_table
        docx_cell = self.table.cell(cell_row, cell_col)
      File "C:\Users\alek\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\docx\table.py", line 81, in cell
        return self._cells[cell_idx]
    IndexError: list index out of range

Please help!

niraj-chapla commented 2 years ago

I am getting the same error from the method new_parser.parse_html_string(). Drilling down further, I found that the error occurs when the HTML being parsed contains a table in which some cells are merged. If I remove the merged cells from the table, it works fine.
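To check whether a given page is affected before converting it, one option is to scan the HTML for table cells that carry a colspan or rowspan greater than 1. The sketch below uses only the standard library and does not touch htmldocx at all; the names MergedCellFinder and find_merged_cells are made up for illustration, and the sample table is a hypothetical stand-in, not taken from the Wikipedia page.

```python
from html.parser import HTMLParser


class MergedCellFinder(HTMLParser):
    """Collect every <td>/<th> whose colspan or rowspan is greater than 1."""

    def __init__(self):
        super().__init__()
        # Each entry is (tag, attribute name, span value).
        self.merged = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag not in ("td", "th"):
            return
        for name, value in attrs:
            if name not in ("colspan", "rowspan"):
                continue
            # Ignore missing or non-numeric values instead of raising.
            if value and value.strip().isdigit() and int(value) > 1:
                self.merged.append((tag, name, int(value)))


def find_merged_cells(html):
    """Return a list of merged cells found in the given HTML string."""
    finder = MergedCellFinder()
    finder.feed(html)
    finder.close()
    return finder.merged


# Hypothetical sample: the header cell spans two columns.
sample = (
    "<table>"
    "<tr><th colspan='2'>Company</th></tr>"
    "<tr><td>Name</td><td>Market cap</td></tr>"
    "</table>"
)
print(find_merged_cells(sample))
```

If the returned list is non-empty, the document contains merged cells and is a candidate for the IndexError described above. This only detects the condition; it does not fix the conversion.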

Following is the error trace I am getting.

        docx = new_parser.parse_html_string(html_body)
      File "/usr/local/lib/python3.8/dist-packages/htmldocx/h2d.py", line 617, in parse_html_string
        self.run_process(html)
      File "/usr/local/lib/python3.8/dist-packages/htmldocx/h2d.py", line 583, in run_process
        self.feed(html)
      File "/usr/lib/python3.8/html/parser.py", line 111, in feed
        self.goahead(0)
      File "/usr/lib/python3.8/html/parser.py", line 171, in goahead
        k = self.parse_starttag(i)
      File "/usr/lib/python3.8/html/parser.py", line 345, in parse_starttag
        self.handle_starttag(tag, attrs)
      File "/usr/local/lib/python3.8/dist-packages/htmldocx/h2d.py", line 453, in handle_starttag
        self.handle_table()
      File "/usr/local/lib/python3.8/dist-packages/htmldocx/h2d.py", line 337, in handle_table
        docx_cell = self.table.cell(cell_row, cell_col)
      File "/usr/local/lib/python3.8/dist-packages/docx/table.py", line 81, in cell
        return self._cells[cell_idx]
    IndexError: list index out of range