Open waghsanket opened 2 months ago
Yes. It should work with any docx document that follows the docx standards. What happens here is that the parser assumes that this element exists in the XML, but it doesn't. What I'll do is add a check in the extract_element helper function to make sure that it gracefully handles the case where the element doesn't exist. I'll push it and make a new version available today. Let me know if you still have problems. Also, it would be helpful to see the actual file or its relevant XML files, so I can see the actual XML structure that breaks the parser. I will later add an XML extraction utility for future issues. Thanks for bringing the issue up. I hope the library works well for you so far.
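For illustration, the kind of check described above could look roughly like this. It is only a sketch: it uses the standard library's ElementTree rather than whatever XML backend the library actually uses, and the namespace mapping is an assumption. The idea is that a missing parent yields None instead of raising AttributeError.

```python
import xml.etree.ElementTree as ET

# Assumption: the standard WordprocessingML namespace mapped to the "w" prefix.
NAMESPACE = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}


def extract_element(parent, path):
    """Return the first element matching `path`, or None if `parent` is missing."""
    if parent is None:
        return None
    return parent.find(path, namespaces=NAMESPACE)


# A table row without <w:trPr>, as some editors may write it.
row = ET.fromstring(
    '<w:tr xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:tc/></w:tr>"
)
tr_pr = extract_element(row, "w:trPr")               # None: no row properties
tr_height = extract_element(tr_pr, ".//w:trHeight")  # None instead of AttributeError
print(tr_pr, tr_height)
```

Callers then have to treat None as "property not set" and fall back to a default.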
Thanks for the quick response @omer-go. Unfortunately, I am getting this error on all files and cannot share the Word file due to security concerns.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?> <?mso-application progid="Word.Document"?>
The XML shared above is a minimal example converted from a WPS doc file. A simple table like the one above raises the issue. Perhaps the file format is somehow wrong.
It seems to me that it's a problem of defaults when creating a file in WPS. Each platform (MS Office, LibreOffice, WPS) chooses the default elements to create for each component. The update I just pushed should solve the issue. I will need to release a new version on PyPI for the update to take effect, but you can clone the updated code directly from the repo right now and let me know if it solves the issue for you (it's a small fix in one file).
I will release a newer version later today when I get a chance
html_output = converter.convert_to_html()
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\sanket.wagh\Desktop\doc converter\docx_parser_converter\docx_to_html\docx_to_html_converter.py", line 46, in convert_to_html
return HtmlGenerator.generate_html(self.document_schema, self.numbering_schema)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\sanket.wagh\Desktop\doc converter\docx_parser_converter\docx_to_html\html_generator.py", line 37, in generate_html
body_html = HtmlGenerator.generate_html_body(document_schema.doc_margins, document_schema.elements, numbering_schema)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\sanket.wagh\Desktop\doc converter\docx_parser_converter\docx_to_html\html_generator.py", line 73, in generate_html_body
table_html = TableConverter.convert_table(element)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\sanket.wagh\Desktop\doc converter\docx_parser_converter\docx_to_html\converters\table_converter.py", line 46, in convert_table
rows_html = TableConverter.convert_rows(table.rows, table.properties.tblCellMar)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\sanket.wagh\Desktop\doc converter\docx_parser_converter\docx_to_html\converters\table_converter.py", line 134, in convert_rows
row_html = TableConverter.convert_row(row, tblCellMar)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\sanket.wagh\Desktop\doc converter\docx_parser_converter\docx_to_html\converters\table_converter.py", line 164, in convert_row
cells_html = TableConverter.convert_cells(row.cells, tblCellMar)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\sanket.wagh\Desktop\doc converter\docx_parser_converter\docx_to_html\converters\table_converter.py", line 226, in convert_cells
paragraph_html = ParagraphConverter.convert_paragraph(paragraph, None)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\sanket.wagh\Desktop\doc converter\docx_parser_converter\docx_to_html\converters\paragraph_converter.py", line 37, in convert_paragraph
paragraph_html += NumberingConverter.convert_numbering(paragraph, numbering_schema)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\sanket.wagh\Desktop\doc converter\docx_parser_converter\docx_to_html\converters\numbering_converter.py", line 44, in convert_numbering
numbering_level = NumberingConverter.get_numbering_level(numbering_schema, numbering.numId, numbering.ilvl)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\sanket.wagh\Desktop\doc converter\docx_parser_converter\docx_to_html\converters\numbering_converter.py", line 131, in get_numbering_level
instance = next((inst for inst in numbering_schema.instances if inst.numId == numId), None)
^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'instances'
I'm facing this issue now in the actual file.
Ok, the numbering schema is sometimes saved in other xml files that are not standard (like template xmls that this library currently does not support). I can have a quick fix later, but I'll ask you to run something later to extract the numbering schema and the numbering.xml files so we can get a better look at the problem (I understand the security reasons, but the actual content will only be in the document.xml file, so the numbering.xml should be ok to upload since it only contains the schema). I'll try to provide some instructions on how to do that later today.
@waghsanket you should run this code snippet to extract the xml files from the docx. The files will be saved in a folder where the docx file is saved. Then, please upload the numbering.xml file and also add the full list of all xml files extracted so we'll know the structure. Again, the only important content will be in the document.xml (the text content of the document) so don't share that file.
from docx_parser_converter.docx_parsers.helpers.docx_xml_list import extract_docx_xml
# Path to your DOCX file
docx_path = "path/to/your/document.docx"
# Extract XML content from the DOCX file
extract_docx_xml(docx_path)
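If it is easier to report only the list of parts, note that a DOCX file is a ZIP archive, so the standard library can list its contents without touching the document text. This is a minimal sketch, independent of extract_docx_xml; list_docx_parts is a hypothetical helper name, not part of the library.

```python
import io
import zipfile


def list_docx_parts(docx_source):
    """Return the sorted names of the .xml and .rels parts inside a DOCX archive."""
    with zipfile.ZipFile(docx_source) as archive:
        return sorted(
            name for name in archive.namelist()
            if name.endswith((".xml", ".rels"))
        )


# Demo with an in-memory archive standing in for a real file on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<doc/>")
    archive.writestr("word/numbering.xml", "<numbering/>")
    archive.writestr("word/media/image1.png", b"")
buffer.seek(0)

parts = list_docx_parts(buffer)
print(parts)  # ['word/document.xml', 'word/numbering.xml']
```

With a real document, pass the file path instead of the buffer. Only the part names are printed, so nothing from document.xml is exposed.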
In the meantime, I'll add another patch to handle the exception gracefully (v0.5.1.2). What will happen is that you'll get bullet points for each numbering level that was not processed properly, but at least it keeps the structure of the numbering and doesn't ignore it entirely. Let me know if that works.
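To make the fallback idea concrete, here is a rough sketch of it. The classes and the resolve_marker function are hypothetical stand-ins for illustration only, not the library's actual schema models. Whenever the schema, the instance, or the level cannot be found, a bullet is returned instead of raising.

```python
from dataclasses import dataclass, field
from typing import List, Optional

BULLET = "\u2022"  # fallback marker for numbering that cannot be resolved


@dataclass
class Level:
    ilvl: int
    marker: str  # already-formatted marker, e.g. "1." or "a."


@dataclass
class Instance:
    num_id: int
    levels: List[Level] = field(default_factory=list)


@dataclass
class NumberingSchema:
    instances: List[Instance] = field(default_factory=list)


def resolve_marker(schema: Optional[NumberingSchema], num_id: int, ilvl: int) -> str:
    """Return the marker for (num_id, ilvl), or a bullet if anything is missing."""
    if schema is None:
        return BULLET
    instance = next((i for i in schema.instances if i.num_id == num_id), None)
    if instance is None:
        return BULLET
    level = next((lvl for lvl in instance.levels if lvl.ilvl == ilvl), None)
    return level.marker if level is not None else BULLET


schema = NumberingSchema([Instance(1, [Level(0, "1.")])])
print(resolve_marker(schema, 1, 0))  # 1.
print(resolve_marker(None, 1, 0))    # bullet: no schema was parsed
print(resolve_marker(schema, 1, 3))  # bullet: level 3 is not defined
```

The paragraph still renders as a list item with its indentation, so the structure survives even when the exact marker does not.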
I have converted it, but I cannot upload any files due to restrictions. :(
@waghsanket try the new version and tell me if it works for you.
Having this issue:
Extracted: docProps/custom.xml to D:\sharjeel\demo\DID-1287-Acadia contract blue\docProps_custom.xml.txt
Warning: Failed to parse numbering.xml. Using default numbering schema. Error: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'
Traceback (most recent call last):
File "D:\sharjeel\demo\py-docx.py", line 13, in
self.document_schema, self.styles_schema, self.numbering_schema = DocxProcessor.process_docx(docx_file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\sharjeel\demo.venv\Lib\site-packages\docx_parser_converter\docx_to_html\docx_processor.py", line 61, in process_docx
style_merger = StyleMerger(document_schema, styles_schema, numbering_schema)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\sharjeel\demo.venv\Lib\site-packages\docx_parser_converter\docx_parsers\styles\styles_merger.py", line 49, in __init__
self.merge_styles()
File "D:\sharjeel\demo.venv\Lib\site-packages\docx_parser_converter\docx_parsers\styles\styles_merger.py", line 93, in merge_styles
self.merge_paragraph_styles(element)
File "D:\sharjeel\demo.venv\Lib\site-packages\docx_parser_converter\docx_parsers\styles\styles_merger.py", line 108, in merge_paragraph_styles
self.apply_numbering_properties(paragraph)
File "D:\sharjeel\demo.venv\Lib\site-packages\docx_parser_converter\docx_parsers\styles\styles_merger.py", line 135, in apply_numbering_properties
numbering_instance = next((instance for instance in self.numbering_schema.instances if instance.numId == num_id), None)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'instances'
### Code:
from docx_parser_converter.docx_to_html.docx_to_html_converter import (
    DocxToHtmlConverter,
)
from docx_parser_converter.docx_parsers.helpers.docx_xml_list import extract_docx_xml
from docx_parser_converter.docx_parsers.utils import read_binary_from_file_path

docx_path = r"D:\sharjeel\demo\Hello.docx"
html_output_path = r"D:\sharjeel\demo\py.html"
extract_docx_xml(docx_path)
docx_file_content = read_binary_from_file_path(docx_path)

converter = DocxToHtmlConverter(docx_file_content, use_default_values=True)
html_output = converter.convert_to_html()
converter.save_html_to_file(html_output, html_output_path)
### Version: docx-parser-converter==0.5.1.2
@ZeeshanImperium - look at my reply here. Please follow the instructions and copy and paste the contents of the numbering.xml file to this thread so I can see its structure.
@waghsanket you should run this code snippet to extract the xml files from the docx. The files will be saved in a folder where the docx file is saved. Then, please upload the numbering.xml file and also add the full list of all xml files extracted so we'll know the structure. Again, the only important content will be in the document.xml (the text content of the document) so don't share that file.
from docx_parser_converter.docx_parsers.helpers.docx_xml_list import extract_docx_xml
# Path to your DOCX file
docx_path = "path/to/your/document.docx"
# Extract XML content from the DOCX file
extract_docx_xml(docx_path)
In the meantime, I'll add another patch to handle the exception gracefully (v0.5.1.2). What will happen is that you'll get bullet points for each numbering level that was not processed properly, but at least it keeps the structure of the numbering and doesn't ignore it entirely. Let me know if that works.
It makes this file: word_numbering.xml.txt
Print output of the extract function:
Extracted: word/numbering.xml to D:\sharjeel\demo\msasd\word_numbering.xml.txt
word_numbering.xml.txt content: <?xml version="1.0" ?>
@ZeeshanImperium thanks for uploading the content. I also downloaded the WPS suite to create and debug a WPS-created docx document myself. What's happening here is that WPS creates "invisible" numbered paragraphs with no level information, which does not conform to the numPr schema. The resulting object lacks some information that the style merging class tries to use and apply, so it fails. I will work on patching this when I get a chance and release an updated version. I also noticed that tables are not converted properly, so I will try to fix that as well.
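As an illustration of the failure mode, the sketch below parses a paragraph's numbering properties with the standard library's ElementTree and tolerates a missing level. The names parse_num_pr and safe_int are hypothetical, and treating an incomplete numPr as "not numbered" is only one possible policy, not necessarily what the patch will do.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def safe_int(value):
    """Convert to int, returning None instead of raising on None or bad input."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def parse_num_pr(paragraph):
    """Return (numId, ilvl), or None when numbering info is absent or incomplete."""
    num_pr = paragraph.find("w:pPr/w:numPr", NS)
    if num_pr is None:
        return None

    def val(tag):
        element = num_pr.find(f"w:{tag}", NS)
        return safe_int(element.get(f"{{{W}}}val")) if element is not None else None

    num_id, ilvl = val("numId"), val("ilvl")
    if num_id is None or ilvl is None:
        return None
    return num_id, ilvl


complete = ET.fromstring(
    f'<w:p xmlns:w="{W}"><w:pPr><w:numPr>'
    '<w:ilvl w:val="0"/><w:numId w:val="3"/>'
    "</w:numPr></w:pPr></w:p>"
)
incomplete = ET.fromstring(
    f'<w:p xmlns:w="{W}"><w:pPr><w:numPr><w:numId w:val="3"/></w:numPr></w:pPr></w:p>'
)
print(parse_num_pr(complete))    # (3, 0)
print(parse_num_pr(incomplete))  # None
```

The safe_int helper also avoids the "int() argument must be ... not 'NoneType'" error seen in the warning above when an expected attribute is absent.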
@omer-go Thanks for your quick replies. I am converting an MS Word-created docx file to HTML. The code worked on a simple docx file but not with complex files that have images, headers, etc. The other thing is that the code converts all the text to <p> tags, whereas I want the headings to be in heading tags and lists to be in ordered and unordered HTML lists, respectively. It also doesn't pick up the background color.
The converted HTML:
<html>
<body>
<div
style="padding-top:72.0pt; padding-right:72.0pt; padding-bottom:72.0pt; padding-left:72.0pt; padding-top:36.0pt; padding-bottom:36.0pt;">
<p style="margin-bottom:8.0pt;line-height:12.95pt;margin-left:36.0pt;text-indent:-18.0pt;"><span
style="font-family:Symbol;"></span><span style="padding-left:7.2pt;"></span><span
style="font-weight:bold;font-style:italic;text-decoration:underline;font-family:ADLaM Display;font-size:24.0pt;">Zeeshan</span>
</p>
<p style="margin-bottom:8.0pt;line-height:12.95pt;margin-left:36.0pt;text-indent:-18.0pt;"><span
style="font-family:Symbol;"></span><span style="padding-left:7.2pt;"></span><span
style="font-weight:bold;font-style:italic;text-decoration:underline;font-family:ADLaM Display;font-size:24.0pt;">Ahmed
</span></p>
<p style="margin-bottom:8.0pt;line-height:12.95pt;margin-left:36.0pt;text-indent:-18.0pt;"><span
style="font-family:Symbol;"></span><span style="padding-left:7.2pt;"></span><span
style="font-weight:bold;font-style:italic;text-decoration:underline;font-family:ADLaM Display;font-size:24.0pt;">Hamza</span>
</p>
<p style="margin-bottom:8.0pt;line-height:12.95pt;margin-left:36.0pt;text-indent:-18.0pt;">1.<span
style="padding-left:7.2pt;"></span><span style="color:C1E4F5;font-size:11.0pt;">Zeeshan</span></p>
<p style="margin-bottom:8.0pt;line-height:12.95pt;margin-left:36.0pt;text-indent:-18.0pt;">2.<span
style="padding-left:7.2pt;"></span><span style="color:C1E4F5;font-size:11.0pt;">Ahmed </span></p>
<p style="margin-bottom:8.0pt;line-height:12.95pt;margin-left:36.0pt;text-indent:-18.0pt;">3.<span
style="padding-left:7.2pt;"></span><span style="color:C1E4F5;font-size:11.0pt;">Hamza</span></p>
<p style="margin-bottom:8.0pt;line-height:12.95pt;margin-left:36.0pt;text-indent:-18.0pt;">a.<span
style="padding-left:7.2pt;"></span><span style="color:FF0000;font-size:11.0pt;">Zeeshan</span></p>
<p style="margin-bottom:8.0pt;line-height:12.95pt;margin-left:36.0pt;text-indent:-18.0pt;">b.<span
style="padding-left:7.2pt;"></span><span style="color:FF0000;font-size:11.0pt;">Ahmed </span></p>
<p style="margin-bottom:8.0pt;line-height:12.95pt;margin-left:36.0pt;text-indent:-18.0pt;">c.<span
style="padding-left:7.2pt;"></span><span style="color:FF0000;font-size:11.0pt;">Hamza</span></p>
<p style="margin-bottom:8.0pt;line-height:12.95pt;margin-left:36.0pt;"></p>
<p style="margin-top:18.0pt;margin-bottom:4.0pt;line-height:12.95pt;"><span
style="font-weight:bold;font-style:italic;text-decoration:underline;color:0F9ED5;font-family:Arial Black;font-size:24.0pt;">Hello->Heading
1</span>
</p>
<p style="margin-bottom:8.0pt;line-height:12.95pt;margin-left:36.0pt;"></p>
<p style="margin-bottom:8.0pt;line-height:12.95pt;"></p>
<p style="margin-bottom:8.0pt;line-height:12.95pt;"></p>
</div>
</body>
</html>
@ZeeshanImperium - thanks for your note. I'm happy you're finding this project useful.
Let's unpack some of the things you mentioned:
The code worked on simple docx file but not with complex files that have images,headers,etc.
That's true and it's also stated in the README file:
Unsupported Components
- Images: Parsing and extraction of images embedded within the document.
- Headers and Footers: Parsing of headers and footers content.
- Footnotes and Endnotes: Handling footnotes and endnotes within the document.
- Comments: Extraction and handling of comments.
- Custom XML Parts: Any custom XML parts beyond the standard DOCX schema.
These elements were not required for my current project at the moment, but I might extend it in the future and support it when I get some time. In any case, contributions to the project are welcome if people want to help support it before I get to it.
The other thing is that the code converts all the text to <p> tags where as I want the headings to be in heading tags and list to be in ordered and unordered list in html respectively
This is designed to create a WYSIWYG conversion of the docx file. Using ordered and unordered HTML list tags would render the numbering incorrectly. For some use cases it doesn't matter; for others it does. The downside is that a reverse conversion of the rendered HTML back to docx is more challenging, but it's not a use case we are currently concerned about.
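To show what that means in practice, here is a small sketch of rendering a numbered paragraph as a plain paragraph with a literal marker and a hanging indent, modeled on the converted HTML posted above. The function name and default values are hypothetical, not the library's actual code.

```python
from html import escape


def render_numbered_paragraph(marker: str, text: str,
                              indent_pt: float = 36.0,
                              hanging_pt: float = 18.0) -> str:
    """Render a list item as <p> with a literal marker instead of <ol>/<ul>."""
    style = f"margin-left:{indent_pt}pt;text-indent:-{hanging_pt}pt;"
    return (
        f'<p style="{style}">{escape(marker)}'
        '<span style="padding-left:7.2pt;"></span>'
        f"<span>{escape(text)}</span></p>"
    )


print(render_numbered_paragraph("a.", "Zeeshan"))
```

Because the marker is written out as text, the browser cannot renumber or restyle it, which is what keeps the output faithful to the original document's numbering format.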
It also doesn't pick background color
You're right, that's a styling component that should be supported, I'll add it to the todo list.
I am getting the following error
File "C:\Users\sanket.wagh\Desktop\doc converter\venv\Lib\site-packages\docx_parser_converter\docx_to_html\docx_processor.py", line 55, in process_docx
document_parser = DocumentParser(docx_file)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\sanket.wagh\Desktop\doc converter\venv\Lib\site-packages\docx_parser_converter\docx_parsers\document\document_parser.py", line 29, in __init__
self.document_schema = self.parse()
^^^^^^^^^^^^
File "C:\Users\sanket.wagh\Desktop\doc converter\venv\Lib\site-packages\docx_parser_converter\docx_parsers\document\document_parser.py", line 41, in parse
elements = self.extract_elements()
^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\sanket.wagh\Desktop\doc converter\venv\Lib\site-packages\docx_parser_converter\docx_parsers\document\document_parser.py", line 73, in extract_elements
elements.append(tables_parser.parse())
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\sanket.wagh\Desktop\doc converter\venv\Lib\site-packages\docx_parser_converter\docx_parsers\tables\tables_parser.py", line 71, in parse
rows = [TableRowParser.parse(row) for row in self.root.findall(".//w:tr", namespaces=NAMESPACE)]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\sanket.wagh\Desktop\doc converter\venv\Lib\site-packages\docx_parser_converter\docx_parsers\tables\tables_parser.py", line 71, in
rows = [TableRowParser.parse(row) for row in self.root.findall(".//w:tr", namespaces=NAMESPACE)]
^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\sanket.wagh\Desktop\doc converter\venv\Lib\site-packages\docx_parser_converter\docx_parsers\tables\table_row_parser.py", line 52, in parse
properties = TableRowPropertiesParser.parse(properties_element)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\sanket.wagh\Desktop\doc converter\venv\Lib\site-packages\docx_parser_converter\docx_parsers\tables\table_row_properties_parser.py", line 45, in parse
trHeight=TableRowPropertiesParser.extract_row_height(trPr_element),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\sanket.wagh\Desktop\doc converter\venv\Lib\site-packages\docx_parser_converter\docx_parsers\tables\table_row_properties_parser.py", line 71, in extract_row_height
height_element = extract_element(element, ".//w:trHeight")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\sanket.wagh\Desktop\doc converter\venv\Lib\site-packages\docx_parser_converter\docx_parsers\helpers\common_helpers.py", line 33, in extract_element
return parent.find(path, namespaces=NAMESPACE)
^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'find'