omer-go / docx-parser-converter

Parsers to process, store and convert docx files to html and txt formats.
https://docx-parser-and-converter.readthedocs.io/en/latest/
MIT License
7 stars 0 forks

Is conversion of docx from WPS possible? #2

Open waghsanket opened 2 months ago

waghsanket commented 2 months ago

I am getting the following error:

  File "C:\Users\sanket.wagh\Desktop\doc converter\venv\Lib\site-packages\docx_parser_converter\docx_to_html\docx_processor.py", line 55, in process_docx
    document_parser = DocumentParser(docx_file)
  File "C:\Users\sanket.wagh\Desktop\doc converter\venv\Lib\site-packages\docx_parser_converter\docx_parsers\document\document_parser.py", line 29, in __init__
    self.document_schema = self.parse()
  File "C:\Users\sanket.wagh\Desktop\doc converter\venv\Lib\site-packages\docx_parser_converter\docx_parsers\document\document_parser.py", line 41, in parse
    elements = self.extract_elements()
  File "C:\Users\sanket.wagh\Desktop\doc converter\venv\Lib\site-packages\docx_parser_converter\docx_parsers\document\document_parser.py", line 73, in extract_elements
    elements.append(tables_parser.parse())
  File "C:\Users\sanket.wagh\Desktop\doc converter\venv\Lib\site-packages\docx_parser_converter\docx_parsers\tables\tables_parser.py", line 71, in parse
    rows = [TableRowParser.parse(row) for row in self.root.findall(".//w:tr", namespaces=NAMESPACE)]
  File "C:\Users\sanket.wagh\Desktop\doc converter\venv\Lib\site-packages\docx_parser_converter\docx_parsers\tables\tables_parser.py", line 71, in <listcomp>
    rows = [TableRowParser.parse(row) for row in self.root.findall(".//w:tr", namespaces=NAMESPACE)]
  File "C:\Users\sanket.wagh\Desktop\doc converter\venv\Lib\site-packages\docx_parser_converter\docx_parsers\tables\table_row_parser.py", line 52, in parse
    properties = TableRowPropertiesParser.parse(properties_element)
  File "C:\Users\sanket.wagh\Desktop\doc converter\venv\Lib\site-packages\docx_parser_converter\docx_parsers\tables\table_row_properties_parser.py", line 45, in parse
    trHeight=TableRowPropertiesParser.extract_row_height(trPr_element),
  File "C:\Users\sanket.wagh\Desktop\doc converter\venv\Lib\site-packages\docx_parser_converter\docx_parsers\tables\table_row_properties_parser.py", line 71, in extract_row_height
    height_element = extract_element(element, ".//w:trHeight")
  File "C:\Users\sanket.wagh\Desktop\doc converter\venv\Lib\site-packages\docx_parser_converter\docx_parsers\helpers\common_helpers.py", line 33, in extract_element
    return parent.find(path, namespaces=NAMESPACE)
AttributeError: 'NoneType' object has no attribute 'find'

omer-go commented 2 months ago

Yes. It should work with any docx document that follows the docx standard. What happens here is that the parser assumes that this element exists in the XML, but it doesn't. I'll add a check to the extract_element helper function so that it gracefully handles the case where the element doesn't exist. I'll push it and make a new version available today. Let me know if you still have problems. Also, it would be helpful to see the actual file or its relevant XML files, so I can see the XML structure that breaks the parser. I will later add an XML extraction utility for future issues. Thanks for bringing the issue up. I hope the library works well for you so far.
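
A minimal sketch of the kind of guard described above, not the library's actual patch. It assumes the helper receives an optional parent element and an XPath-style path, as the traceback suggests, and it uses the standard library's ElementTree for illustration (the library may use a different XML package). The `NAMESPACE` mapping here is an assumption that mirrors the name seen in the traceback.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# WordprocessingML namespace behind the "w:" prefix (assumed to match the library's NAMESPACE).
NAMESPACE = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}
W = NAMESPACE["w"]


def extract_element(parent: Optional[ET.Element], path: str) -> Optional[ET.Element]:
    """Return the first match for `path` under `parent`, or None if the parent is missing."""
    if parent is None:
        # A writer such as WPS may omit optional containers like w:trPr entirely.
        return None
    return parent.find(path, namespaces=NAMESPACE)


# A table row without a w:trPr child, similar to what the WPS file seems to contain.
row_without_props = ET.fromstring(f'<w:tr xmlns:w="{W}"><w:tc/></w:tr>')
trPr = extract_element(row_without_props, "w:trPr")
height = extract_element(trPr, ".//w:trHeight")  # no AttributeError, just None
print(trPr, height)  # prints: None None
```

With a guard like this, callers still have to treat a `None` result as "property not set" rather than assume a value is present.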

waghsanket commented 2 months ago

Thanks for the quick response @omer-go. Unfortunately I am getting this error on all files and cannot share the Word file due to security concerns.

waghsanket commented 2 months ago

<?xml version="1.0" encoding="UTF-8" standalone="yes"?> <?mso-application progid="Word.Document"?>

Sanket.WaghSanket.Wagh12024-08-29T17:20:00Z2024-08-29T17:25:52Z0100000141033-11.2.0.11516F2F6D0FDC9334C8E8257259E5916E54Ehihihihihihi111111233333
waghsanket commented 2 months ago

The XML shared above is the minimal XML converted from a WPS doc file. A simple table like the one above raises the issue. Perhaps the file format is somehow wrong.

omer-go commented 2 months ago

It seems to me that it's a problem of defaults when creating a file in WPS. Each platform (MS Office, LibreOffice, WPS) chooses which default elements to create for each component. The update I just pushed should solve the issue. I will need to release a new version on PyPI for the update to take effect, but you can clone the updated code directly from the repo right now and let me know if it solves the issue for you (it's a small fix in one file).

omer-go commented 2 months ago

I will release a newer version later today when I get a chance

waghsanket commented 2 months ago

    html_output = converter.convert_to_html()
  File "C:\Users\sanket.wagh\Desktop\doc converter\docx_parser_converter\docx_to_html\docx_to_html_converter.py", line 46, in convert_to_html
    return HtmlGenerator.generate_html(self.document_schema, self.numbering_schema)
  File "C:\Users\sanket.wagh\Desktop\doc converter\docx_parser_converter\docx_to_html\html_generator.py", line 37, in generate_html
    body_html = HtmlGenerator.generate_html_body(document_schema.doc_margins, document_schema.elements, numbering_schema)
  File "C:\Users\sanket.wagh\Desktop\doc converter\docx_parser_converter\docx_to_html\html_generator.py", line 73, in generate_html_body
    table_html = TableConverter.convert_table(element)
  File "C:\Users\sanket.wagh\Desktop\doc converter\docx_parser_converter\docx_to_html\converters\table_converter.py", line 46, in convert_table
    rows_html = TableConverter.convert_rows(table.rows, table.properties.tblCellMar)
  File "C:\Users\sanket.wagh\Desktop\doc converter\docx_parser_converter\docx_to_html\converters\table_converter.py", line 134, in convert_rows
    row_html = TableConverter.convert_row(row, tblCellMar)
  File "C:\Users\sanket.wagh\Desktop\doc converter\docx_parser_converter\docx_to_html\converters\table_converter.py", line 164, in convert_row
    cells_html = TableConverter.convert_cells(row.cells, tblCellMar)
  File "C:\Users\sanket.wagh\Desktop\doc converter\docx_parser_converter\docx_to_html\converters\table_converter.py", line 226, in convert_cells
    paragraph_html = ParagraphConverter.convert_paragraph(paragraph, None)
  File "C:\Users\sanket.wagh\Desktop\doc converter\docx_parser_converter\docx_to_html\converters\paragraph_converter.py", line 37, in convert_paragraph
    paragraph_html += NumberingConverter.convert_numbering(paragraph, numbering_schema)
  File "C:\Users\sanket.wagh\Desktop\doc converter\docx_parser_converter\docx_to_html\converters\numbering_converter.py", line 44, in convert_numbering
    numbering_level = NumberingConverter.get_numbering_level(numbering_schema, numbering.numId, numbering.ilvl)
  File "C:\Users\sanket.wagh\Desktop\doc converter\docx_parser_converter\docx_to_html\converters\numbering_converter.py", line 131, in get_numbering_level
    instance = next((inst for inst in numbering_schema.instances if inst.numId == numId), None)
AttributeError: 'NoneType' object has no attribute 'instances'

waghsanket commented 2 months ago

Facing this issue now in the actual file.

omer-go commented 2 months ago

Ok, the numbering schema is sometimes saved in other xml files that are not standard (like template xmls that this library currently does not support). I can have a quick fix later, but I'll ask you to run something later to extract the numbering schema and the numbering.xml files so we can get a better look at the problem (I understand the security reasons, but the actual content will only be in the document.xml file, so the numbering.xml should be ok to upload since it only contains the schema). I'll try to provide some instructions on how to do that later today.

omer-go commented 2 months ago

@waghsanket you should run this code snippet to extract the xml files from the docx. The files will be saved in a folder where the docx file is saved. Then, please upload the numbering.xml file and also add the full list of all xml files extracted so we'll know the structure. Again, the only important content will be in the document.xml (the text content of the document) so don't share that file.

from docx_parser_converter.docx_parsers.helpers.docx_xml_list import extract_docx_xml

# Path to your DOCX file
docx_path = "path/to/your/document.docx"

# Extract XML content from the DOCX file
extract_docx_xml(docx_path)
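
For readers who cannot install the updated package, a rough standard-library alternative is sketched below. It relies only on the fact that a .docx file is a ZIP archive. The function names `list_docx_parts` and `read_docx_part` are hypothetical and not part of docx-parser-converter, and decoding parts as UTF-8 is an assumption that holds for typical files.

```python
import zipfile
from typing import List


def list_docx_parts(docx_path: str) -> List[str]:
    """Return the sorted names of all .xml parts inside a .docx archive."""
    with zipfile.ZipFile(docx_path) as archive:
        return sorted(name for name in archive.namelist() if name.endswith(".xml"))


def read_docx_part(docx_path: str, part_name: str) -> str:
    """Return one part (for example "word/numbering.xml") as text, assuming UTF-8."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read(part_name).decode("utf-8")


# Example usage (the path is a placeholder):
#   for name in list_docx_parts("path/to/your/document.docx"):
#       print(name)
#   print(read_docx_part("path/to/your/document.docx", "word/numbering.xml"))
```

The list of part names contains no document text, so it can usually be shared even when word/document.xml cannot.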

In the meantime, I'll add another patch to handle the exception gracefully (v0.5.1.2). What will happen is that you'll get bullet points for each numbering level that was not processed properly, but at least it keeps the structure of the numbering and doesn't ignore it entirely. Let me know if that works.
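
The fallback described here could look roughly like the sketch below. It is an illustration only: the dataclasses and the `numbering_marker` function are hypothetical stand-ins, loosely modeled on the attribute names in the traceback (`instances`, `numId`, `ilvl`), and do not reproduce the library's real models.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NumberingLevel:
    ilvl: int
    fmt: str  # e.g. "decimal" or "bullet" (hypothetical field)


@dataclass
class NumberingInstance:
    numId: int
    levels: List[NumberingLevel] = field(default_factory=list)


@dataclass
class NumberingSchema:
    instances: List[NumberingInstance] = field(default_factory=list)


def get_numbering_level(
    schema: Optional[NumberingSchema], num_id: int, ilvl: int
) -> Optional[NumberingLevel]:
    """Look up a level, returning None instead of raising when anything is missing."""
    if schema is None:
        return None
    instance = next((inst for inst in schema.instances if inst.numId == num_id), None)
    if instance is None:
        return None
    return next((lvl for lvl in instance.levels if lvl.ilvl == ilvl), None)


def numbering_marker(
    schema: Optional[NumberingSchema], num_id: int, ilvl: int, counter: int
) -> str:
    """Return the list marker text, falling back to a bullet if the level is unresolved."""
    level = get_numbering_level(schema, num_id, ilvl)
    if level is None or level.fmt == "bullet":
        return "\u2022"  # bullet character
    return f"{counter}."
```

The paragraph keeps its list indentation and gets a visible marker, so the structure survives even when the exact numbering format cannot be determined.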

waghsanket commented 2 months ago

I have converted it, but I cannot upload any file due to restrictions. :(

omer-go commented 2 months ago

@waghsanket try the new version and tell me if it works for you.

ZeeshanImperium commented 1 month ago

Having this issue:

Extracted: docProps/custom.xml to D:\sharjeel\demo\DID-1287-Acadia contract blue\docProps_custom.xml.txt
Warning: Failed to parse numbering.xml. Using default numbering schema. Error: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'
Traceback (most recent call last):
  File "D:\sharjeel\demo\py-docx.py", line 13, in <module>
    converter = DocxToHtmlConverter(docx_file_content, use_default_values=True)
  File "D:\sharjeel\demo.venv\Lib\site-packages\docx_parser_converter\docx_to_html\docx_to_html_converter.py", line 29, in __init__
    self.document_schema, self.styles_schema, self.numbering_schema = DocxProcessor.process_docx(docx_file)
  File "D:\sharjeel\demo.venv\Lib\site-packages\docx_parser_converter\docx_to_html\docx_processor.py", line 61, in process_docx
    style_merger = StyleMerger(document_schema, styles_schema, numbering_schema)
  File "D:\sharjeel\demo.venv\Lib\site-packages\docx_parser_converter\docx_parsers\styles\styles_merger.py", line 49, in __init__
    self.merge_styles()
  File "D:\sharjeel\demo.venv\Lib\site-packages\docx_parser_converter\docx_parsers\styles\styles_merger.py", line 93, in merge_styles
    self.merge_paragraph_styles(element)
  File "D:\sharjeel\demo.venv\Lib\site-packages\docx_parser_converter\docx_parsers\styles\styles_merger.py", line 108, in merge_paragraph_styles
    self.apply_numbering_properties(paragraph)
  File "D:\sharjeel\demo.venv\Lib\site-packages\docx_parser_converter\docx_parsers\styles\styles_merger.py", line 135, in apply_numbering_properties
    numbering_instance = next((instance for instance in self.numbering_schema.instances if instance.numId == num_id), None)
AttributeError: 'NoneType' object has no attribute 'instances'

### Code:

from docx_parser_converter.docx_to_html.docx_to_html_converter import (
    DocxToHtmlConverter,
)
from docx_parser_converter.docx_parsers.helpers.docx_xml_list import extract_docx_xml
from docx_parser_converter.docx_parsers.utils import read_binary_from_file_path

docx_path = r"D:\sharjeel\demo\Hello.docx"
html_output_path = r"D:\sharjeel\demo\py.html"
extract_docx_xml(docx_path)
docx_file_content = read_binary_from_file_path(docx_path)

converter = DocxToHtmlConverter(docx_file_content, use_default_values=True)
html_output = converter.convert_to_html()
converter.save_html_to_file(html_output, html_output_path)

### Version: docx-parser-converter==0.5.1.2

omer-go commented 1 month ago

@ZeeshanImperium - look at my reply here. Please follow the instructions and copy and paste the contents of the numbering.xml file to this thread so I can see its structure.

ZeeshanImperium commented 1 month ago

It makes this file: word_numbering.xml.txt

Extracted: word/numbering.xml to D:\sharjeel\demo\msasd\word_numbering.xml.txt

(print output of the extract function ^)

word_numbering.xml.txt content: <?xml version="1.0" ?>

omer-go commented 1 month ago

@ZeeshanImperium thanks for uploading the content. I also downloaded the WPS suite to create and debug a WPS-created docx document myself. What's happening here is that WPS is creating "invisible" numbered paragraphs with no level information, which does not conform to the numPr schema. The object that is created lacks some information that the style merging class tries to use and apply, so it fails. I will work on patching this when I get a chance and release an updated version. I also noticed that tables are not converted properly, so I will try to fix that as well.
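
One way to tolerate paragraphs like these is to read the numbering properties defensively. The sketch below is an illustration using the standard library's ElementTree, not the library's parser. The function names are hypothetical, and treating a missing w:ilvl as level 0 is an assumption made for the example.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Tuple

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def _int_val(element: Optional[ET.Element]) -> Optional[int]:
    """Read the w:val attribute as an int; None if the element or value is absent or malformed."""
    if element is None:
        return None
    try:
        return int(element.get(f"{{{W}}}val"))
    except (TypeError, ValueError):
        return None


def parse_num_pr(ppr: Optional[ET.Element]) -> Optional[Tuple[int, int]]:
    """Return (numId, ilvl) for a paragraph's w:pPr, or None if it is not usably numbered."""
    if ppr is None:
        return None
    num_pr = ppr.find("w:numPr", NS)
    if num_pr is None:
        return None
    num_id = _int_val(num_pr.find("w:numId", NS))
    if num_id is None:
        return None
    ilvl = _int_val(num_pr.find("w:ilvl", NS))
    # Assumption for this sketch: a missing level is treated as the top level (0).
    return (num_id, 0 if ilvl is None else ilvl)


# A paragraph-properties element with a numId but no w:ilvl child.
ppr = ET.fromstring(
    f'<w:pPr xmlns:w="{W}"><w:numPr><w:numId w:val="3"/></w:numPr></w:pPr>'
)
print(parse_num_pr(ppr))  # prints: (3, 0)
```

Returning `None` for unusable numbering lets the style merger skip the paragraph instead of calling `int()` on a missing value.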

ZeeshanImperium commented 1 month ago

@omer-go Thanks for your quick replies. I am converting an MS Word-created docx file to HTML. The code worked on a simple docx file but not with complex files that have images, headers, etc. The other thing is that the code converts all the text to <p> tags, whereas I want the headings to be in heading tags and lists to be in ordered and unordered lists in HTML, respectively. It also doesn't pick background color. Overall this is the best and quickest converter. Kudos to you dude. Here is the simple Word file for which the code worked:

Screenshot 2024-10-17 130137

and the converted html:

<html>

<body>
    <div
        style="padding-top:72.0pt; padding-right:72.0pt; padding-bottom:72.0pt; padding-left:72.0pt; padding-top:36.0pt; padding-bottom:36.0pt;">
        <p style="margin-bottom:8.0pt;line-height:12.95pt;margin-left:36.0pt;text-indent:-18.0pt;"><span
                style="font-family:Symbol;"></span><span style="padding-left:7.2pt;"></span><span
                style="font-weight:bold;font-style:italic;text-decoration:underline;font-family:ADLaM Display;font-size:24.0pt;">Zeeshan</span>
        </p>
        <p style="margin-bottom:8.0pt;line-height:12.95pt;margin-left:36.0pt;text-indent:-18.0pt;"><span
                style="font-family:Symbol;"></span><span style="padding-left:7.2pt;"></span><span
                style="font-weight:bold;font-style:italic;text-decoration:underline;font-family:ADLaM Display;font-size:24.0pt;">Ahmed
            </span></p>
        <p style="margin-bottom:8.0pt;line-height:12.95pt;margin-left:36.0pt;text-indent:-18.0pt;"><span
                style="font-family:Symbol;"></span><span style="padding-left:7.2pt;"></span><span
                style="font-weight:bold;font-style:italic;text-decoration:underline;font-family:ADLaM Display;font-size:24.0pt;">Hamza</span>
        </p>
        <p style="margin-bottom:8.0pt;line-height:12.95pt;margin-left:36.0pt;text-indent:-18.0pt;">1.<span
                style="padding-left:7.2pt;"></span><span style="color:C1E4F5;font-size:11.0pt;">Zeeshan</span></p>
        <p style="margin-bottom:8.0pt;line-height:12.95pt;margin-left:36.0pt;text-indent:-18.0pt;">2.<span
                style="padding-left:7.2pt;"></span><span style="color:C1E4F5;font-size:11.0pt;">Ahmed </span></p>
        <p style="margin-bottom:8.0pt;line-height:12.95pt;margin-left:36.0pt;text-indent:-18.0pt;">3.<span
                style="padding-left:7.2pt;"></span><span style="color:C1E4F5;font-size:11.0pt;">Hamza</span></p>
        <p style="margin-bottom:8.0pt;line-height:12.95pt;margin-left:36.0pt;text-indent:-18.0pt;">a.<span
                style="padding-left:7.2pt;"></span><span style="color:FF0000;font-size:11.0pt;">Zeeshan</span></p>
        <p style="margin-bottom:8.0pt;line-height:12.95pt;margin-left:36.0pt;text-indent:-18.0pt;">b.<span
                style="padding-left:7.2pt;"></span><span style="color:FF0000;font-size:11.0pt;">Ahmed </span></p>
        <p style="margin-bottom:8.0pt;line-height:12.95pt;margin-left:36.0pt;text-indent:-18.0pt;">c.<span
                style="padding-left:7.2pt;"></span><span style="color:FF0000;font-size:11.0pt;">Hamza</span></p>
        <p style="margin-bottom:8.0pt;line-height:12.95pt;margin-left:36.0pt;"></p>
        <p style="margin-top:18.0pt;margin-bottom:4.0pt;line-height:12.95pt;"><span
                style="font-weight:bold;font-style:italic;text-decoration:underline;color:0F9ED5;font-family:Arial Black;font-size:24.0pt;">Hello->Heading
                1</span>
        </p>
        <p style="margin-bottom:8.0pt;line-height:12.95pt;margin-left:36.0pt;"></p>
        <p style="margin-bottom:8.0pt;line-height:12.95pt;"></p>
        <p style="margin-bottom:8.0pt;line-height:12.95pt;"></p>
    </div>
</body>

</html>
omer-go commented 1 month ago

@ZeeshanImperium - thanks for your note. I'm happy you're finding this project useful.

Let's unpack some of the things you mentioned:

The code worked on a simple docx file but not with complex files that have images, headers, etc.

That's true and it's also stated in the README file:

Unsupported Components

- Images: Parsing and extraction of images embedded within the document.
- Headers and Footers: Parsing of headers and footers content.
- Footnotes and Endnotes: Handling footnotes and endnotes within the document.
- Comments: Extraction and handling of comments.
- Custom XML Parts: Any custom XML parts beyond the standard DOCX schema.

These elements were not required for my current project at the moment, but I might extend it in the future and support it when I get some time. In any case, contributions to the project are welcome if people want to help support it before I get to it.

The other thing is that the code converts all the text to <p> tags, whereas I want the headings to be in heading tags and lists to be in ordered and unordered lists in HTML, respectively.

This is designed to create a WYSIWYG conversion of the docx file. Using ordered and unordered HTML list tags would render the numbering incorrectly. For some use cases it doesn't matter, for others it does. The downside is that a reverse conversion of the rendered HTML back to docx is more challenging, but that's not a use case we are currently concerned about.
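
To illustrate the trade-off: with `<ol>`/`<ul>`, the browser generates the markers itself, so they may not match the markers Word computed. The converted HTML earlier in this thread instead writes the marker as literal text inside a `<p>` with a hanging indent. The sketch below reproduces that pattern. The function name and default values are hypothetical, chosen to resemble the output shown above.

```python
from html import escape


def render_numbered_paragraph(
    marker: str, text: str, left_pt: float = 36.0, hanging_pt: float = 18.0
) -> str:
    """Render a list item as a <p> with a literal marker and a hanging indent."""
    style = f"margin-left:{left_pt}pt;text-indent:-{hanging_pt}pt;"
    return (
        f'<p style="{style}">{escape(marker)}'
        f'<span style="padding-left:7.2pt;"></span>'
        f"<span>{escape(text)}</span></p>"
    )


print(render_numbered_paragraph("a.", "Zeeshan"))
```

Because the marker is plain text, it appears exactly as supplied, at the cost of the list semantics that `<ol>` and `<ul>` would provide.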

It also doesn't pick background color

You're right, that's a styling component that should be supported, I'll add it to the todo list.