mwilliamson / python-mammoth

Convert Word documents (.docx files) to HTML
BSD 2-Clause "Simplified" License

AttributeError: 'NoneType' object has no attribute 'children'. #PYTHON #131

Closed arjun251 closed 1 year ago

arjun251 commented 1 year ago

Could you please help me with this issue?

My code -

    style_map = """
    p[style-name='Section Title'] => h1:fresh
    p[style-name='Subsection Title'] => h2:fresh
    """
    docx_file = ".../test.doc"
    html = mammoth.convert_to_markdown(docx_file, style_map=style_map)

Logs -

    AttributeError                            Traceback (most recent call last)
    in
          7 # style_map = "strike => del"
          8 docx_file = "..../test.doc"
    ----> 9 html = mammoth.convert_to_markdown(docx_file, style_map=style_map)

    /databricks/python/lib/python3.8/site-packages/mammoth/__init__.py in convert_to_markdown(*args, **kwargs)
         14
         15 def convert_to_markdown(*args, **kwargs):
    ---> 16     return convert(*args, output_format="markdown", **kwargs)
         17
         18

    /databricks/python/lib/python3.8/site-packages/mammoth/__init__.py in convert(fileobj, transform_document, id_prefix, include_embedded_style_map, **kwargs)
         24     if include_embedded_style_map:
         25         kwargs["embedded_style_map"] = read_style_map(fileobj)
    ---> 26     return options.read_options(kwargs).bind(lambda convert_options:
         27         docx.read(fileobj).map(transform_document).bind(lambda document:
         28             conversion.convert_document_element_to_html(

    /databricks/python/lib/python3.8/site-packages/mammoth/results.py in bind(self, func)
         13
         14     def bind(self, func):
    ---> 15         result = func(self.value)
         16         return Result(result.value, self.messages + result.messages)
         17

    /databricks/python/lib/python3.8/site-packages/mammoth/__init__.py in <lambda>(convert_options)
         25         kwargs["embedded_style_map"] = read_style_map(fileobj)
         26     return options.read_options(kwargs).bind(lambda convert_options:
    ---> 27         docx.read(fileobj).map(transform_document).bind(lambda document:
         28             conversion.convert_document_element_to_html(
         29                 document,

    /databricks/python/lib/python3.8/site-packages/mammoth/docx/__init__.py in read(fileobj)
         29     )
         30
    ---> 31     return results.combine([
         32         _read_notes(read_part_with_body, part_paths),
         33         _read_comments(read_part_with_body, part_paths),

    /databricks/python/lib/python3.8/site-packages/mammoth/results.py in bind(self, func)
         13
         14     def bind(self, func):
    ---> 15         result = func(self.value)
         16         return Result(result.value, self.messages + result.messages)
         17

    /databricks/python/lib/python3.8/site-packages/mammoth/docx/__init__.py in <lambda>(referents)
         33         _read_comments(read_part_with_body, part_paths),
         34     ]).bind(lambda referents:
    ---> 35         _read_document(zip_file, read_part_with_body, notes=referents[0], comments=referents[1], part_paths=part_paths)
         36     )
         37

    /databricks/python/lib/python3.8/site-packages/mammoth/docx/__init__.py in _read_document(zip_file, read_part_with_body, notes, comments, part_paths)
        125
        126 def _read_document(zip_file, read_part_with_body, notes, comments, part_paths):
    --> 127     return read_part_with_body(
        128         part_paths.main_document,
        129         partial(

    /databricks/python/lib/python3.8/site-packages/mammoth/docx/__init__.py in read_part(name, reader, default)
        170
        171         if default is _undefined:
    --> 172             return _read_entry(zip_file, name, partial(reader, body_reader=body_reader))
        173         else:
        174             return _try_read_entry_or_default(zip_file, name, partial(reader, body_reader=body_reader), default=default)

    /databricks/python/lib/python3.8/site-packages/mammoth/docx/__init__.py in _read_entry(zip_file, name, reader)
        200 def _read_entry(zip_file, name, reader):
        201     with zip_file.open(name) as fileobj:
    --> 202         return reader(office_xml.read(fileobj))
        203
        204

    /databricks/python/lib/python3.8/site-packages/mammoth/docx/document_xml.py in read_document_xml_element(element, body_reader, notes, comments)
         14
         15     body_element = element.find_child("w:body")
    ---> 16     return body_reader.read_all(body_element.children) \
         17         .map(lambda children: documents.document(
         18             children,

    AttributeError: 'NoneType' object has no attribute 'children'

mwilliamson commented 1 year ago

From your code, it looks as though you're trying to convert a .doc document rather than a .docx document. If that's the case, then I'm afraid Mammoth only supports reading .docx documents. If it is a .docx document, please provide a minimal example so that the issue can be reproduced.
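
A quick pre-flight check can help tell the two formats apart before calling Mammoth. The sketch below is only a heuristic and is not part of Mammoth: the helper name `looks_like_docx` is hypothetical, and it assumes that a .docx file is a ZIP archive containing a `[Content_Types].xml` entry, whereas a legacy binary .doc file is not a ZIP archive at all. It will not catch every problem (for example, a ZIP package whose main document has no `w:body` element would still pass).

```python
import zipfile


def looks_like_docx(source):
    """Heuristically check whether `source` could be a .docx package.

    `source` may be a filesystem path or a binary file object.
    This is a hypothetical helper, not a Mammoth API.
    """
    # Legacy .doc files use a different binary container, so they fail here.
    if not zipfile.is_zipfile(source):
        return False
    # Assumption: every Office Open XML package lists its parts in
    # "[Content_Types].xml" at the root of the archive.
    with zipfile.ZipFile(source) as archive:
        return "[Content_Types].xml" in archive.namelist()
```

If the check returns False, the file would need to be saved as .docx (for example from Word) before Mammoth can read it.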

arjun251 commented 1 year ago

@mwilliamson Thanks for your response. I have another question. My .docx file contains some tables and images, but when I convert it to PDF, the PDF does not have the same structure as the .docx.

.docx -> html -> pdf

Could you please help me with a sample for working with tables and images?

    style_map = """
    p[style-name='Section Title'] => h1:fresh
    p[style-name='Subsection Title'] => h2:fresh
    """
    docx_file = ".../test.docx"
    html = mammoth.convert_to_markdown(docx_file, style_map=style_map)

Thanks, Arjun S

mwilliamson commented 1 year ago

Could you post a minimal example document, the HTML you're expecting, and the HTML you're currently getting?
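
For the .docx -> html -> pdf pipeline described above, a rough sketch of an HTML conversion that keeps images inline might look like the following. It uses `mammoth.convert_to_html` rather than `convert_to_markdown`, since the intermediate step is HTML, and passes an image converter built with `mammoth.images.img_element`. The helper names `image_to_data_uri` and `docx_to_html` are hypothetical, and the sketch assumes the image object passed by Mammoth exposes `open()` and `content_type`. It does not address how a PDF tool will lay out the resulting tables.

```python
import base64


def image_to_data_uri(image):
    """Build <img> attributes that embed the image bytes as a data URI."""
    with image.open() as image_bytes:
        encoded = base64.b64encode(image_bytes.read()).decode("ascii")
    return {"src": "data:{0};base64,{1}".format(image.content_type, encoded)}


def docx_to_html(path, style_map=None):
    """Convert the .docx file at `path` to an HTML string (hypothetical helper)."""
    # Imported here so image_to_data_uri can be tried without mammoth installed.
    import mammoth

    with open(path, "rb") as docx_file:
        result = mammoth.convert_to_html(
            docx_file,
            style_map=style_map,
            convert_image=mammoth.images.img_element(image_to_data_uri),
        )
    # result.messages holds any warnings raised during conversion,
    # which may be worth inspecting when the output looks wrong.
    for message in result.messages:
        print(message)
    return result.value
```

Comparing the returned HTML with the expected HTML for a small test document is one way to produce the minimal example requested here.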

mwilliamson commented 1 year ago

Closing since the original issue has been addressed.