mwilliamson / python-mammoth

Convert Word documents (.docx files) to HTML
BSD 2-Clause "Simplified" License

not working for docx to html conversion #2

Closed surjit closed 9 years ago

surjit commented 9 years ago

mammoth sample-04.docx my.html Unsupported break type: page An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:instrText An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:instrText An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:instrText An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:instrText An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:instrText An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:instrText An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:instrText An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:instrText An unrecognised element was ignored: w:instrText An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:instrText An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:instrText An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: 
w:fldChar An unrecognised element was ignored: w:instrText An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:instrText An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:instrText An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:instrText An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:instrText An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:instrText An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:instrText An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:instrText An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:instrText An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:instrText An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:fldChar An unrecognised element was ignored: w:tblPrEx An unrecognised element was ignored: 
w:trPr An unrecognised element was ignored: w:tblPrEx An unrecognised element was ignored: w:tblPrEx An unrecognised element was ignored: w:tblPrEx Unrecognised paragraph style: Legal notice (Style ID: Legalnotice) Unrecognised paragraph style: Title (Style ID: Title) Unrecognised paragraph style: Subtitle (Style ID: Subtitle) Unrecognised paragraph style: Title page info (Style ID: Titlepageinfo) Unrecognised paragraph style: Title page info description (Style ID: Titlepageinfodescription) Unrecognised paragraph style: Title page info (Style ID: Titlepageinfo) Unrecognised paragraph style: Title page info description (Style ID: Titlepageinfodescription) Unrecognised run style: Hyperlink (Style ID: Hyperlink) Unrecognised paragraph style: Title page info (Style ID: Titlepageinfo) Unrecognised paragraph style: Contributor (Style ID: Contributor) Unrecognised paragraph style: Contributor (Style ID: Contributor) Unrecognised run style: Hyperlink (Style ID: Hyperlink) Unrecognised paragraph style: Title page info (Style ID: Titlepageinfo) Unrecognised paragraph style: Contributor (Style ID: Contributor) Unrecognised paragraph style: Contributor (Style ID: Contributor) Unrecognised paragraph style: Contributor (Style ID: Contributor) Unrecognised paragraph style: Contributor (Style ID: Contributor) Unrecognised paragraph style: Title page info (Style ID: Titlepageinfo) Unrecognised paragraph style: Title page info description (Style ID: Titlepageinfodescription) Unrecognised paragraph style: Title page info (Style ID: Titlepageinfo) Unrecognised paragraph style: Title page info description (Style ID: Titlepageinfodescription) Unrecognised paragraph style: Title page info description (Style ID: Titlepageinfodescription) Unrecognised run style: Hyperlink (Style ID: Hyperlink) Unrecognised run style: Hyperlink (Style ID: Hyperlink) Unrecognised run style: Hyperlink (Style ID: Hyperlink) Unrecognised paragraph style: Title page info description (Style ID: 
Titlepageinfodescription) Unrecognised run style: Hyperlink (Style ID: Hyperlink) Unrecognised paragraph style: Title page info description (Style ID: Titlepageinfodescription) Unrecognised run style: Hyperlink (Style ID: Hyperlink) Unrecognised paragraph style: Subtitle (Style ID: Subtitle) Unrecognised paragraph style: toc 1 (Style ID: TOC1) Unrecognised run style: Hyperlink (Style ID: Hyperlink) Unrecognised paragraph style: toc 2 (Style ID: TOC2) Unrecognised run style: Hyperlink (Style ID: Hyperlink) Unrecognised paragraph style: toc 1 (Style ID: TOC1) Unrecognised run style: Hyperlink (Style ID: Hyperlink) Unrecognised run style: Hyperlink (Style ID: Hyperlink) Unrecognised paragraph style: toc 2 (Style ID: TOC2) Unrecognised run style: Hyperlink (Style ID: Hyperlink) Unrecognised run style: Hyperlink (Style ID: Hyperlink) Unrecognised paragraph style: toc 2 (Style ID: TOC2) Unrecognised run style: Hyperlink (Style ID: Hyperlink) Unrecognised paragraph style: toc 2 (Style ID: TOC2) Unrecognised run style: Hyperlink (Style ID: Hyperlink) Unrecognised paragraph style: toc 2 (Style ID: TOC2) Unrecognised run style: Hyperlink (Style ID: Hyperlink) Unrecognised paragraph style: toc 2 (Style ID: TOC2) Unrecognised run style: Hyperlink (Style ID: Hyperlink) Unrecognised paragraph style: toc 2 (Style ID: TOC2) Unrecognised run style: Hyperlink (Style ID: Hyperlink) Unrecognised paragraph style: toc 2 (Style ID: TOC2) Unrecognised run style: Hyperlink (Style ID: Hyperlink) Unrecognised paragraph style: toc 2 (Style ID: TOC2) Unrecognised run style: Hyperlink (Style ID: Hyperlink) Unrecognised paragraph style: toc 1 (Style ID: TOC1) Unrecognised run style: Hyperlink (Style ID: Hyperlink) Unrecognised run style: Hyperlink (Style ID: Hyperlink) Unrecognised paragraph style: toc 2 (Style ID: TOC2) Unrecognised run style: Hyperlink (Style ID: Hyperlink) Unrecognised paragraph style: toc 1 (Style ID: TOC1) Unrecognised run style: Hyperlink (Style ID: Hyperlink) Unrecognised 
paragraph style: toc 1 (Style ID: TOC1) Unrecognised run style: Hyperlink (Style ID: Hyperlink) Unrecognised paragraph style: toc 1 (Style ID: TOC1) Unrecognised run style: Hyperlink (Style ID: Hyperlink) Unrecognised paragraph style: Legal notice (Style ID: Legalnotice) Unrecognised run style: Ref term (Style ID: Refterm) Unrecognised paragraph style: Definition Term (Style ID: DefinitionTerm0) Unrecognised paragraph style: Definition (Style ID: Definition) Unrecognised paragraph style: List Bullet (Style ID: ListBullet) Unrecognised paragraph style: List Continue (Style ID: ListContinue) Unrecognised paragraph style: List Bullet 2 (Style ID: ListBullet2) Unrecognised paragraph style: List Continue 2 (Style ID: ListContinue2) Unrecognised run style: Ref term (Style ID: Refterm) Unrecognised paragraph style: Code (Style ID: Code) Unrecognised paragraph style: Code (Style ID: Code) Unrecognised paragraph style: Code (Style ID: Code) Unrecognised paragraph style: Code (Style ID: Code) Unrecognised paragraph style: Code (Style ID: Code) Unrecognised paragraph style: Code (Style ID: Code) Unrecognised paragraph style: Code (Style ID: Code) Unrecognised paragraph style: Code (Style ID: Code) Unrecognised paragraph style: Code (Style ID: Code) Unrecognised paragraph style: Code small (Style ID: Codesmall) Unrecognised paragraph style: Code small (Style ID: Codesmall) Unrecognised paragraph style: Code small (Style ID: Codesmall) Unrecognised paragraph style: Code small (Style ID: Codesmall) Unrecognised paragraph style: Code small (Style ID: Codesmall) Unrecognised paragraph style: Code small (Style ID: Codesmall) Unrecognised paragraph style: Code small (Style ID: Codesmall) Unrecognised paragraph style: Code small (Style ID: Codesmall) Unrecognised paragraph style: Code small (Style ID: Codesmall) Unrecognised paragraph style: Example (Style ID: Example) Unrecognised paragraph style: Example (Style ID: Example) Unrecognised paragraph style: Example small (Style ID: 
Examplesmall) Unrecognised paragraph style: Example small (Style ID: Examplesmall) Unrecognised paragraph style: List Bullet (Style ID: ListBullet) Unrecognised run style: Element (Style ID: Element) Unrecognised run style: Element (Style ID: Element) Unrecognised paragraph style: List Bullet (Style ID: ListBullet) Unrecognised run style: Attribute (Style ID: Attribute) Unrecognised paragraph style: List Bullet (Style ID: ListBullet) Unrecognised run style: Datatype (Style ID: Datatype) Unrecognised paragraph style: List Bullet (Style ID: ListBullet) Unrecognised run style: Keyword (Style ID: Keyword) Unrecognised run style: Keyword (Style ID: Keyword) Unrecognised paragraph style: List Bullet (Style ID: ListBullet) Unrecognised run style: Variable (Style ID: Variable) Unrecognised paragraph style: Ref (Style ID: Ref) Unrecognised run style: Ref term (Style ID: Refterm) Unrecognised run style: Hyperlink (Style ID: Hyperlink) Unrecognised run style: Hyperlink (Style ID: Hyperlink) Unrecognised run style: Hyperlink (Style ID: Hyperlink) Unrecognised run style: Hyperlink (Style ID: Hyperlink) Unrecognised paragraph style: AppendixHeading1 (Style ID: AppendixHeading1) Unrecognised paragraph style: List Bullet (Style ID: ListBullet) Unrecognised paragraph style: List Bullet (Style ID: ListBullet) Unrecognised paragraph style: List Bullet (Style ID: ListBullet) Unrecognised paragraph style: List Bullet (Style ID: ListBullet) Unrecognised paragraph style: List Bullet (Style ID: ListBullet) Unrecognised paragraph style: List Bullet (Style ID: ListBullet) Unrecognised paragraph style: List Bullet (Style ID: ListBullet) Unrecognised paragraph style: List Bullet (Style ID: ListBullet) Unrecognised paragraph style: AppendixHeading1 (Style ID: AppendixHeading1) Unrecognised paragraph style: AppendixHeading1 (Style ID: AppendixHeading1)

root@surjit:/home/rahul# mammoth sample-04.doc my.html
Traceback (most recent call last):
  File "/usr/local/bin/mammoth", line 100, in <module>
    main()
  File "/usr/local/bin/mammoth", line 35, in main
    output_format=args.output_format,
  File "/usr/local/lib/python2.7/dist-packages/mammoth/__init__.py", line 17, in convert
    return docx.read(fileobj).map(transform_document).bind(lambda document:
  File "/usr/local/lib/python2.7/dist-packages/mammoth/docx/__init__.py", line 24, in read
    zip_file = zipfile.ZipFile(fileobj)
  File "/usr/lib/python2.7/zipfile.py", line 770, in __init__
    self._RealGetContents()
  File "/usr/lib/python2.7/zipfile.py", line 811, in _RealGetContents
    raise BadZipfile, "File is not a zip file"
zipfile.BadZipfile: File is not a zip file

mwilliamson commented 9 years ago

The second invocation doesn't work since Mammoth only works on docx files. sample-04.doc is in the older .doc format, which isn't a zip archive, hence the BadZipfile error.
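As a rough illustration of why the .doc file is rejected: a docx file is a zip archive, so a quick pre-check can tell the two apart before conversion is attempted. This is only a sketch using the standard library; `looks_like_docx` is a hypothetical helper name, not part of Mammoth, and it assumes the usual layout in which the main document part is stored at `word/document.xml`.

```python
import zipfile


def looks_like_docx(path):
    """Return True if `path` is a zip archive that contains word/document.xml.

    Hypothetical helper for illustration; not part of Mammoth.
    """
    if not zipfile.is_zipfile(path):
        # Legacy .doc files are not zip archives, so they stop here.
        return False
    with zipfile.ZipFile(path) as archive:
        # Assumption: the main document part lives at the conventional path.
        return "word/document.xml" in archive.namelist()
```

A file that fails this check would need to be re-saved as .docx (for example from Word) before being passed to Mammoth.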

For the first invocation: I can't really say why it's not working without knowing anything about the input file, or what, if anything, is written to the output file.

Does it work if you use a Word document with just a paragraph of text?
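The long list of messages in the report consists of warnings about elements and styles that were skipped, not of conversion failures, and most of them are repeats. A small tally can make such a log easier to read. The sketch below assumes the messages have been saved to a text file with one message per line (the file name `warnings.log` in the usage note is hypothetical); `summarise_warnings` is an illustrative helper, not a Mammoth function.

```python
from collections import Counter


def summarise_warnings(log_text):
    """Count how often each distinct warning line occurs, most frequent first."""
    lines = [line.strip() for line in log_text.splitlines() if line.strip()]
    return Counter(lines).most_common()


if __name__ == "__main__":
    sample = (
        "An unrecognised element was ignored: w:fldChar\n"
        "An unrecognised element was ignored: w:instrText\n"
        "An unrecognised element was ignored: w:fldChar\n"
        "Unrecognised paragraph style: Title (Style ID: Title)\n"
    )
    for message, count in summarise_warnings(sample):
        print(count, message)
```

With a saved log, the same function could be called as `summarise_warnings(open("warnings.log").read())`.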

surjit commented 9 years ago

Let me know your email address and I will send you the docx file.

mwilliamson commented 9 years ago

You can use the address on my GitHub profile, hello@zwobble.org. If you could also provide the expected HTML and the actual HTML that is being generated, that would help to make sure I can reproduce what you're seeing.

mwilliamson commented 9 years ago

Thanks for sending over the file. The file seems to generate HTML successfully. Could you describe what HTML you were expecting?

surjit commented 9 years ago

Why is it not working for me?

surjit commented 9 years ago

I have Python 2 installed. Could you please guide me through the steps to install it?

surjit commented 9 years ago

If possible, please send me the generated HTML.

mwilliamson commented 9 years ago

Why is it not working for me?

Are you saying that the output file is empty? Or missing altogether?

surjit commented 9 years ago

No, it is showing content, but some of it is missing, such as the page numbering and other text in the footer.

On Sat, Mar 28, 2015 at 9:41 PM, Michael Williamson <notifications@github.com> wrote:

Why is it not working for me?

Are you saying that the output file is empty? Or missing altogether?


mwilliamson commented 9 years ago

Mammoth is designed to convert semantically marked up documents into sensible HTML, rather than performing a high-fidelity conversion to represent the original document as closely as possible. For instance, in general, preserving page numbering doesn't make sense in an HTML document, nor is it clear how a footer should be handled.
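To illustrate what converting semantic markup can look like in practice: the "Unrecognised paragraph style" warnings in the report name styles that Mammoth has no mapping for, and Mammoth's documentation describes style maps as the way to say which HTML element a given Word style should become. The sketch below is an assumption-laden example rather than a recommended configuration: the chosen HTML targets are guesses at what the document's author intended, `build_style_map` and `convert` are hypothetical helper names, and the `style_map` argument follows the current Mammoth README, which may differ from the version discussed in this thread.

```python
def build_style_map(mappings):
    """Join (Word style selector, HTML target) pairs into style-map text."""
    return "\n".join("{} => {}".format(src, dst) for src, dst in mappings)


# Illustrative mappings for a few of the styles named in the warnings.
STYLE_MAP = build_style_map([
    ("p[style-name='Title']", "h1:fresh"),
    ("p[style-name='Subtitle']", "h2:fresh"),
    ("r[style-name='Keyword']", "code"),
])


def convert(path, style_map=STYLE_MAP):
    """Convert a docx file to HTML, returning (html, messages)."""
    import mammoth  # third-party; imported lazily so the rest runs without it

    with open(path, "rb") as docx_file:
        result = mammoth.convert_to_html(docx_file, style_map=style_map)
    return result.value, result.messages
```

Styles without a mapping would still produce warnings, and content such as footers and page numbers would remain outside the output regardless of the style map.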

If you have specific suggestions on how things like footers should be handled, then I'd be happy to hear them, although I might not have much time to work on it.

If you are looking to produce HTML that resembles the original as closely as possible, I'd suggest looking for an alternative project since this is a use-case that Mammoth intentionally does not handle. If you just want to display the Word document in a web page, have you considered using Microsoft's online Office document viewer?

mwilliamson commented 9 years ago

Closing since I'm not sure there's anything else I can do to help, but feel free to open issues if you have suggestions on how specific aspects of the conversion should be handled.