jgm / pandoc

Universal markup converter
https://pandoc.org

docx: corrupted file when inserting an image #532

Closed amadawn closed 11 years ago

amadawn commented 12 years ago

Sometimes when I convert a markdown file into docx and I try to insert an image, the generated docx file is "corrupted".

This does not happen for every image, but I have some example images for which it always happens.

When I open the generated docx file in Word 2007 I get 2 error messages. First I get a dialog box that says:

"The document test.docx cannot be opened because there are problems with its contents"

The dialog has a "message details" button that, when clicked, shows the following information:

"The file is damaged and cannot be opened"

Once I click OK, the dialog is dismissed and a second error dialog box appears which says:

"Word found unreadable content in test.docx. Do you want to recover the content of this document? If you trust the origin of this document click Yes."

Note that I've roughly translated these messages from Spanish, which is the language of the version of Word 2007 that I've tested this with.

Once I click "Yes" the Word document opens fine. The image is there. However the formatting is not very good:

  • The image is much smaller than it should be (for comparison, a PDF generated from the same markdown source document contains the image at its right size and with proper formatting)
  • The image and its caption are not centered (as they are in the PDF)
  • The caption does not say "Figure 1".

jgm commented 12 years ago

What version of pandoc? This sounds like an old bug that has been fixed in recent versions.

amadawn commented 12 years ago

I used the latest version (1.9.4) which I downloaded today, but it also happens with 1.9.3.

I'm on Windows 2007, using Word 2007.

Cheers,

Angel

jgm commented 12 years ago

This seems related to #414, which I've now reopened.

jgm commented 12 years ago

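As for your other issues (aside from the corrupt file):

  • The size of the image is determined by its size in pixels and the DPI setting encoded in the image. For example, if it's a 300px wide image and the DPI is 300, it will only be 1in wide. Try changing the DPI -- this is encoded in the image itself, and should be something you can change with the right image editor.
  • Images aren't centered in docx, currently (or in HTML for that matter).
  • You only get "Figure 1." in LaTeX -- it is added automatically by latex. To get this consistently in all output formats, we'd need to extend pandoc quite a bit, including internationalization.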

jgm commented 12 years ago

I can't reproduce the problem with Office 2011. Maybe you could attach a minimal test file (and image) that causes the problem. Are you using a custom template, by any chance?

amadawn commented 12 years ago

On Jun 4, 2012 6:37 PM, "John MacFarlane" <reply@reply.github.com> wrote:

> As for your other issues (aside from the corrupt file):

I incorrectly assumed these were due to the broken file. Sorry!

> • The size of the image is determined by its size in pixels and the DPI setting encoded in the image. For example, if it's a 300px wide image and the DPI is 300, it will only be 1in wide. Try changing the DPI -- this is encoded in the image itself, and should be something you can change with the right image editor.

I guess the PDF looks fine because LaTeX properly formats the figure, right? How does LaTeX decide the size of the image? Maybe pandoc could do the same?

I've never explicitly changed the DPI of an image. Normally I just resize it to the size I want (in pixels).

Being able to specify the image size as in MultiMarkdown would be nice.

> • Images aren't centered in docx, currently (or in HTML for that matter).

Any plans to implement that? I think it is much more common to center images than to leave them aligned to the left, which does not look very good.

For docx documents you could set the style of both the image and the caption to 'caption', which would let the user customize how the generated document looks quite easily.

> • You only get "Figure 1." in LaTeX -- it is added automatically by latex. To get this consistently in all output formats, we'd need to extend pandoc quite a bit, including internationalization.

I don't think adding internationalization support would be necessary. Pandoc could use English labels by default, while letting the user configure those labels via some command line options or some pandoc config file.

I think that should be good enough for most users. Of course that does not mean that it would be easy to implement, but perhaps a little less hard?

Angel

ryangray commented 12 years ago

Perhaps a Pandoc option to not use figure numbering in LaTeX would be useful? Since I like to output to multiple formats, this would be a way to make them the same. I checked, and the recommended way is the caption package which does not seem to be in my MikTeX distro. Another way is to simply have pandoc not put the caption text inside \caption{} but still inside the figure environment when using such an option.

mickmcq commented 11 years ago

I believe I have experienced the same issue and that it is a problem in the most recent version of pandoc.

When I add a pdf to a markdown document and convert to docx, I observe the same behavior no matter the pixel density settings or edit of headers or anything else suggested here.

Given the example file

http://www-personal.umich.edu/~mcq/test-creation.md

into which I include

http://www-personal.umich.edu/~mcq/testing3.pdf

I see

http://www-personal.umich.edu/~mcq/firsterrordisplay.pdf

followed by

http://www-personal.umich.edu/~mcq/seconderrordisplay.pdf

and am given the opportunity to open a "recovered" version of the document, which has this appearance:

http://www-personal.umich.edu/~mcq/appearanceofresult.pdf

no matter whether I use pdf or tiff and with the same dimensions no matter the changes I make using ImageMagick or Apple Preview.

I manually edited the document and found that I could change

cx="1572000"

to

cx="4572000"

in two places in document.xml, and see a "stretched" version after viewing the same error messages as above. The docx file where I did that is at

http://www-personal.umich.edu/~mcq/testingpdfissue.docx

Also, I manually resized the image in that file and saved to the following file.

http://www-personal.umich.edu/~mcq/afterrecoveryplussave.docx

I tried to diff the copies of word/document.xml in these two docx files and was overwhelmed by the number of differences. I had hoped I could manually change the cx and cy values in the output of pandoc, but I'm not sure how to proceed with so many differences.
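For anyone who wants to script the manual cx/cy edit described above, here is a minimal sketch in Python. It assumes the image extents appear in word/document.xml as plain `cx="..."` and `cy="..."` integer attributes, as in the example above. The function names `scale_extents` and `rewrite_docx` are hypothetical helpers, not part of pandoc. It scales every cx/cy pair by the same factor, so the aspect ratio is kept, unlike the "stretched" result of changing cx alone. It only changes sizes and does nothing about the error dialogs.

```python
import re
import zipfile

# Matches cx="123" or cy="123" attributes whose values are plain integers.
_EXTENT_ATTR = re.compile(r'\b(cx|cy)="(\d+)"')


def scale_extents(document_xml: str, factor: float) -> str:
    """Multiply every cx/cy attribute value in the XML text by `factor`."""
    def _scale(match):
        name, value = match.group(1), int(match.group(2))
        return f'{name}="{round(value * factor)}"'
    return _EXTENT_ATTR.sub(_scale, document_xml)


def rewrite_docx(src, dst, factor: float) -> None:
    """Copy the docx at `src` to `dst`, scaling extents in word/document.xml.

    `src` and `dst` may be file paths or binary file-like objects.
    All other members of the zip container are copied unchanged.
    """
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == "word/document.xml":
                text = data.decode("utf-8")
                data = scale_extents(text, factor).encode("utf-8")
            zout.writestr(item, data)
```

Note that this touches every cx/cy in the document, not just one image. A document with several images of different sizes would need a more selective match.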

By the way, this never happens with jpg, png, or gif images and, if the quality of those images were higher, I would just use those. The problem there is that any jpg, png, or gif small enough to fit on the page upon import has an extremely low resolution. I can manually add a sharp png, jpg, or gif, but then I must go through a laborious click and drag process to resize it. I can't seem to identify a way to automate that process.

As it is, I can not create a file with a lot of images and expect to be able to export it to docx after each edit. It would greatly simplify my required export to docx if this were addressed or a workaround suggested!

jgm commented 11 years ago

@mickmcq - what pandoc version, what OS, how was pandoc installed, what version of Word?

jgm commented 11 years ago

@mickmcq - sorry, I see you specified everything except the Word version in the markdown file.

jgm commented 11 years ago

@mickmcq - I was able to reproduce this on my Mac, with Word 2011.

mickmcq commented 11 years ago

I have Word for Mac 2011 Version 14.3.1 (130117)

It may be worth mentioning that this is on a new MacBook Air with a managed software setup, running Mac OS X 10.8.2, but previously I did it on my old MacBook Air running 10.7.5 with an install using the cabal method you describe in the All Platforms section at http://johnmacfarlane.net/pandoc/installing.html.

Thanks for looking into this! For the most part, Pandoc has rescued me from the tyranny of high officials insisting on many last-minute edits to documents in Microsoft Word format. Illustrations are the only remaining obstacle.


jgm commented 11 years ago

Looking inside the file Word saves after it rescues the corrupted file, I see one major difference. The image file saved by Word is not a PDF. It is media/image1.emf (enhanced Windows metafile format). Looking inside the file, it looks as if it contains parts that are the same as in the PDF, but it differs at least in the header. So perhaps Word cannot handle PDFs without some kind of special treatment/conversion to another format? It also contains

<Default Extension="emf" ContentType="image/x-emf"/>

in the [Content_Types].xml file. (Note: I tried adding a default for PDFs to the reference.docx, which previously did not contain one, but this didn't help.)

You could try converting your PDF to EMF using one of several programs that appear on Google for that task. Then use the emf instead of the pdf for the image source in your markdown file. You'll probably also have to add the Default tag mentioned above to the [Content_Types].xml file in the docx container. I'd be curious if that works.
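For the Default tag step, a sketch of how the entry could be added with a script rather than by hand. It assumes the file uses the standard Open Packaging Conventions content-types namespace. The helper name `ensure_default` is hypothetical. It covers only the content-types part of the suggestion. Whether that is enough for Word to accept the image is not established here.

```python
import xml.etree.ElementTree as ET

# Namespace of [Content_Types].xml in Open Packaging Conventions packages.
CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"


def ensure_default(content_types_xml: str, extension: str,
                   content_type: str) -> str:
    """Return the XML with a <Default> entry for `extension`.

    If an entry for that extension already exists (compared without
    regard to case), the input is returned unchanged.
    """
    ET.register_namespace("", CT_NS)  # serialize without an ns0: prefix
    root = ET.fromstring(content_types_xml)
    for el in root.findall(f"{{{CT_NS}}}Default"):
        if el.get("Extension", "").lower() == extension.lower():
            return content_types_xml
    new = ET.Element(
        f"{{{CT_NS}}}Default",
        {"Extension": extension, "ContentType": content_type},
    )
    root.insert(0, new)
    return ET.tostring(root, encoding="unicode")
```

Calling `ensure_default(xml_text, "emf", "image/x-emf")` would add the tag shown above if it is missing. The result still has to be written back into the docx zip container.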

If that works, a further step would be to figure out what is involved in converting a PDF to EMF. If it is a fairly simple process, maybe pandoc could be taught to do it automatically.

jgm commented 11 years ago

In github master, I've added the Default tag mentioned above to the reference.docx. No luck switching the pdf for an emf, though.

jgm commented 11 years ago

For jpg images, you should be able to solve your size problems by adding DPI information to the image file. (This is something a good image editor should allow you to do; it can probably also be done programmatically with ImageMagick or comparable tools.)

If you want high resolution, say, 600x600 dpi, you could use a 2400x2400 pixel image and specify 600 dpi; this should produce a 4 inch square image in Word with high resolution.
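The arithmetic behind this can be sketched in a few lines of Python. It assumes the displayed size is simply pixels divided by DPI, as described above. The function names are hypothetical. The conversion to EMUs (914,400 per inch) is the unit that the cx/cy values in word/document.xml use.

```python
EMU_PER_INCH = 914400  # English Metric Units per inch in Office Open XML


def physical_size(width_px: int, height_px: int, dpi: float):
    """Return (width_in, height_in) for a pixel size at the given DPI."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return width_px / dpi, height_px / dpi


def to_emu(inches: float) -> int:
    """Convert inches to EMUs, rounded to the nearest integer."""
    return round(inches * EMU_PER_INCH)


# The example above: 2400x2400 pixels at 600 dpi is 4 x 4 inches.
width_in, height_in = physical_size(2400, 2400, 600)
print(width_in, height_in, to_emu(width_in))
```

By the same conversion, a cx value of 4572000 corresponds to 5 inches.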

jgm commented 11 years ago

OK, I think the emf issue was a red herring.

With commit cae409725fa0a71dd3f9ce2ea2185b050f5d1cb0, I can convert your sample markdown file to a docx without file corruption. I'm not sure the image size is what you desire, though. I still don't have code to determine the image size of a PDF, but presumably this could be written.

mickmcq commented 11 years ago

I would just add that a slight variation of jgm's method for improving jpg images works. The exact method given above did not work, though. I tried that method as

convert -size 2046x952 -density 600 image.pdf image.jpg

which produced an enormous image when starting with a 1023x475 image. What worked was to say

convert -resize 40% -density 600 image.pdf image.jpg

This is using the convert in the ImageMagick package. I tried this with a number of files and am sure that specifying the higher density is adequate for my purposes. The resulting docx file contains clear images! Thanks!