Open simondotm opened 5 years ago
The relevant part from Test.docx:
<wp:docPr id="10" name="media/JIcACABwiXP.png"/>
<a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
<a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
<pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
<pic:nvPicPr>
<pic:cNvPr id="0" name="media/JIcACABwiXP.png"/>
<pic:cNvPicPr/>
</pic:nvPicPr>
<pic:blipFill>
<a:blip r:embed="rId10"/>
<a:stretch>
<a:fillRect/>
</a:stretch>
</pic:blipFill>
<pic:spPr>
<a:xfrm>
<a:off x="0" y="0"/>
<a:ext cx="5352176" cy="4219662"/>
</a:xfrm>
<a:prstGeom prst="rect"/>
</pic:spPr>
</pic:pic>
</a:graphicData>
</a:graphic>
Probably related (or the same issues): https://github.com/jgm/pandoc/issues/1810 and https://github.com/jgm/pandoc/issues/5394
Do you know by what tool or Word version the docx was generated?
The tool that generated Test.docx was the Salesforce Quip app. It is entirely possible that their exported docx markup is somehow at fault here, but it did seem like a valid docx, so I thought I'd report it as an issue here.
Test2.docx was generated by Word for Office 365, V16.0 32-bit, simply by opening Test.docx and then "saving as" Test2.docx, with no other modifications to the doc.
The XML for Test2.docx is:
<wp:docPr id="9" name="media/JIcACA7YtNb.png"/>
<wp:cNvGraphicFramePr/>
<a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
<a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
<pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
<pic:nvPicPr>
<pic:cNvPr id="0" name="media/JIcACA7YtNb.png"/>
<pic:cNvPicPr/>
</pic:nvPicPr>
<pic:blipFill>
<a:blip r:embed="rId5"/>
<a:stretch>
<a:fillRect/>
</a:stretch>
</pic:blipFill>
<pic:spPr>
<a:xfrm>
<a:off x="0" y="0"/>
<a:ext cx="5352176" cy="2961313"/>
</a:xfrm>
<a:prstGeom prst="rect">
<a:avLst/>
</a:prstGeom>
</pic:spPr>
</pic:pic>
</a:graphicData>
</a:graphic>
Ah yes, they indeed look similar. The key is probably in the document.xml.rels file, which also contains:
<Relationship Id="rId10" Target="media/JIcACABwiXP.png" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"/>
Btw., the images show up in neither LibreOffice nor Apple Pages...
It's useful to know they don't render properly in other apps, since that's indicative of some type of malformed document, in which case it's definitely an issue for Quip.
Posting up the XML for reference...
Test.docx document.xml.rels:
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Id="rId3" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/settings" Target="settings.xml"/>
<Relationship Id="rId4" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/webSettings" Target="webSettings.xml"/>
<Relationship Id="rId5" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image1.jpg"/>
<Relationship Id="rId6" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/fontTable" Target="fontTable.xml"/>
<Relationship Id="rId7" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme" Target="theme/theme1.xml"/>
<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="styles.xml"/>
<Relationship Id="rId2" Type="http://schemas.microsoft.com/office/2007/relationships/stylesWithEffects" Target="stylesWithEffects.xml"/>
<Relationship Id="rId8" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/numbering" Target="numbering.xml"/>
<Relationship Id="rId9" Target="media/JIcACA7YtNb.png" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"/>
<Relationship Id="rId10" Target="media/JIcACABwiXP.png" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"/>
</Relationships>
Test2.docx document.xml.rels:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Id="rId8" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme" Target="theme/theme1.xml"/>
<Relationship Id="rId3" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/settings" Target="settings.xml"/>
<Relationship Id="rId7" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/fontTable" Target="fontTable.xml"/>
<Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="styles.xml"/>
<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/numbering" Target="numbering.xml"/>
<Relationship Id="rId6" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image2.png"/>
<Relationship Id="rId5" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image1.png"/>
<Relationship Id="rId4" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/webSettings" Target="webSettings.xml"/>
</Relationships>
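For comparing the two listings above, a minimal Python sketch can pull out just the image relationships from a document.xml.rels part. The function name image_relationships is hypothetical, and the sketch assumes the part has already been read from the docx archive (in a typical docx it lives at word/_rels/document.xml.rels).

```python
import xml.etree.ElementTree as ET

# Namespace of the <Relationships> package part.
PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"
# Relationship type used for embedded images.
IMAGE_TYPE = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"


def image_relationships(rels_xml):
    """Map relationship Id -> Target for every image relationship in the part."""
    root = ET.fromstring(rels_xml)
    return {
        rel.get("Id"): rel.get("Target")
        for rel in root.iter(f"{{{PKG_REL}}}Relationship")
        if rel.get("Type") == IMAGE_TYPE
    }
```

The returned ids can then be compared with the r:embed values on the a:blip elements in document.xml to see whether every picture resolves to a media file.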
I will try to see if I can isolate which difference in the files is the cause of this.
To help with comparing docx files, I wrote a little shell script: https://github.com/jgm/diff-docx. This saves the trouble of unzipping and tidying. (I've now put this in the tools/ directory of this repository instead of its own repository.)
https://github.com/jgm/diff-docx returns a 404 error.
Ah, it looks like the repository https://github.com/jgm/diff-docx was removed in favor of pandoc/tools/diff-zip.sh (see 83a0104).
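As a rough stand-in for the first step of such a comparison (this is not the diff-zip.sh script itself, only a sketch with a hypothetical function name), the member lists of two docx archives can be compared with Python's zipfile module:

```python
import zipfile


def compare_members(docx_a, docx_b):
    """Return (only_in_a, only_in_b): sorted member names unique to each archive.

    Both arguments may be file paths or binary file-like objects.
    """
    with zipfile.ZipFile(docx_a) as a, zipfile.ZipFile(docx_b) as b:
        names_a = set(a.namelist())
        names_b = set(b.namelist())
    return sorted(names_a - names_b), sorted(names_b - names_a)
```

Called as compare_members("Test.docx", "Test2.docx"), this only reports which parts exist in one file and not the other; the contents of shared parts such as document.xml still need a separate text diff.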
I think the issue here is that the picture belongs to a graphic element instead of a drawing element. Not sure whether that's valid OOXML. The fact that Word changes it on saving seems to suggest maybe it's not?
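One way to test this hypothesis is to look for a:blip elements that have no w:drawing ancestor. The sketch below is only an illustration of that check, not how pandoc's docx reader works; the function name blips_outside_drawing is hypothetical, and it assumes the standard WordprocessingML, DrawingML and relationships namespaces.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A = "http://schemas.openxmlformats.org/drawingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def blips_outside_drawing(document_xml):
    """Return the r:embed ids of a:blip elements with no w:drawing ancestor."""
    root = ET.fromstring(document_xml)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent = {child: p for p in root.iter() for child in p}
    orphans = []
    for blip in root.iter(f"{{{A}}}blip"):
        node = blip
        inside_drawing = False
        # Walk up towards the root looking for a w:drawing element.
        while node in parent:
            node = parent[node]
            if node.tag == f"{{{W}}}drawing":
                inside_drawing = True
                break
        if not inside_drawing:
            orphans.append(blip.get(f"{{{R}}}embed"))
    return orphans
```

Reading word/document.xml from each archive with zipfile and passing the bytes to this function would show whether Test.docx and Test2.docx differ in this respect.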
Issue
I've encountered a type of docx file (ones exported from Quip) that does not export its images when using --extract-media and -t markdown. However, if the docx is loaded into Word and then saved out, the images export correctly. This suggests it might be a file formatting issue, but the document renders fine in Word, and when I compared the document.xml in these two files I couldn't spot any distinct difference in the structures.
Test Files
I have attached two files:
Test.docx - the original exported file, containing 2 embedded images
Test2.docx - the original exported file, loaded into and then saved out from Word
Reproduction
pandoc "Test.docx" --verbose --extract-media=test_media --atx-headers -f docx -t markdown -o "Test.md"
Result: No images are exported. Expected: two images to be exported to the test_media folder.
pandoc "Test2.docx" --verbose --extract-media=test_media2 --atx-headers -f docx -t markdown -o "Test2.md"
Result: 2 images are exported to the test_media2 folder, as expected.
Environment
Running Pandoc version 2.7.3 on Windows 10, 64-bit.
Attachments
Test.docx Test2.docx
Thanks for a great tool.