jgm / pandoc

Universal markup converter
https://pandoc.org

Docx to markdown - images are not exported with --extract-media with certain types of docx #5640

Open simondotm opened 5 years ago

simondotm commented 5 years ago

Issue

I've encountered a type of docx file (exported from Quip) whose images are not exported when using --extract-media and -t markdown. However, if the docx is opened in Word and saved again, the images export correctly. This suggests a file formatting issue, but the document renders fine in Word, and when I compared the document.xml in the two files I couldn't spot any distinct difference in their structure.

Test Files

I have attached two files:

Test.docx - the original exported file, containing 2 embedded images
Test2.docx - the original exported file, loaded into and then saved out from Word

Reproduction

    pandoc "Test.docx" --verbose --extract-media=test_media --atx-headers -f docx -t markdown -o "Test.md"

Result: No images are exported.
Expected: Two images exported to the test_media folder.

    pandoc "Test2.docx" --verbose --extract-media=test_media2 --atx-headers -f docx -t markdown -o "Test2.md"

Result: 2 images are exported to the test_media2 folder, as expected.

[INFO] Extracting test_media2\media\image1.png...
[INFO] Extracting test_media2\media\image2.png...

Environment

Running pandoc version 2.7.3 on Windows 10, 64-bit.

Attachments: Test.docx, Test2.docx

Thanks for a great tool.

mb21 commented 5 years ago

The relevant part from Test.docx:

    <wp:docPr id="10" name="media/JIcACABwiXP.png"/>
    <a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
      <a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
        <pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
          <pic:nvPicPr>
            <pic:cNvPr id="0" name="media/JIcACABwiXP.png"/>
            <pic:cNvPicPr/>
          </pic:nvPicPr>
          <pic:blipFill>
            <a:blip r:embed="rId10"/>
            <a:stretch>
              <a:fillRect/>
            </a:stretch>
          </pic:blipFill>
          <pic:spPr>
            <a:xfrm>
              <a:off x="0" y="0"/>
              <a:ext cx="5352176" cy="4219662"/>
            </a:xfrm>
            <a:prstGeom prst="rect"/>
          </pic:spPr>
        </pic:pic>
      </a:graphicData>
    </a:graphic>

Probably related (or the same issues): https://github.com/jgm/pandoc/issues/1810 and https://github.com/jgm/pandoc/issues/5394

Do you know by what tool or word version the docx was generated?

simondotm commented 5 years ago

The tool that generated Test.docx was the Salesforce Quip app. Entirely possible their exported docx markup is somehow at fault here, but it did seem like a valid docx so thought I'd report it as an issue here.

Test2.docx was generated by Word for Office 365, V16.0 32-bit, simply by opening Test.docx and then "saving as" Test2.docx - no other modifications to the doc.

The xml for Test2.docx is:

<wp:docPr id="9" name="media/JIcACA7YtNb.png"/>
<wp:cNvGraphicFramePr/>
<a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
    <a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
        <pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
            <pic:nvPicPr>
                <pic:cNvPr id="0" name="media/JIcACA7YtNb.png"/>
                <pic:cNvPicPr/>
            </pic:nvPicPr>
            <pic:blipFill>
                <a:blip r:embed="rId5"/>
                <a:stretch>
                    <a:fillRect/>
                </a:stretch>
            </pic:blipFill>
            <pic:spPr>
                <a:xfrm>
                    <a:off x="0" y="0"/>
                    <a:ext cx="5352176" cy="2961313"/>
                </a:xfrm>
                <a:prstGeom prst="rect">
                    <a:avLst/>
                </a:prstGeom>
            </pic:spPr>
        </pic:pic>
    </a:graphicData>
</a:graphic>

mb21 commented 5 years ago

Ah yes, they do indeed look similar. The key is probably in the document.xml.rels file, which also contains:

<Relationship Id="rId10" Target="media/JIcACABwiXP.png" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"/>

Btw, the images show up in neither LibreOffice nor Apple Pages...
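
One way to check the relationships side of this is to list every image relationship in a docx and test whether its target is actually present in the archive. The following is a minimal Python sketch, not part of pandoc. The function name `image_relationships` is hypothetical, and the sketch assumes the main part is word/document.xml, so that its relationships live in word/_rels/document.xml.rels. External targets are not handled.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
IMAGE_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
)


def image_relationships(docx):
    """Return (Id, Target, present_in_archive) for each image relationship.

    `docx` is a file path or a binary file-like object.
    Assumption: the relationships part is word/_rels/document.xml.rels.
    """
    with zipfile.ZipFile(docx) as zf:
        names = set(zf.namelist())
        root = ET.fromstring(zf.read("word/_rels/document.xml.rels"))

    rows = []
    for rel in root.findall(f"{{{REL_NS}}}Relationship"):
        if rel.get("Type") != IMAGE_TYPE:
            continue
        target = rel.get("Target")
        if target.startswith("/"):
            # Absolute targets are resolved from the package root.
            member = target.lstrip("/")
        else:
            # Relative targets are resolved against the word/ directory.
            member = posixpath.normpath(posixpath.join("word", target))
        rows.append((rel.get("Id"), target, member in names))
    return rows
```

Calling `image_relationships("Test.docx")` and `image_relationships("Test2.docx")` and comparing the rows would show whether any image target in the Quip export points at a file that is missing from the zip.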

simondotm commented 5 years ago

Useful to know they don't render properly in other apps, since that's indicative of some type of malformed document, in which case it's definitely an issue for Quip.

Posting the XML for reference...

Test.docx document.xml.rels

<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId3" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/settings" Target="settings.xml"/>
  <Relationship Id="rId4" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/webSettings" Target="webSettings.xml"/>
  <Relationship Id="rId5" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image1.jpg"/>
  <Relationship Id="rId6" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/fontTable" Target="fontTable.xml"/>
  <Relationship Id="rId7" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme" Target="theme/theme1.xml"/>
  <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="styles.xml"/>
  <Relationship Id="rId2" Type="http://schemas.microsoft.com/office/2007/relationships/stylesWithEffects" Target="stylesWithEffects.xml"/>
  <Relationship Id="rId8" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/numbering" Target="numbering.xml"/>
  <Relationship Id="rId9" Target="media/JIcACA7YtNb.png" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"/>
<Relationship Id="rId10" Target="media/JIcACABwiXP.png" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"/>

</Relationships>

Test2.docx document.xml.rels

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
    <Relationship Id="rId8" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme" Target="theme/theme1.xml"/>
    <Relationship Id="rId3" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/settings" Target="settings.xml"/>
    <Relationship Id="rId7" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/fontTable" Target="fontTable.xml"/>
    <Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="styles.xml"/>
    <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/numbering" Target="numbering.xml"/>
    <Relationship Id="rId6" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image2.png"/>
    <Relationship Id="rId5" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image1.png"/>
    <Relationship Id="rId4" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/webSettings" Target="webSettings.xml"/>
</Relationships>

I will try to see if I can isolate which difference in the files is the cause of this.

jgm commented 5 years ago

To help with comparing docx files, I wrote a little shell script, https://github.com/jgm/diff-docx, which saves the trouble of unzipping and tidying. (I've now put this in the tools/ directory of this repository instead of its own repository.)

tolot27 commented 5 years ago

https://github.com/jgm/diff-docx returns a 404 error.

tolot27 commented 5 years ago

Ah, it looks like the repository https://github.com/jgm/diff-docx was removed in favor of pandoc/tools/diff-zip.sh (see 83a0104).
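
For anyone without a shell handy, the same "unzip, tidy, diff" idea can be sketched in Python. This is not jgm's script and makes no claim to match its behavior; the function names `tidy_parts` and `diff_docx` are hypothetical, and the sketch assumes only parts ending in .xml or .rels are worth comparing.

```python
import difflib
import xml.dom.minidom
import zipfile


def tidy_parts(docx):
    """Map each XML part name in the archive to its pretty-printed lines."""
    parts = {}
    with zipfile.ZipFile(docx) as zf:
        for name in sorted(zf.namelist()):
            if not name.endswith((".xml", ".rels")):
                continue  # skip binary parts such as images
            dom = xml.dom.minidom.parseString(zf.read(name))
            pretty = dom.toprettyxml(indent="  ")
            # Drop whitespace-only lines that pretty-printing can introduce.
            parts[name] = [ln for ln in pretty.splitlines() if ln.strip()]
    return parts


def diff_docx(a, b):
    """Yield unified-diff lines between the XML parts of two docx files."""
    parts_a, parts_b = tidy_parts(a), tidy_parts(b)
    for name in sorted(set(parts_a) | set(parts_b)):
        yield from difflib.unified_diff(
            parts_a.get(name, []),
            parts_b.get(name, []),
            fromfile=f"a/{name}",
            tofile=f"b/{name}",
            lineterm="",
        )
```

For example, `print("\n".join(diff_docx("Test.docx", "Test2.docx")))` would print the differences part by part. Parts that exist in only one file show up as entirely added or removed.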

jgm commented 3 years ago

I think the issue here is that the picture belongs to a graphic element instead of a drawing element. Not sure whether that's valid ooxml. The fact that Word changes it on saving seems to suggest maybe it's not?
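
The hypothesis above can be checked mechanically: find every a:graphic element in document.xml and see whether any of its ancestors is a w:drawing element. Below is a minimal Python sketch under that reading of the comment. The function name `graphics_outside_drawing` is hypothetical, nothing here reflects how pandoc's docx reader is implemented, and whether Test.docx actually has this structure has not been verified in this thread.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"


def graphics_outside_drawing(document_xml):
    """Return the a:graphic elements that have no w:drawing ancestor.

    `document_xml` is the content of word/document.xml (str or bytes).
    """
    root = ET.fromstring(document_xml)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent = {child: node for node in root.iter() for child in node}
    drawing_tag = f"{{{W_NS}}}drawing"

    orphans = []
    for graphic in root.iter(f"{{{A_NS}}}graphic"):
        node = graphic
        # Walk upwards until we hit a w:drawing or run out of ancestors.
        while node in parent and node.tag != drawing_tag:
            node = parent[node]
        if node.tag != drawing_tag:
            orphans.append(graphic)
    return orphans
```

To run it on a real file, read word/document.xml out of the docx with `zipfile.ZipFile(path).read("word/document.xml")` and pass the bytes in. A non-empty result for Test.docx and an empty one for Test2.docx would support the explanation above.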