jgm / pandoc

Universal markup converter
https://pandoc.org

Images within table context not being extracted #10018

Open somonek opened 1 month ago

somonek commented 1 month ago

Explain the problem.

The problem is pretty straightforward: images placed in a table are not being extracted. Tested on macOS and on an instance running the pandoc/core:3.2.1-ubuntu Docker image. Command used:

pandoc test.docx -o output.html --extract-media ./

Example document to reproduce the issue: test.docx

Pandoc version? Output of pandoc --version:

pandoc 3.2.1
Features: +server +lua
Scripting engine: Lua 5.4

jgm commented 1 month ago

Converting to odt also loses the images. Looks like they are not being included in the MediaBag?

jgm commented 1 month ago

Odd: here's a docx with an image in a table, and the image does get extracted: test1.docx

jgm commented 1 month ago

OK, -t native confirms that the images aren't parsed at all in test.docx. (So it's nothing about --extract-media specifically.)

I suspect this is because in this document, the images are in a wpg:grpSp element.
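
The `-t native` check can also be scripted against pandoc's JSON output (`pandoc test.docx -t json`), which carries the same AST. The following is a minimal sketch, not part of pandoc: `count_images` is a hypothetical helper name, and it assumes only that every AST element is an object with a `"t"` tag, as in pandoc's JSON format.

```python
import json


def count_images(node):
    """Recursively count Image elements in a pandoc JSON AST (or any fragment of it)."""
    if isinstance(node, dict):
        own = 1 if node.get("t") == "Image" else 0
        return own + sum(count_images(value) for value in node.values())
    if isinstance(node, list):
        return sum(count_images(item) for item in node)
    return 0  # strings, numbers, booleans, None


# Usage (assumes pandoc is on PATH and test.docx is the attached document):
#   pandoc test.docx -t json > test.json
#   then: count_images(json.load(open("test.json")))
```

A count of zero for a document that visibly contains pictures would point at the reader rather than at `--extract-media`.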

jgm commented 1 month ago

The structure in more detail:

mc:AlternateContent
  mc:Choice Requires="wpg"
    w:drawing
      a:graphic
        a:graphicData
          wpg:wgp
            wpg:grpSp
              wpg:grpSp
                wpg:grpSp
                  wpg:grpSp
                    pic:pic
  mc:Fallback
    w:pict
      (this one contains image data)

I omit a lot of complexity. I'm not sure why all this is there. In this case going to the fallback is the thing to do; I'll need to see what the Reader is currently doing here.
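
To check whether another document has the same shape, the structure above can be inspected directly in `word/document.xml`. This is a rough sketch using only the Python standard library; `describe_alternate_content` is a hypothetical helper, and the two namespace URIs are the standard ones for markup compatibility (`mc`) and WordprocessingML (`w`).

```python
import xml.etree.ElementTree as ET

MC = "http://schemas.openxmlformats.org/markup-compatibility/2006"
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def describe_alternate_content(document_xml):
    """Return one (requires_values, fallback_has_pict) pair per mc:AlternateContent."""
    root = ET.fromstring(document_xml)
    report = []
    for alt in root.iter(f"{{{MC}}}AlternateContent"):
        # Requires attribute of each direct mc:Choice child, in document order.
        requires = [c.get("Requires") for c in alt.findall(f"{{{MC}}}Choice")]
        fallback = alt.find(f"{{{MC}}}Fallback")
        has_pict = (
            fallback is not None
            and fallback.find(f".//{{{W}}}pict") is not None
        )
        report.append((requires, has_pict))
    return report


# Usage (assumes test.docx is the attached document):
#   import zipfile
#   xml_bytes = zipfile.ZipFile("test.docx").read("word/document.xml")
#   print(describe_alternate_content(xml_bytes))
```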

jgm commented 1 month ago

Currently the reader just always uses the first Choice in mc:AlternateContent. We could check for Requires="wpg" and take the fallback in that case...but I don't really even understand what wpg is about.
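
The idea floated here, taking the fallback when the first Choice requires `wpg`, could look roughly like the sketch below. It is written in Python purely for illustration and is not the docx reader's actual (Haskell) code; `pick_branch` and its `fallback_for` parameter are hypothetical names. It treats `Requires` as a whitespace-separated list of namespace prefixes.

```python
import xml.etree.ElementTree as ET

MC = "http://schemas.openxmlformats.org/markup-compatibility/2006"


def pick_branch(alternate_content, fallback_for=("wpg",)):
    """Return the mc:Choice or mc:Fallback element to read from.

    Default: the first mc:Choice (the behaviour described above).
    Exception: if that Choice requires a prefix listed in fallback_for
    and an mc:Fallback exists, return the Fallback instead.
    """
    choice = alternate_content.find(f"{{{MC}}}Choice")
    fallback = alternate_content.find(f"{{{MC}}}Fallback")
    if choice is None:
        return fallback  # may be None if the element is empty
    required = set((choice.get("Requires") or "").split())
    if fallback is not None and required & set(fallback_for):
        return fallback
    return choice
```

Whether `wpg` is the only prefix that deserves this treatment is an open question in the thread, which is why the list is left as a parameter.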

somonek commented 1 month ago

I bumped into the document that produces this issue, but I don't really know its history. wpg (as per my findings, the “Word Processing Graphics” namespace) seems to refer to drawings made directly in Word, so I assume the author first tried to draw some diagrams/images directly in Word, then it got too complex and they switched to a different tool, importing the images into the table cells in place of the inline drawings. Perhaps that's how the underlying XML became a bit unusual. That's only my assumption, though; I'm not sure it makes sense.