Open somonek opened 3 months ago
Converting to odt also loses the images. Looks like they are not being included in the MediaBag?
Odd: here's a docx with an image in a table, and the image does get extracted: test1.docx
OK, -t native confirms that the images aren't parsed at all in test.docx. (So it's nothing about --extract-media specifically.)
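For anyone who wants to script the same check, here is a rough sketch that counts Image nodes in pandoc's JSON output (-t json) instead of eyeballing the native output. The function name count_images is made up for this example, and the sketch assumes only the usual shape of the JSON AST, where each node is an object with a "t" (type) field.

```python
import json

def count_images(node) -> int:
    """Recursively count nodes whose "t" field is "Image" in a pandoc JSON AST."""
    if isinstance(node, dict):
        own = 1 if node.get("t") == "Image" else 0
        return own + sum(count_images(value) for value in node.values())
    if isinstance(node, list):
        return sum(count_images(item) for item in node)
    return 0  # strings, numbers, None: nothing to count

# Possible usage (assumes pandoc is on PATH; not run here):
#   pandoc test.docx -t json > test.json
#   count_images(json.load(open("test.json")))
```

A count of zero for a document that visibly contains pictures would point at the reader rather than at media extraction.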
I suspect this is because in this document, the images are in a wpg:grpSp element.
The structure in more detail:
mc:AlternateContent
  mc:Choice Requires="wpg"
    w:drawing
      a:graphic
        a:graphicData
          wpg:wgp
            wpg:grpSp
              wpg:grpSp
                wpg:grpSp
                  wpg:grpSp
                    pic:pic
  mc:Fallback
    w:pict (this one contains image data)
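To check whether another document has the same shape, a small sketch like the following could count the pic:pic elements that sit inside a wpg:grpSp. The helper names are invented for this example; the namespace URIs are the standard ones for the wpg and pic prefixes, and the main document part is assumed to live at word/document.xml.

```python
import zipfile
import xml.etree.ElementTree as ET

WPG = "http://schemas.microsoft.com/office/word/2010/wordprocessingGroup"
PIC = "http://schemas.openxmlformats.org/drawingml/2006/picture"

def read_document_xml(docx_path: str) -> bytes:
    """Return the raw main document part of a .docx (assumed at word/document.xml)."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read("word/document.xml")

def count_grouped_pictures(document_xml: bytes) -> int:
    """Count pic:pic elements nested, at any depth, inside a wpg:grpSp."""
    root = ET.fromstring(document_xml)
    seen = set()
    for group in root.iter(f"{{{WPG}}}grpSp"):
        for pic in group.iter(f"{{{PIC}}}pic"):
            seen.add(id(pic))  # nested groups reach the same picture; count it once
    return len(seen)

# Possible usage (hypothetical path):
#   count_grouped_pictures(read_document_xml("test.docx"))
```

A non-zero result would suggest the pictures are wrapped in drawing groups, as in the structure described above.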
I omit a lot of complexity. I'm not sure why all this is there. In this case going to the fallback is the thing to do; I'll need to see what the Reader is currently doing here.
Currently the reader just always uses the first Choice in mc:AlternateContent.
We could check for Requires="wpg" and take the fallback in that case... but I don't really even understand what wpg is about.
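A minimal sketch of that idea, in Python rather than in the reader's own Haskell, so it only illustrates the selection logic and is not pandoc's code. The function name pick_branch and the unsupported set are assumptions for this example. Note that Requires holds namespace prefixes, so a real implementation would resolve them to URIs; this sketch compares the prefixes directly.

```python
import xml.etree.ElementTree as ET

MC = "http://schemas.openxmlformats.org/markup-compatibility/2006"

def pick_branch(alternate_content, unsupported=frozenset({"wpg"})):
    """Pick which child of mc:AlternateContent to process.

    Returns the first mc:Choice whose Requires list names no unsupported
    prefix; otherwise returns mc:Fallback (or None if there is none).
    """
    for choice in alternate_content.findall(f"{{{MC}}}Choice"):
        required = set(choice.get("Requires", "").split())
        if required.isdisjoint(unsupported):
            return choice
    return alternate_content.find(f"{{{MC}}}Fallback")
```

With "wpg" treated as unsupported, the structure from this document would resolve to mc:Fallback, which is where the w:pict with the image data lives. Passing an empty set reproduces the current behaviour of taking the first Choice.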
I bumped into the document that produces this issue, but I don't know its exact history. wpg (as far as I can tell, the “Word Processing Graphics” namespace) seems to refer to drawings made directly in Word, so I assume the author first tried to draw some diagrams/images directly in Word, then it got too complex and they switched to a different tool, and imported the resulting images into the table cells in place of the inline drawings.
Perhaps that's how the underlying xml became a bit unusual. It's my assumption though, not sure if that makes sense.
Explain the problem.
The problem is pretty straightforward: images placed in a table are not being extracted. Tested on macOS and on an instance running the pandoc/core:3.2.1-ubuntu Docker image. Command used:
pandoc test.docx -o output.html --extract-media ./
Example document to reproduce the issue: test.docx
Pandoc version?
pandoc --version
output: