jgm / pandoc

Universal markup converter
https://pandoc.org

Images of objects are not extracted #4735

Closed miraks31 closed 3 years ago

miraks31 commented 6 years ago

Hi,

I use pandoc 2.1.1 on Windows and Linux. When I try to convert the attached docx file (issue_object_as_image.docx), the image is not extracted.

I think this is because it is not a simple image but an object displayed as an image. However, the image is present in the media directory inside the docx (I checked by changing the extension to .zip and extracting all files), so I hope pandoc can be made to extract this kind of image.
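The manual check described above (renaming the file to .zip and looking in the media directory) can also be done without extracting anything. The following is a minimal sketch, assuming the usual docx layout in which media files live under `word/media/`; the function name `list_media` is hypothetical and is not part of pandoc.

```python
import zipfile


def list_media(docx):
    """Return the names of all files stored under word/media/ in a .docx.

    `docx` may be a path or a binary file-like object; a .docx is a zip archive.
    """
    with zipfile.ZipFile(docx) as archive:
        return sorted(
            name
            for name in archive.namelist()
            # Skip explicit directory entries, which end with a slash.
            if name.startswith("word/media/") and not name.endswith("/")
        )
```

Calling `list_media("issue_object_as_image.docx")` would list the embedded media entries, which is a quick way to confirm that an image exists in the archive even when the converter does not emit it.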

To reproduce this issue:

        pandoc.exe -s --from docx-simple_tables-multiline_tables-grid_tables+pipe_tables --to commonmark+pipe_tables issue_object_as_image.docx -o issue_object_as_image.md --extract-media media --file-scope --wrap=none --atx-headers

Result: The image is not extracted, and the reference to the image is missing from the Markdown file.

Expected result: The image is extracted, and the reference to the image is present in the Markdown file.

Thank you for this great tool. Regards.

jgm commented 6 years ago

@jkr what do you think, is this feasible?

miraks31 commented 6 years ago

Hi @jkr,

A fix for this bug would be much appreciated. How can I help?

jgm commented 5 years ago

Sorry, I think this is out of scope for us. The image IS there, but it's actually not referred to by anything else in the docx, as far as I can see.

miraks31 commented 5 years ago

Hi jgm,

In the file word\document.xml, you can find the object with the link to the image: <v:imagedata r:id="rId5" o:title=""/>

In the file word\_rels\document.xml.rels, you can find the corresponding relationship: <Relationship Id="rId5" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image1.emf"/>

So the image itself is in: media/image1.emf

So the link between the object and its associated image does exist.
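The lookup chain described here (relationship id, then the .rels entry, then the media file) can be sketched in a few lines. This is only an illustration, assuming the relationships for the main document part are stored at `word/_rels/document.xml.rels` and that targets are relative to `word/`; the function name `relationship_targets` is hypothetical and does not reflect how pandoc's docx reader is implemented.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile

# Namespace of the <Relationships>/<Relationship> elements in .rels files.
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def relationship_targets(docx):
    """Map each relationship Id of the main document part to an archive path."""
    with zipfile.ZipFile(docx) as archive:
        rels = ET.fromstring(archive.read("word/_rels/document.xml.rels"))
    targets = {}
    for rel in rels.iter("{%s}Relationship" % RELS_NS):
        if rel.get("TargetMode") == "External":
            continue  # external targets (e.g. hyperlinks) are not archive members
        # Relative targets are resolved against the folder of the owning part.
        path = posixpath.normpath(posixpath.join("word", rel.get("Target")))
        targets[rel.get("Id")] = path.lstrip("/")
    return targets
```

For the relationship quoted in this comment, `relationship_targets(path)["rId5"]` would resolve to `word/media/image1.emf`, the archive member holding the picture.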

In Word documents, pictures are often added by copy/paste from other applications (e.g. Visio), and they are stored not as a picture but as an object with an associated picture.

It would be great if pandoc could extract those pictures too.

I think all the information needed to do this is there.

Thank you again for the work you have done.

jgm commented 5 years ago

Hm, not sure how I missed that! Here's the XML:

        <w:object w:dxaOrig="9735" w:dyaOrig="5850">
          <v:shapetype id="_x0000_t75" coordsize="21600,21600" o:spt="75" o:preferrelative="t" path="m@4@5l@4@11@9@11@9@5xe" filled="f" stroked="f">
            <v:stroke joinstyle="miter"/>
            <v:formulas>
              <v:f eqn="if lineDrawn pixelLineWidth 0"/>
              <v:f eqn="sum @0 1 0"/>
              <v:f eqn="sum 0 0 @1"/>
              <v:f eqn="prod @2 1 2"/>
              <v:f eqn="prod @3 21600 pixelWidth"/>
              <v:f eqn="prod @3 21600 pixelHeight"/>
              <v:f eqn="sum @0 0 1"/>
              <v:f eqn="prod @6 1 2"/>
              <v:f eqn="prod @7 21600 pixelWidth"/>
              <v:f eqn="sum @8 21600 0"/>
              <v:f eqn="prod @7 21600 pixelHeight"/>
              <v:f eqn="sum @10 21600 0"/>
            </v:formulas>
            <v:path o:extrusionok="f" gradientshapeok="t" o:connecttype="rect"/>
            <o:lock v:ext="edit" aspectratio="t"/>
          </v:shapetype>
          <v:shape id="_x0000_i1025" type="#_x0000_t75" style="width:488.1pt;height:293pt" o:ole="">
            <v:imagedata r:id="rId5" o:title=""/>
          </v:shape>
          <o:OLEObject Type="Embed" ProgID="Visio.Drawing.11" ShapeID="_x0000_i1025" DrawAspect="Content" ObjectID="_1591516258" r:id="rId6"/>
        </w:object>

I don't actually understand what this does. Is the image with id rId5 just a bitmap version of the whole drawn object, or is it part of the object? If the former, I guess we can look in w:object for v:shape and get the imagedata.
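To make the proposal concrete, here is a minimal sketch of "look in w:object for v:shape and get the imagedata", written in Python for illustration rather than in pandoc's Haskell. It assumes, as is usual for embedded OLE objects in OOXML but is not confirmed in this thread, that the `v:imagedata` relationship points at a rendered preview picture while the `r:id` on `o:OLEObject` points at the embedded binary. The function name `describe_objects` and the trimmed `SAMPLE` document are hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
V = "urn:schemas-microsoft-com:vml"
O = "urn:schemas-microsoft-com:office:office"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"

# Trimmed version of the XML quoted above, wrapped so the prefixes are declared.
SAMPLE = """
<w:body xmlns:w="%s" xmlns:v="%s" xmlns:o="%s" xmlns:r="%s">
  <w:object w:dxaOrig="9735" w:dyaOrig="5850">
    <v:shape id="_x0000_i1025" type="#_x0000_t75" o:ole="">
      <v:imagedata r:id="rId5" o:title=""/>
    </v:shape>
    <o:OLEObject Type="Embed" ProgID="Visio.Drawing.11"
                 ShapeID="_x0000_i1025" r:id="rId6"/>
  </w:object>
</w:body>
""" % (W, V, O, R)


def describe_objects(root):
    """Yield (ProgID, image relationship id, OLE relationship id) per w:object."""
    for obj in root.iter("{%s}object" % W):
        # The picture reference sits on v:imagedata inside v:shape.
        imagedata = obj.find(".//{%s}shape/{%s}imagedata" % (V, V))
        # The embedded object itself is described by o:OLEObject.
        ole = obj.find("{%s}OLEObject" % O)
        yield (
            ole.get("ProgID") if ole is not None else None,
            imagedata.get("{%s}id" % R) if imagedata is not None else None,
            ole.get("{%s}id" % R) if ole is not None else None,
        )
```

Running `list(describe_objects(ET.fromstring(SAMPLE)))` gives one tuple that separates the two relationship ids, so a reader could treat `rId5` as an ordinary image reference and ignore `rId6`.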

anuragnagardeveloper commented 4 years ago

Is this resolved? I am facing the same problem.

jgm commented 4 years ago

When an issue is open, it means that it has not yet been resolved.

mbrackeantidot commented 3 years ago

This issue has been resolved by the commit above.