python-openxml / python-docx

Create and modify Word documents with Python
MIT License

feature: InlineShape.image ? #249

Open DKWoods opened 8 years ago

DKWoods commented 8 years ago

Love your module.

I'm trying to add *.docx import to my python qualitative analysis tool, and python-docx has allowed me to bring content to a wxPython RichTextCtrl really easily. I'm getting all the character and paragraph level formatting and all of the text, which has come together really quickly.

But I seem to be missing something in reading in images. I can get the size and type of the images, but how do I get the IMAGE DATA to convert into an image object?

scanny commented 8 years ago

The API doesn't support this; you'll need to dig into the XML with lxml, and maybe lean on some python-docx internals, if you want it badly enough :)

The general gist of the XML is here: http://python-docx.readthedocs.org/en/latest/dev/analysis/features/shapes-inline.html#specimen-xml and here: http://python-docx.readthedocs.org/en/latest/dev/analysis/features/picture.html#specimen-xml

You can get a handle to the wp:inline element using InlineShape._inline: https://github.com/python-openxml/python-docx/blob/master/docx/shape.py#L57

From there you can navigate to the pic:pic element using something like this:

_inline.graphic.graphicData.pic.blipFill.blip

From there you'll need to parse the embed link to get the relationship to the picture, which will be stored as a separate part.

These are just general guidelines, I don't have time unfortunately to get it down to working details, but should give you an idea of what's involved. I'm sure there must be ways the existing internals can help but I can't remember just now how it all works. You'll need to trace through the code a bit if it's worth the effort to you.

I would start that by tracing through how the .add_picture() bit works, you're basically looking to roughly reverse that.
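
To make the chain from embed link to relationship to separate part concrete, here is a minimal standard-library sketch that bypasses python-docx entirely. It assumes the main document part lives at `word/document.xml` with its relationships in `word/_rels/document.xml.rels` (the usual layout, though a package is not required to use it), and `inline_image_blobs` is a hypothetical helper name, not part of any library.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs used by the parts we read.
NS = {
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
    "wp": "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing",
    "rel": "http://schemas.openxmlformats.org/package/2006/relationships",
}
R_EMBED = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}embed"


def inline_image_blobs(docx_file):
    """Return [(part_name, image_bytes), ...] for inline pictures, in document order."""
    with zipfile.ZipFile(docx_file) as zf:
        # Map each relationship id (e.g. "rId4") to its target part.
        rels = ET.fromstring(zf.read("word/_rels/document.xml.rels"))
        targets = {
            rel.get("Id"): rel.get("Target")
            for rel in rels.findall("rel:Relationship", NS)
            if rel.get("TargetMode") != "External"  # skip links to outside files
        }
        body = ET.fromstring(zf.read("word/document.xml"))
        found = []
        for blip in body.iterfind(".//wp:inline//a:blip", NS):
            r_id = blip.get(R_EMBED)
            if r_id not in targets:  # e.g. a linked picture with no r:embed
                continue
            # Targets are normally relative to word/; tolerate absolute ones too.
            name = posixpath.normpath(posixpath.join("word", targets[r_id]))
            name = name.lstrip("/")
            found.append((name, zf.read(name)))
        return found
```

Calling `inline_image_blobs("some.docx")` (the filename is a placeholder) should give one tuple per embedded inline picture; python-docx's `related_parts` lookup does the same relationship resolution for you.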

DKWoods commented 8 years ago

Hi Steve,

I'm starting to explore the issue of loading images from docx files.
It's very interesting.

I've figured out inline shapes and getting the (internal) image name and ResourceID values.

I've explored Relationships and have figured out how to link the ResourceID with a physical file location in the docx archive file.

I've been exploring the underbelly of runs, and have now determined a way to tell if a run is associated with an image using the run's element.tags. But I can't see how you tell WHAT image a run is associated with. I can't find any reference to ResourceIDs, internal image name, relationships, nothing. How are inline_shapes and runs linked together so I know which inline shape is associated with whatever run I find that has an image element? It seems like the data has to be there somewhere, or you would lose your images if you loaded, then re-saved your document (and you don't). I just can't seem to find it.

Any suggestions would be welcome.

David

David K. Woods, Ph.D. Researcher, Lead Transana Developer Wisconsin Center for Education Research University of Wisconsin, Madison http://www.transana.org

scanny commented 8 years ago

Most of the grunt work with images is taken care of in the python-docx internals. You should be able to mostly leverage that for what you need, definitely all the bits about looking up image blobs from relationship ids and so on.

The first key thing you need is the so-called rId (relationship id). This comes from the a:blip element in the pic:pic element in the InlineShape object somewhere. I expect you've located that already: http://python-docx.readthedocs.org/en/latest/dev/analysis/features/shapes/picture.html

With the rId in hand (something like 'rId5'), you can get the "related part", which will be an image part in this case:

document_part = document.part

# OR (if inline_shapes is already handy)
document_part = document.inline_shapes.part

# Then lookup the image part by rId
rId = however_you_get_the_rId_from_the_inline_shape()
image_part = document_part.related_parts[rId]

# docx.parts.image.ImagePart has some useful bits
filename = image_part.filename

# and also provides access to a docx.image.image.Image object which has even more goodies
image = image_part.image
bytes_of_image = image.blob
... and a bunch of other bits like dimensions, filename, content type, extension, etc.

Let me know if you need more once you've had a look at these :)
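
Pulling those pieces together, a minimal sketch of saving every inline picture might look like the following. It assumes the rId comes from the `blip` element's `embed` attribute on the path given earlier in the thread, and it names each file after the part name (e.g. `/word/media/image1.png`) because `filename` may fall back to a generic name such as `image.png` for pictures loaded from an existing file; that fallback is worth verifying against your installed version. `save_inline_images` is a hypothetical helper, not part of python-docx.

```python
import os
import posixpath


def save_inline_images(document, out_dir="."):
    """Write each inline picture in `document` to `out_dir`; return the paths written."""
    written = []
    for shape in document.inline_shapes:
        pic = shape._inline.graphic.graphicData.pic
        if pic is None:  # inline shape that is not a picture (a chart, say)
            continue
        r_id = pic.blipFill.blip.embed
        if r_id is None:  # linked rather than embedded picture
            continue
        image_part = document.part.related_parts[r_id]
        # The part name is unique within the package, so files will not collide.
        name = posixpath.basename(str(image_part.partname))
        path = os.path.join(out_dir, name)
        with open(path, "wb") as f:
            f.write(image_part.blob)
        written.append(path)
    return written
```

Usage would be along the lines of `save_inline_images(docx.Document("report.docx"), "images")`, where both the filename and the output directory are placeholders and the directory must already exist.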

scanny commented 8 years ago

Hi David, what did you end up doing with this? Just wondering if a reasonably clear new API feature occurred to you after working this challenge that might be handy to have. Maybe InlineShape.image or something that retrieved an Image object for you?

Just sorting through the issue list here and wanted to encapsulate this one in the title if there was something you thought made sense.

DKWoods commented 8 years ago

Hi Steve,

I haven't done anything about it yet. I want to get feature/tabstops finished before I dive into this. One thing at a time is probably best right now.

David

aschilling commented 8 years ago

Hi Steve, Hi David

I am struggling at the moment with the same issue as David. In particular, I need to copy tables, paragraphs, and images from one docx to another docx. To do so, the iter_block_items method outlined in https://github.com/python-openxml/python-docx/issues/40 was of great help for me. However, this method only considers paragraphs and tables. Is there a way to extend this method to also consider inline shapes?

Best regards Andy

scanny commented 8 years ago

Anything is possible for the diligent developer :)

But not something that anyone is working on at the moment as far as I know, if that's what you're asking :)

DKWoods commented 7 years ago

Hi Steve,

I am turning my attention to this problem again after a number of distractions such as making sure my kids can eat most days. You know how that goes.

I've never been able to get a handle on a specific Inline object from a Run when reading a file or when using run.add_picture() rather than document.add_picture().

Document.add_picture() produces a CT_Inline object, and this allows access to everything that's needed. However, run.add_picture() produces a CT_Run object with something in drawing_lst[0], but that something is not a CT_Inline object and I haven't been able to crack what that something is. (It's a lxml.etree._Element object, but I can't figure out where to go from there.) When reading a file, images are held in Runs, not CT_Inline objects no matter how they were created.

My impression is that the pieces are mostly there for reading files and seeing graphics within Runs, but the connections from one level of the XML to the next are getting lost on the oxml level at the w:drawing level. I believe this because the CT_Drawing object referenced in docs/dev/analysis/features/text/run-content.rst and docs/dev/analysis/features/shapes/shapes-inline.rst does not appear to actually exist in the python_docx code. CT_Inline exists, and all the objects needed down the line from there to get all the data needed from the XML exist, but without the CT_Drawing object, nothing appears to be accessible.

So I'm thinking that I need to add CT_Drawing to oxml/shape.py, and that if I do this right, linking it to CT_Inline correctly, then I'm just about where I need to be. I'll be able to read the CT_Drawing object in the Run's drawing_lst to get the information I need to proceed.

Does that seem possible to you? Does this approach make sense? Let me know if you need more information to make sense of what I'm saying, beyond what's already in the existing features documentation.

David

EDIT 30 minutes later: To answer my own question, yes, that approach makes sense. IT WORKS. I can now gain access to images in runs in my modified python_docx source code. Tomorrow morning, I start working on how to submit it properly so it can be integrated into the release version of the code.

scanny commented 7 years ago

Yes, that makes sense to me David. Document.inline_shapes uses XPath (a few calls down the call tree) to locate the wp:inline elements, so explicit navigation to the intervening w:drawing elements was never required before now. My policy has always been not to add oxml features until they were needed to pass a unit test; so that explains its absence.

I notice the MS API has Range.InlineShapes, with Range being a superset of our Run (which doesn't appear by itself in the MS API). So initially I'd be inclined to add an .inline_shapes property to Run. That would allow you to test for the presence of an image and also to access it. It would also give a nice round-trip acceptance testing capability for Run.add_picture.

In practice, I think one finds at most a single picture in a run. But the schema places no limitation on how many can appear in the same run. That's good for us I think; we won't need Run.add_picture() to keep track of whether it already has one.
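
Until something like `Run.inline_shapes` exists, the same test-and-access behavior can be approximated with a pair of free functions. This is only a sketch under assumptions: it relies on the private `run._r` element, on the `w` and `wp` prefixes being resolvable by its `.xpath()` method, and on `run.part` returning the part that owns the run. Both function names are hypothetical.

```python
def iter_run_image_parts(run):
    """Yield the image part for each embedded inline picture in `run`, in order."""
    for inline in run._r.xpath("./w:drawing/wp:inline"):
        pic = inline.graphic.graphicData.pic
        if pic is None:  # inline graphic that is not a picture
            continue
        r_id = pic.blipFill.blip.embed
        if r_id is not None:  # None means linked, not embedded
            yield run.part.related_parts[r_id]


def run_has_picture(run):
    """True if `run` contains at least one embedded inline picture."""
    return next(iter_run_image_parts(run), None) is not None
```

Because the schema allows any number of drawings per run, the generator form avoids assuming there is at most one.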

saruagithub commented 6 years ago

@aschilling Have you solved the problem of copying tables, paragraphs, and images from one docx to another?

mfripp commented 3 years ago

@scanny You mentioned in a comment that you were considering adding a .inline_shapes attribute to Run. I'm using docx version 0.8.10 and it doesn't look like this exists yet. Is there some other way to check whether a run has an inline shape in it? I've noticed that if I copy paragraphs using doc._body._body._insert_p(para._p) (from this comment), the inline shape is copied too (yay!). But I can't see any way to access the inline shape from the paragraph or run.

I have managed to save a copy of the inline shape using this:

rId = doc.inline_shapes[0]._inline.graphic.graphicData.pic.blipFill.blip.embed
image_part = doc.part.related_parts[rId]
filename = image_part.filename
bytes_of_image = image_part.image.blob
with open(filename, 'wb') as f:  # make a copy in the local dir
    f.write(bytes_of_image)

However, I can't figure out whether any particular paragraph or run has a picture in it, and navigate to the image object from there. Any suggestions?

mfripp commented 3 years ago

I got a little further with this:

I found that a run r has an attribute _r which seems to have lxml element objects with data closer to the original Word docx definition. In the run that has an inline image, I find that list(r._r) shows the following elements:

[<CT_RPr '<w:rPr>' at 0x115bf7b80>,
 <Element {http://schemas.openxmlformats.org/wordprocessingml/2006/main}lastRenderedPageBreak at 0x115c44a40>,
 <Element {http://schemas.openxmlformats.org/wordprocessingml/2006/main}drawing at 0x115c44a80>]

I assume the first part is the text. The last element has data on the picture.

Digging deeper, I find that I can get the rId as

rId = r._r[2][0][-1][0][0][1][0].embed

This is the chain of elements I went through to get there:

list(r._r[2]) = [<CT_Inline '<wp:inline>' at 0x115616590>]

list(r._r[2][0]) = [
    <CT_PositiveSize2D '<wp:extent>' at 0x115c497c0>,
    <Element {http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing}effectExtent at 0x115c17180>,
    <CT_NonVisualDrawingProps '<wp:docPr>' at 0x115c4d950>,
    <Element {http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing}cNvGraphicFramePr at 0x115c179c0>,
    <CT_GraphicalObject '<a:graphic>' at 0x115c4d7c0>
]

list(r._r[2][0][-1]) = [
    <CT_GraphicalObjectData '<a:graphicData>' at 0x115c4dd60>
]

list(r._r[2][0][-1][0]) = [<CT_Picture '<pic:pic>' at 0x115c97860>]

list(r._r[2][0][-1][0][0]) = [
    <CT_PictureNonVisual '<pic:nvPicPr>' at 0x115c97d60>,
    <CT_BlipFillProperties '<pic:blipFill>' at 0x115c976d0>,
    <CT_ShapeProperties '<pic:spPr>' at 0x115bf7220>
]

list(r._r[2][0][-1][0][0][1]) = [
    <CT_Blip '<a:blip>' at 0x1155ad630>,
    <Element {http://schemas.openxmlformats.org/drawingml/2006/main}stretch at 0x115c20400>
]

This doesn't look easy to do programmatically, but I could probably do it by inspecting the .tag attribute for all the objects in the list at each level, and searching for the one I expect. But I think I'd also need to learn more about how text is represented in r._r, so I could figure out which text is before and after the picture (if any).

Is there any more direct way to figure out where an inline picture comes in a run and to find its size, type/filename and bytes?

Edit After digging a little more, I've found that I can get the rId via

rId = r._r[2][0].graphic.graphicData.pic.blipFill.blip.embed

Then I can use the code above to get a copy of the image.

But I'm having trouble understanding why some elements below r._r can be accessed with dot notation and others can't. And I think I probably still need to figure out something about where the picture (r._r[2]) sits relative to text in the same run.
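
One way to see where a picture sits relative to text is to walk the run element's children in order and classify each by its tag, instead of indexing by position. The sketch below assumes only the generic element interface (iteration, `.tag`, `.text`, `.iter()`, `.get()`), which both lxml and the standard library's ElementTree provide; `run_content` and `local_name` are hypothetical names.

```python
R_EMBED = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}embed"
A_BLIP = "{http://schemas.openxmlformats.org/drawingml/2006/main}blip"


def local_name(tag):
    """Strip the '{namespace-uri}' prefix from a Clark-notation tag."""
    return tag.rsplit("}", 1)[-1]


def run_content(r_element):
    """Yield ('text', str) and ('picture', rId) items in document order."""
    for child in r_element:
        if not isinstance(child.tag, str):
            continue  # comments and processing instructions have no string tag
        name = local_name(child.tag)
        if name == "t":
            yield ("text", child.text or "")
        elif name == "drawing":
            # Search the whole drawing subtree so optional wrappers don't matter.
            for blip in child.iter(A_BLIP):
                yield ("picture", blip.get(R_EMBED))
```

With python-docx this would be called as `list(run_content(run._r))`; other children such as `w:rPr` or `w:lastRenderedPageBreak` are simply skipped.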

scanny commented 3 years ago

@mfripp these secrets are revealed in this part of the code: https://github.com/python-openxml/python-docx/blob/master/docx/oxml/text/run.py#L22

print(type(run._r)) will give CT_R which is the "custom element-class" referenced above. All w:r (run) XML elements are parsed as an instance of this class. The items declared at the top like rPr, t, and br are instantiated by the metaclass for these element classes as names (think @propertys) on the object. The ZeroOrOne(), ZeroOrMore(), etc. functions determine whether the property is an optional element or a list of elements respectively. There are also OneAndOnlyOne() and OneOrMore() that appear elsewhere.

To demonstrate:

>>> r = run._r
>>> type(r)
CT_R
>>> type(r.rPr)
CT_RPr  # run properties, basically font with bold, italic, size, font name, etc.
>>> len(r.t_lst)  # all sequence properties get `_lst` appended, so OneOrMore or ZeroOrMore "fields".
1
>>> dir(r)
... long list of all the available methods and properties, many of which are defined by the metaclass ...
>>> r.xml
... dump of the XML for this run element only ...

So whatever you want from a run element should be available by name. Accessing sub-elements by index is very unreliable because so many are optional and others can appear more than once, like <w:t> and <w:br> in this case.

If you know where you're headed then XPath is usually faster than digging through a hierarchy where there are optional elements, like:

>>> r.xpath(".//pic:pic")
... a list of zero or more `<pic:pic>` elements in this run (check the XML dump to confirm the prefix) ...

Not all elements have custom element classes, only those we've provided some access to via python-docx. An element without a custom element class is still an lxml.etree._Element object and has that interface available (search on that string for the lxml API documentation) with methods like .getparent() and .addnext().

That's a start toward understanding anyway.
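
As a rough illustration of why only declared children are reachable by name, here is a toy version of the idea built on plain descriptors and the standard library's ElementTree. It is not how python-docx implements it (that uses a metaclass on lxml element classes); the class names are borrowed purely for familiarity, and the list attribute is declared directly as `t_lst` rather than derived from `t`.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


class ZeroOrOne:
    """Toy descriptor: the first child with this name, or None."""

    def __init__(self, local_name):
        self._tag = "{%s}%s" % (W_NS, local_name)

    def __get__(self, obj, objtype=None):
        return self if obj is None else obj.element.find(self._tag)


class ZeroOrMore(ZeroOrOne):
    """Toy descriptor: a list of every child with this name."""

    def __get__(self, obj, objtype=None):
        return self if obj is None else obj.element.findall(self._tag)


class ToyRun:
    # Only names declared here become attributes on the wrapper.
    rPr = ZeroOrOne("rPr")
    t_lst = ZeroOrMore("t")

    def __init__(self, element):
        self.element = element


xml = '<w:r xmlns:w="%s"><w:rPr/><w:t>a</w:t><w:drawing/><w:t>b</w:t></w:r>' % W_NS
run = ToyRun(ET.fromstring(xml))
print(run.rPr is not None)          # rPr is declared and present in the XML
print([t.text for t in run.t_lst])  # every w:t child, in order
print(hasattr(run, "drawing"))      # w:drawing exists in the XML but is undeclared
```

The last line is the situation an undeclared child element ends up in: the element is in the tree and reachable by index or XPath, but no attribute was generated for it.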

mfripp commented 3 years ago

@scanny Thanks for your advice on this. I'd gotten part of the way down the path you recommended, but I was having trouble going further. When I use type(r), I get docx.oxml.text.run.CT_R – close enough, but a little different. Similarly, for type(r.rPr), I get docx.oxml.text.font.CT_RPr.

Where things get tough is when I use dir(r): then I just get TypeError: descriptor '__dict__' for '_OxmlElementBase' objects doesn't apply to a 'CT_R' object, and I get similar results for any of the other elements under run._r. This makes it hard to see what methods or attributes these objects have.

I previously tried looking at the source code you mentioned, but I ran into a problem there too: CT_R.add_drawing() calls self._add_drawing(), but that doesn't seem to be defined explicitly anywhere. I think it is some kind of generic code that gets adapted to add various types of methods dynamically. But once the factories started showing up, I had trouble following the thread for how this works.

I had code that sort of worked for getting an image from a run. I also found that Word (at least my version) never puts an image in the same run as text or another image. So that makes them pretty easy to work with, i.e., I don't have to worry about whether there is text before or after the image. This is what that code looked like:

for e in run._r:
    if 'drawing' in e.tag:
        rId = e[0].graphic.graphicData.pic.blipFill.blip.embed
        image_part = doc.part.related_parts[rId]
        ...

The part that was annoying was having to check all the sub-elements of run._r and then assume the inline element is in position 0 in the drawing element. This is the xml tree for this run:

<w:r xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" [many more xmlns] w:rsidR="009F5905">
  <w:rPr>
    <w:noProof/>
  </w:rPr>
  <w:drawing>
    <wp:inline distT="0" distB="0" distL="0" distR="0" wp14:anchorId="1AE1DFDD" wp14:editId="6075C513">
      <wp:extent cx="4504623" cy="2979436"/>
      <wp:effectExtent l="0" t="0" r="0" b="0"/>
      <wp:docPr id="1" name="Picture 1" descr="Shape&#10;&#10;Description automatically generated"/>
      <wp:cNvGraphicFramePr>
        <a:graphicFrameLocks xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" noChangeAspect="1"/>
      </wp:cNvGraphicFramePr>
      <a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
        <a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
          <pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
            <pic:nvPicPr>
              <pic:cNvPr id="5" name="Picture 5" descr="Shape&#10;&#10;Description automatically generated"/>
              <pic:cNvPicPr/>
            </pic:nvPicPr>
            <pic:blipFill>
              <a:blip r:embed="rId4"/>
              <a:stretch>
                <a:fillRect/>
              </a:stretch>
            </pic:blipFill>
            <pic:spPr>
              <a:xfrm>
                <a:off x="0" y="0"/>
                <a:ext cx="4577094" cy="3027370"/>
              </a:xfrm>
              <a:prstGeom prst="rect">
                <a:avLst/>
              </a:prstGeom>
            </pic:spPr>
          </pic:pic>
        </a:graphicData>
      </a:graphic>
    </wp:inline>
  </w:drawing>
</w:r>

I can access run._r.rPr no problem, but when I try to access run._r.drawing, I get AttributeError: 'CT_R' object has no attribute 'drawing'. So I have to use run._r[1] instead. Similarly, if I try to access run._r[1].inline, I get an attribute error, so I have to use run._r[1][0] instead. From there the dot notation works OK: I can access run._r[1][0].graphic.graphicData.pic.blipFill.blip.embed successfully.

I've now tried run._r.xpath(".//wp:inline") and that does a good job of finding the inline object(s), and then I can work from there. So now I have pretty clean code that looks like this:

for para in doc.paragraphs:
    for run in para.runs:
        for inline in run._r.xpath("w:drawing/wp:inline"):
            width = float(inline.extent.cx) # in EMUs https://startbigthinksmall.wordpress.com/2010/01/04/points-inches-and-emus-measuring-units-in-office-open-xml/
            height = float(inline.extent.cy)
            rId = inline.graphic.graphicData.pic.blipFill.blip.embed
            image = doc.part.related_parts[rId].image
            filename = image.filename
            with open(filename, 'wb') as f:  # make a copy in the local dir
                f.write(image.blob)
            print(', '.join([
                f"saved image {filename}",
                f"type {image.content_type}",
                f"px: {image.px_height} x {image.px_width}",
                f"size in document: {height} x {width}",
            ]))
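
Since the extents come back in EMUs, a small conversion helper makes the printed sizes easier to read. The constants are the standard ones (914400 EMU per inch, 12700 per point, 360000 per centimetre); the function names are hypothetical, and python-docx's own `docx.shared.Length` type appears to offer equivalent conversions as properties if you prefer to stay inside the library.

```python
EMU_PER_INCH = 914400
EMU_PER_POINT = 12700
EMU_PER_CM = 360000


def emu_to_inches(emu):
    """Convert English Metric Units to inches."""
    return emu / EMU_PER_INCH


def emu_to_points(emu):
    """Convert English Metric Units to typographic points (1/72 inch)."""
    return emu / EMU_PER_POINT


def emu_to_cm(emu):
    """Convert English Metric Units to centimetres."""
    return emu / EMU_PER_CM
```

For example, `emu_to_inches(inline.extent.cx)` would give the displayed width of the picture in inches.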

scanny commented 3 years ago

Hmm, not sure why the dir(r) bit isn't working, maybe there's a change in how metaclasses work in Python 3 or maybe I just forgot something. I don't work with metaclasses a lot and probably not at all since writing this quite a while ago now.

Each name defined as a child element or attribute will get certain helper methods and properties. Like an attribute "foo" will cause a .foo property to be added to the custom element class. Child elements will get methods like ._add_foo, ._remove_foo, .get_or_add_foo() etc. I don't remember all the names exactly, I usually look for a similar example in another element class and that's enough to remind me. In a pinch you can read the metaclass code. It's not a quick read but will yield to close study.

Glad it seems like you got things working. I'd definitely use the .xpath() call too for anything that's not a direct descendant of my entry point.