python-openxml / python-docx

Create and modify Word documents with Python
MIT License

Drawings are not parsed, and text inside drawings is not available. #380

Open sven-oly opened 7 years ago

sven-oly commented 7 years ago

I'm interested in getting all text in a docx file to transliterate it, e.g., Serbian Cyrillic to Serbian Latin text.

The current implementation doesn't seem to handle drawings within the docx file.

scanny commented 7 years ago

Can you be more specific?

sven-oly commented 7 years ago

Yes. I have a docx file that contains several drawing objects, each of which has text fields. When I view the XML for the file, I can see the text fields, and can identify the XML structures. I was hoping that python-docx would be able to find these text fields and let me transliterate the text in place.

Does this help? I can provide a sample.

scanny commented 7 years ago

What are the drawings? Just like JPEGs or something?

Also, are they inline or floating?

And what do you mean they "have text fields"? Just that the picture has text in it?

sven-oly commented 7 years ago

Thanks for your questions. I'm not an expert, but I received a docx file from a colleague who wanted me to extract the text from all parts of the file.

I found that some of the text was missing from what python-docx returned. Investigating further, I found that Word can include Drawing components, which are not just JPEGs. Drawings are structured objects that can include text fields, formatting data, and other fields. The text is not rendered into pixels; it is stored as formatted text data, much as in paragraphs.

The drawings that I saw appear to be floating rather than inline.

See this for a high-level description of Drawing: https://support.office.com/en-us/article/Add-a-drawing-to-a-document-348a8390-c32e-43d0-942c-b20ad11dea6f

The XML of the docx could be parsed to get the text objects, formatting, etc., but the current python-docx does not deal with that structure.

I can provide a simple example of an embedded drawing in a .docx file.
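
As a rough illustration of parsing the XML directly, here is a minimal sketch that uses only the Python standard library, not python-docx. It assumes the drawing text lives in `w:txbxContent` elements inside `word/document.xml`, which is where Word stores text box and shape text in the files I have seen. The function names `drawing_text` and `_text_boxes` are made up for this example. It also assumes that `mc:Fallback` subtrees only repeat text already present in the matching `mc:Choice`, so it skips them.

```python
import zipfile
import xml.etree.ElementTree as ET

# Clark-notation prefixes for the two namespaces this sketch relies on.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
MC = "{http://schemas.openxmlformats.org/markup-compatibility/2006}"


def _text_boxes(element):
    """Yield w:txbxContent elements below `element`, skipping mc:Fallback."""
    for child in element:
        if child.tag == MC + "Fallback":
            # Assumption: the fallback repeats text already held in mc:Choice.
            continue
        if child.tag == W + "txbxContent":
            yield child
        else:
            yield from _text_boxes(child)


def drawing_text(docx_file):
    """Return one string per paragraph found inside a text box or shape.

    `docx_file` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for box in _text_boxes(root):
        for para in box.iter(W + "p"):
            # A paragraph's text is split across runs; join the w:t pieces.
            paragraphs.append("".join(t.text or "" for t in para.iter(W + "t")))
    return paragraphs
```

This only reads the main document part; headers and footers are stored in separate parts of the package and would need the same treatment. Text boxes nested inside other text boxes are not handled specially and may be reported more than once.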

scanny commented 7 years ago

Ok, well, the short answer is there's no API support for this in python-docx.

I have an idea what the drawings are now from your description, and I expect they're roughly the same as PowerPoint auto-shapes, which can indeed contain text. So I expect the text is in there somewhere, but it would require some fairly fancy technical work to extract it.

sven-oly commented 7 years ago

I don't think it's too much harder than the current handling of text in paragraphs, though. I'd be interested in working on this, though I'd have to review the current project to understand how it works.

DKWoods commented 7 years ago

This sounds like WordArt to me. I was looking at different Drawing elements recently as I worked to figure out how to read images from a Word file, and this sounds exactly like what I saw when I looked at what WordArt puts into a .docx file.

David

-- David K. Woods, Ph.D. Researcher, Transana Lead Developer https://www.transana.com

scanny commented 7 years ago

@DKWoods Word also supports "PowerPoint-style" drawings. If you look on the menu under Insert > Shape... you see it. The shapes it inserts are the same DrawingML shapes that can appear on PowerPoint slides.
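
To check which kind of drawing a given file actually contains (picture or shape, inline or floating), something like the following sketch may help. It uses only the standard library and is not part of python-docx. The function name `drawing_kinds` and the `KNOWN_KINDS` table are made up for this example, and the table lists only the `a:graphicData` URIs I am aware of; anything else is reported by its raw URI.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

A = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
WP = "{http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing}"

# Illustrative, incomplete mapping from graphicData URI to a short label.
KNOWN_KINDS = {
    "http://schemas.openxmlformats.org/drawingml/2006/picture": "picture",
    "http://schemas.microsoft.com/office/word/2010/wordprocessingShape": "shape",
    "http://schemas.openxmlformats.org/drawingml/2006/chart": "chart",
}

# wp:inline holds inline drawings, wp:anchor holds floating ones.
PLACEMENTS = {"inline": "inline", "anchor": "floating"}


def drawing_kinds(docx_file):
    """Count drawings in word/document.xml by (placement, kind)."""
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    kinds = Counter()
    for tag, placement in PLACEMENTS.items():
        for container in root.iter(WP + tag):
            data = container.find(A + "graphic/" + A + "graphicData")
            if data is None:
                continue
            uri = data.get("uri", "")
            kinds[(placement, KNOWN_KINDS.get(uri, uri))] += 1
    return kinds
```

A result with `("floating", "shape")` entries would suggest shapes of the Insert > Shape kind, while `("inline", "picture")` entries would be ordinary images.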