Open sven-oly opened 7 years ago
Can you be more specific?
Yes. I have a docx file that contains several drawing objects, each of which has text fields. When I view the XML for the file, I can see the text fields, and can identify the XML structures. I was hoping that python-docx would be able to find these text fields and let me transliterate the text in place.
Does this help? I can provide a sample.
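To make "transliterate the text in place" concrete, here is a minimal sketch. It assumes the drawing text lives in `w:t` elements under a `w:txbxContent` element inside `word/document.xml`, which I have not confirmed against the file in question. The function name `convert_textbox_runs` is made up for illustration and is not part of python-docx.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def convert_textbox_runs(root, convert):
    """Apply convert() to each non-empty w:t under a w:txbxContent.

    Mutates the tree in place and returns the number of w:t elements changed.
    Hypothetical helper: assumes drawing text is stored under w:txbxContent.
    """
    changed = 0
    for content in root.iter(f"{{{W_NS}}}txbxContent"):
        for t in content.iter(f"{{{W_NS}}}t"):
            if t.text:
                t.text = convert(t.text)
                changed += 1
    return changed
```

Writing the modified tree back into the .docx is left out on purpose: ElementTree may rename namespace prefixes when it serializes, so a parser that preserves them (lxml, which python-docx already depends on) is probably the safer choice for that step. The function only relies on `.iter()` and `.text`, so it should work on an lxml tree as well.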
On Mon, Mar 27, 2017 at 4:53 PM, Steve Canny notifications@github.com wrote:
Can you be more specific?
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/python-openxml/python-docx/issues/380#issuecomment-289620449, or mute the thread https://github.com/notifications/unsubscribe-auth/AKQNEvj6FdRTaaeJmf2DzBVemBNie4Fyks5rqEwTgaJpZM4Mq9jD .
What are the drawings? Just like JPEGs or something?
Also, are they inline or floating?
And what do you mean they "have text fields"? Just that the picture has text in it?
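One way to answer the inline-or-floating question without opening Word is to count the drawing wrappers in `word/document.xml`. This is a rough sketch using only the standard library. It assumes inline drawings appear as `wp:inline` and floating ones as `wp:anchor`, and it does not count older VML-only shapes. `count_drawings` is a hypothetical name.

```python
import xml.etree.ElementTree as ET
import zipfile

WP_NS = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"


def count_drawings(document_xml):
    """Count inline and floating (anchored) drawings in document.xml content."""
    root = ET.fromstring(document_xml)
    inline = sum(1 for _ in root.iter(f"{{{WP_NS}}}inline"))
    floating = sum(1 for _ in root.iter(f"{{{WP_NS}}}anchor"))
    return {"inline": inline, "floating": floating}


def count_drawings_in_docx(path):
    """Read word/document.xml from a .docx (a zip archive) and count drawings."""
    with zipfile.ZipFile(path) as zf:
        return count_drawings(zf.read("word/document.xml"))
```

Calling `count_drawings_in_docx("sample.docx")` (the file name is a placeholder) returns a small dict with the two counts.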
Thanks for your questions. I'm not an expert, but I received a docx file from a colleague who wanted me to extract the text fields from all parts of the file.
I found that python-docx was not returning some of the text. Investigating further, I found that Word can include Drawing components, which are not just JPEGs. Drawings are structured objects that can include text fields, formatting data, and other fields. The text is not rendered into the pixels; it is formatted text data, much as in paragraphs.
The drawings that I saw appear to be floating rather than inline.
See this for a high-level description of Drawing: https://support.office.com/en-us/article/Add-a-drawing-to-a-document-348a8390-c32e-43d0-942c-b20ad11dea6f
The XML of the docx could be parsed to get the text objects, formatting, etc., but the current python-docx does not deal with that structure.
I can provide a simple example of an embedded drawing in a .docx file.
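As a rough illustration of parsing the XML directly, the sketch below pulls paragraph text out of text boxes with the standard library only. It assumes the text sits under `w:txbxContent` elements in `word/document.xml`. The function names are hypothetical and none of this is python-docx API. It may return the same text more than once if the file also stores a compatibility copy of a shape.

```python
import xml.etree.ElementTree as ET
import zipfile

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def textbox_paragraphs(document_xml):
    """Return the text of each paragraph found inside a w:txbxContent."""
    root = ET.fromstring(document_xml)
    texts = []
    for content in root.iter(f"{{{W_NS}}}txbxContent"):
        for para in content.iter(f"{{{W_NS}}}p"):
            # A paragraph's text is the concatenation of its w:t runs.
            runs = [t.text or "" for t in para.iter(f"{{{W_NS}}}t")]
            texts.append("".join(runs))
    return texts


def textbox_paragraphs_from_docx(path):
    """Open a .docx as a zip archive and extract text-box paragraphs."""
    with zipfile.ZipFile(path) as zf:
        return textbox_paragraphs(zf.read("word/document.xml"))
```

Body paragraphs outside the text boxes are skipped here, since python-docx already exposes those through `Document.paragraphs`.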
Ok, well, the short answer is there's no API support for this in python-docx.
I have an idea what the drawings are now from your description, and I expect they're roughly the same as PowerPoint auto-shapes, which can indeed contain text. So I expect the text is in there somewhere, but it would require some fairly fancy technical work to extract it.
I don't think it's much harder than the current handling of text in paragraphs. I'd be interested in working on this, although I'd first have to review the current project to understand how it works.
This sounds like WordArt to me. I was looking at different Drawing elements recently as I worked to figure out how to read images from a Word file, and this sounds exactly like what I saw when I looked at what WordArt puts into a .docx file.
David
-- David K. Woods, Ph.D. Researcher, Transana Lead Developer https://www.transana.com
@DKWoods Word also supports "PowerPoint-style" drawings. If you look on the menu under Insert > Shape... you see it. The shapes it inserts are the same DrawingML shapes that can appear on PowerPoint slides.
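A practical wrinkle with these shapes: as far as I can tell, recent Word versions wrap each one in `mc:AlternateContent`, with the DrawingML shape under `mc:Choice` and a VML copy under `mc:Fallback`, so a naive search can find the text twice. Here is a hedged sketch that skips the fallback branch when collecting text. `shape_texts` is a made-up name, and the wrapping is an assumption about how the file was saved.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
MC_NS = "http://schemas.openxmlformats.org/markup-compatibility/2006"


def shape_texts(document_xml):
    """Collect paragraph text from text boxes, ignoring mc:Fallback copies."""
    texts = []

    def walk(elem, in_textbox):
        if elem.tag == f"{{{MC_NS}}}Fallback":
            return  # assumed to duplicate the mc:Choice content
        if elem.tag == f"{{{W_NS}}}txbxContent":
            in_textbox = True
        if in_textbox and elem.tag == f"{{{W_NS}}}p":
            # Join all runs of this text-box paragraph and stop descending.
            texts.append("".join(t.text or "" for t in elem.iter(f"{{{W_NS}}}t")))
            return
        for child in elem:
            walk(child, in_textbox)

    walk(ET.fromstring(document_xml), False)
    return texts
```

If the goal is to rewrite text rather than read it, the fallback copy probably needs the same change so that older readers show consistent text.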
I'm interested in getting all text in a docx file to transliterate it, e.g., Serbian Cyrillic to Serbian Latin text.
The current implementation doesn't seem to handle drawings within the docx file.
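For reference, the transliteration step itself is simple once the text is reachable. Below is a minimal sketch of a Serbian Cyrillic to Latin mapping. The function name is illustrative, and the capitalization rule is a simplification: an uppercase digraph letter becomes title case (Љ gives "Lj"), which is not ideal for words written entirely in capitals.

```python
# Serbian Cyrillic to Latin, keyed by lowercase letter.
# Three letters (љ, њ, џ) map to Latin digraphs.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}


def transliterate(text):
    """Convert Serbian Cyrillic letters to Latin; leave other characters alone."""
    out = []
    for ch in text:
        lower = ch.lower()
        latin = CYR_TO_LAT.get(lower)
        if latin is None:
            out.append(ch)                  # not a Serbian Cyrillic letter
        elif ch != lower:
            out.append(latin.capitalize())  # uppercase input: "Љ" becomes "Lj"
        else:
            out.append(latin)
    return "".join(out)
```

Because the mapping works one character at a time, it can be applied to each run of text separately, even when Word has split a word across several runs.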