python-openxml / python-docx

Create and modify Word documents with Python
MIT License

Replace text in paragraph keeping the runs object and styles #415

Open KGomes27 opened 7 years ago

KGomes27 commented 7 years ago

Hi! I'm working on a translation app and would like to take a docx file and create a new one with the same styles and runs, but with the text in a different language. Is there any way I could do this?

I read that changing a paragraph's text removes all styles and runs from it, so this is not an option for me, since I need to keep them in my new document.

The other approach I tried was using the run objects to create my new file. The problem is that I have to translate the text of each run separately, and the final text doesn't make any sense. Again, not an option for me, because it's a translation app and I need my final text to make sense, hahaha.

Any other ideas? Thanks in advance!

DKWoods commented 7 years ago

That's the whole point of runs. They tie content and formatting together.

Think of each run within a paragraph as an object containing text and formatting. During translation, the order of these runs within the paragraph might change so that the text makes sense, but the FORMATTING would change in exactly the same order, wouldn't it? Whatever words were highlighted, underlined, (or whatever), would still want to be highlighted in the same way in the new run position within the paragraph. So during the translation process, you're not re-ordering just text, but text-and-formatting, which is to say, runs.

You might need to break some of your runs into multiple parts because of the subtleties of different word orders in different languages, but you can figure that out and carry the formatting along with the words into the multiple new parts.

So "read" all the runs in a paragraph, re-arrange them during translation, and build the new paragraph in the new document out of the re-arranged runs.

Good luck.

David
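
A minimal sketch of the read, re-arrange, rebuild approach described above. `RunSpec`, `read_runs`, `write_runs`, and `translate_paragraph` are hypothetical helper names, and `translate_runs` is a callback you would have to supply yourself. The code assumes paragraph and run objects that behave like python-docx's (`paragraph.runs`, `paragraph.add_run()`, and the `text`, `bold`, `italic`, `underline` attributes on a run). Only those three formatting properties are carried over; fonts, colors, character styles, and hyperlinks would need extra handling.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RunSpec:
    """Text plus the formatting copied from one run (hypothetical helper)."""
    text: str
    bold: Optional[bool] = None
    italic: Optional[bool] = None
    underline: object = None  # python-docx allows None, a bool, or an enum member


def read_runs(paragraph) -> List[RunSpec]:
    """Snapshot every run of a paragraph as text-and-formatting."""
    return [RunSpec(r.text, r.bold, r.italic, r.underline) for r in paragraph.runs]


def write_runs(paragraph, specs: List[RunSpec]) -> None:
    """Append one new run per spec, carrying the formatting along with the text."""
    for spec in specs:
        run = paragraph.add_run(spec.text)
        run.bold = spec.bold
        run.italic = spec.italic
        run.underline = spec.underline


def translate_paragraph(
    src, dst, translate_runs: Callable[[List[RunSpec]], List[RunSpec]]
) -> None:
    """Rebuild `dst` from the runs of `src`.

    `translate_runs` (an assumption, not part of python-docx) receives the
    source runs and returns them translated, re-ordered, or split as needed.
    """
    write_runs(dst, translate_runs(read_runs(src)))
```

With python-docx this could be driven by opening the source with `docx.Document(path)`, creating an empty `docx.Document()` for the output, and calling `translate_paragraph(p, out.add_paragraph(), my_translate)` for each `p` in `source.paragraphs`, where `my_translate` is your own function.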

UttamDwivedi commented 5 years ago

@KGomes27 Were you able to solve your problem? If so, could you please share the solution? I am trying to deal with a similar problem as well.

KGomes27 commented 5 years ago

@UttamDwivedi Hi, I wasn't able to find a good enough solution. What I have done for now is follow David's recommendation, quoted here, to at least carry some of the original paragraph's styles over to the new paragraph:

"So 'read' all the runs in a paragraph, re-arrange them during translation, and build the new paragraph in the new document out of the re-arranged runs."

Hope this helps you in any way and if you find a solution please share it on this issue!

scanny commented 3 years ago

Here is some code I developed in response to an SO question that could be relevant:
https://github.com/python-openxml/python-docx/issues/980

Basically, it allows you to isolate a range of characters in a paragraph into its own single run. Without knowing more about your translation algorithm I can't say exactly how it might fit in, but the crudest option would be to call it repeatedly, once for each word in a paragraph, so that every word occupies its own distinct run with the formatting of the run it originally belonged to.

In any case, if you need to manipulate run boundaries, the sub-functions there are likely to be instructive.
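
To illustrate what "isolating a range of characters into its own run" means, here is a simplified sketch. It is not the code from the linked issue: it works on plain `(text, formatting)` tuples standing in for real runs, and `split_at` and `isolate_range` are hypothetical names. It also assumes that when the range spans several runs, the merged run takes the formatting of the first run it overlaps.

```python
from typing import Any, List, Tuple

Run = Tuple[str, Any]  # (text, formatting): a stand-in for a real run object


def split_at(runs: List[Run], offset: int) -> List[Run]:
    """Return the runs with a run boundary guaranteed at character `offset`."""
    out: List[Run] = []
    pos = 0
    for text, fmt in runs:
        end = pos + len(text)
        if pos < offset < end:
            cut = offset - pos
            # Both halves keep the formatting of the run they came from.
            out.append((text[:cut], fmt))
            out.append((text[cut:], fmt))
        else:
            out.append((text, fmt))
        pos = end
    return out


def isolate_range(runs: List[Run], start: int, end: int) -> List[Run]:
    """Make characters [start, end) of the paragraph occupy exactly one run."""
    total = sum(len(text) for text, _ in runs)
    if not 0 <= start < end <= total:
        raise ValueError("range must lie inside the paragraph text")
    runs = split_at(split_at(runs, start), end)
    out: List[list] = []
    merged = None
    pos = 0
    for text, fmt in runs:
        run_end = pos + len(text)
        if text and start <= pos and run_end <= end:
            if merged is None:
                merged = [text, fmt]  # assumption: first run's formatting wins
                out.append(merged)
            else:
                merged[0] += text
        else:
            out.append([text, fmt])
        pos = run_end
    return [(text, fmt) for text, fmt in out]
```

Calling `isolate_range` once per word, with that word's character offsets, would give the "one run per word" layout mentioned above.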

sinhnv1991 commented 2 years ago

Hi everyone, I am looking for a solution. I am also working on a translation app, like smartcart.com: it extracts the text, translates it into another language, and creates a new doc file with the result while keeping the text formatting. My issue is this: I unzipped the docx and got a document.xml file containing:

Vietnam's National Assembly has passed a cybersecurity law requiring companies such as Alphabet Inc.'s Google and Facebook Inc. to store all data of Vietnam-based users in the country and open local offices. The measure has drawn rare dissent from some lawmakers and government leaders as well as local tech groups, who sent a petition to the legislature that warned it would hurt the economy.

I want to split the text on newline breaks or ". " into 3 segments:

1 => •
2 => Vietnam's National Assembly has passed a cybersecurity law requiring companies such as Alphabet Inc.'s Google and Facebook Inc. to store all data of Vietnam-based users in the country and open local offices.
3 => The measure has drawn rare dissent from some lawmakers and government leaders as well as local tech groups, who sent a petition to the legislature that warned it would hurt the economy.

The JSON result from the smartcart API for segment 3 looks like this:

{ "segmentId": 4079, "text": "The measure has drawn rare dissent from some lawmakers and government leaders as well as local tech groups, who sent a petition to the legislature that warned it would hurt the economy.", "languageId": 6153, "tags": [ { "tagNumber": 1, "tagType": 0, "position": 0, "isSubtitleTag": false, "isVirtual": true, "formatting": null, "isRequired": true, "visualization": null }, { "tagNumber": 1, "tagType": 1, "position": 11, "isSubtitleTag": false, "isVirtual": false, "formatting": null, "isRequired": true, "visualization": null } ], "placeholders": [] }

But the words in the XML paragraph are split like this:

1 => •
2 => Vietnam's National Assembly has passed a cybersecurity law requiring companies such as Alphabet Inc.'s Google and Facebook Inc. to store all data of Vietnam-based users in the country and open local offices. The measure
3 => has drawn rare dissent from some lawmakers and government leaders as well as local tech groups, who sent a petition to the legislature that warned it would hurt the economy.

My translation solution extracts the text into 4 segments, sends each one to the Google translation API, and receives the result:

1 => •
2 => Vietnam's National Assembly has passed a cybersecurity law requiring companies such as Alphabet Inc.'s Google and Facebook Inc. to store all data of Vietnam-based users in the country and open local offices.
3 => The measure
4 => has drawn rare dissent from some lawmakers and government leaders as well as local tech groups, who sent a petition to the legislature that warned it would hurt the economy.

For display, I want to concatenate the translated results of segments 3 and 4. When creating the new doc, I want to concatenate segment 3 onto segment 2, because both are inside a hyperlink. The original hyperlink text is: "Vietnam's National Assembly has passed a cybersecurity law requiring companies such as Alphabet Inc.'s Google and Facebook Inc. to store all data of Vietnam-based users in the country and open local offices. (segment 2) The measure (segment 3)", and I want the translated result to match it too. Please help me find a solution. Thanks, all.
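
One possible way to approach the mismatch between sentence segments and run boundaries described above is to segment the concatenated paragraph text, translate whole sentences, and separately record which runs each sentence overlaps. The sketch below is only an illustration under stated assumptions: `sentence_spans` and `runs_overlapping` are hypothetical names, the run texts are plain strings, and the sentence splitter is a naive heuristic (a period, whitespace, then an uppercase letter), chosen so that "Inc. to store" is not split.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def sentence_spans(text: str) -> List[Span]:
    """Naive sentence segmentation: split after '.' when whitespace and an
    uppercase letter follow. A heuristic, not a real sentence tokenizer."""
    spans: List[Span] = []
    start = 0
    for match in re.finditer(r"(?<=\.)\s+(?=[A-Z])", text):
        spans.append((start, match.start()))
        start = match.end()
    if start < len(text):
        spans.append((start, len(text)))
    return spans


def runs_overlapping(run_texts: List[str], span: Span) -> List[int]:
    """Indices of the runs that share at least one character with `span`."""
    start, end = span
    hits: List[int] = []
    pos = 0
    for index, text in enumerate(run_texts):
        run_end = pos + len(text)
        if pos < end and start < run_end:
            hits.append(index)
        pos = run_end
    return hits
```

For a paragraph whose first run ends partway into the second sentence, as in the hyperlink case above, the second sentence's span would report both runs, which tells you that its translation has to be distributed across two runs rather than translated run by run.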