Enhancement Request: Block-Level HTML Translation

mosugi commented 1 year ago

Hello esteemed developers,

First of all, I'd like to express my sincere gratitude for your hard work in maintaining and improving this great project. Your dedication is invaluable to the entire community.

Recently, I've encountered an issue with translating HTML content into Japanese. As you might know, the syntax and structure of Japanese differ significantly from English. This leads to inaccurate and often nonsensical translations when translation is done at the level of inline HTML elements.

To illustrate, let's consider the following example:

<p>She is <em>looking forward</em> to your visit.</p>

The translated result is as follows:

<p>彼女は<em>楽しみ</em>ご来場ありがとうございました。</p>

Translated back into English, it looks like this:

<p>She <em>enjoyed</em> the show, thank you for coming.</p>

Translating this on an element-by-element basis would not yield the intended meaning in Japanese.

Therefore, I propose a feature enhancement where translation happens at the block HTML element level, instead of at the inline level, especially when dealing with languages like Japanese.

For example, the translation feature could be enhanced to behave as follows:

<div>彼女はあなたの訪問を<em>心待ちに</em>している。</div>

Translated back into English, it looks like this:

<div>She is <em>looking forward</em> to your visit.</div>

The entire div block is translated as a single unit, which will lead to a more accurate translation.

Both Google Translate and DeepL support translating strings that contain HTML.

It would be great to have an option in the settings that translates inline elements together with their enclosing block element.

I hope this suggestion is taken into consideration and am looking forward to seeing how this project continues to evolve.

Thank you again for your efforts and dedication.

Best regards.

vitonsky commented 1 year ago

Hi, thank you for the feedback. Language diversity in feedback is very important for language tools!

The problem is clear to me, thanks for the detailed explanation with examples.

An important principle of Linguist's architecture is that Linguist does not depend on translator implementations. This makes it possible to use any translator implementation together with all of Linguist's features. If we bind to features of Google Translate or another service, some Linguist features will not work with other translators that do not support translating HTML tags or that implement it differently.

Thus, Linguist is a platform that implements all features itself and lets you use them with any trivial translator implementation. All a translator must do is translate a single string and translate an array of strings. That is easy to implement.

To solve the problem you mention above, we have to improve Linguist's own behavior, not just use Google Translate's features to translate HTML.

Let's think about and discuss how to implement behavior that translates Japanese texts better.

It is a good idea to implement an optional feature that translates texts at the block level and to enable it automatically for Japanese. Can you please send me some links about this approach? Maybe it is a popular idea and there are guides on the internet about how to translate HTML with Japanese text. Your opinion is the most important, because I can't speak Japanese and can't measure the quality of the results.

For now I have a few questions that we must answer to implement this feature:

How to handle HTML? We have to describe the formal steps of the algorithm.
What is a block? You mentioned "block" above; if it is significant for the algorithm, we have to define what a block is. If you mean an HTML block, site developers may change CSS styles and turn any HTML element on the page into a block with the display: block; CSS property. We have to decide how to handle these cases: should we consider CSS styles or not?
How to join the text and how to split it back?

About the last point:

Translators have 2 methods: translate to translate a single string and translateBatch to translate several texts.

If we detect a text block with 3 segments (彼女はあなたの訪問を, 心待ちに, している。), how do we translate these segments?

We can join these segments into one string, but then we will get one string as the response, and it is not clear how to split it and insert the proper segments into their HTML elements.

On the other hand, we can use the translateBatch method. We would call it with 3 texts and the translator would translate the 3 segments as one context. However, I'm not sure all translators will translate the 3 texts as one context!

Actually, some translators implement translateBatch as multiple calls to translate, so the sentence context will not be shared.

We can try to use translateBatch to translate text segments, but it may not work for some translator implementations, even for Google Translate. So, if you have any ideas about how to translate 3 texts and then split the result back into 3 segments, feel free to share your thoughts!
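
To make the concern about translateBatch concrete, here is a minimal TypeScript sketch. The Translator interface and the NaiveBatchTranslator class are hypothetical illustrations, not Linguist's actual types. They assume only the two methods named above, and they are synchronous for brevity, while a real translator would likely be asynchronous.

```typescript
// Hypothetical translator shape based on the two methods named above.
// Real signatures in Linguist may differ; this is an assumption for illustration.
interface Translator {
  translate(text: string, from: string, to: string): string;
  translateBatch(texts: string[], from: string, to: string): string[];
}

class NaiveBatchTranslator implements Translator {
  // Every string the stand-in "service" received, one entry per request.
  readonly requests: string[] = [];

  translate(text: string, from: string, to: string): string {
    this.requests.push(text);
    // Placeholder for a real translation call.
    return `[${from}->${to}] ${text}`;
  }

  // Batch implemented as repeated single calls: each segment is sent alone,
  // so the service never sees the full sentence.
  translateBatch(texts: string[], from: string, to: string): string[] {
    return texts.map((text) => this.translate(text, from, to));
  }
}

const translator = new NaiveBatchTranslator();
const segments = ["彼女はあなたの訪問を", "心待ちに", "している。"];
const translated = translator.translateBatch(segments, "ja", "en");
console.log(translator.requests); // three separate requests, no shared context
```

With a translator like this, calling translateBatch gives no better context than translating each inline element on its own, so the block-level feature cannot rely on it alone.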

mosugi commented 1 year ago

Thanks for your immediate and detailed response. I will describe my idea and its background.

Relationship Between Block Elements and Inline Elements in Translation

By "block," we are referring to HTML's block elements. For instance, <p> denotes a paragraph. When writing HTML in Japanese, translating by this unit results in a natural translation outcome.
- https://developer.mozilla.org/en-US/docs/Glossary/Block-level_content
The opposite of a block element is an inline element. As noted in the issue, translating at this level results in word-by-word translation, which can lead to unnatural outcomes.
- https://developer.mozilla.org/en-US/docs/Glossary/Inline-level_content

Proposed Solution

We propose treating segments as block elements, and allowing translation at this level.
- Depending on the structure of a webpage's HTML, a page utilizing HTML appropriately will have its text divided into block elements.
- Based on my research while developing a translation tool, I found that translating at the level of elements such as p, h1-h6, ul, ol, li, section, and article results in natural translation outcomes.
We add a function interface like translateConcat to the custom translator in Linguist, allowing it to accept an array of multiple strings and return a single string.
- By accepting an array of HTML as well as strings, we can prevent the loss of HTML from the translation results. This will make the APIs of Google Translate and DeepL Translate more usable.
We allow the setting of the HTML unit to send to translateConcat in Linguist.
- By sending in units like , translateConcat will be able to consider multiple strings for translation.
We implement translateConcat in the custom translator.
- HTML handling can be implemented to suit any translation system. For example, even translation tools that do not support HTML (unlike Google Translate and DeepL) can translate at the sentence level instead of the word level by replacing HTML tags with special characters before translation and then reverting the special characters back to HTML afterwards.
translateConcat will replace the original strings with the translation results.
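
The placeholder idea in the steps above could be sketched roughly as follows. This is only an illustration under assumptions: translateConcat is the hypothetical function proposed here, maskTags and unmaskTags are helper names invented for the sketch, the ⟦n⟧ placeholder format is arbitrary, and toyTranslate stands in for a real translator that happens to keep the placeholders intact.

```typescript
// Replace each HTML tag with a numbered placeholder and remember the tags.
// The regular expression is deliberately simple and is not a full HTML parser.
function maskTags(html: string): { masked: string; tags: string[] } {
  const tags: string[] = [];
  const masked = html.replace(/<[^>]+>/g, (tag) => {
    tags.push(tag);
    return `⟦${tags.length - 1}⟧`;
  });
  return { masked, tags };
}

// Put the original tags back in place of the placeholders.
// Unknown placeholder numbers are left untouched.
function unmaskTags(text: string, tags: string[]): string {
  return text.replace(/⟦(\d+)⟧/g, (match, index) => tags[Number(index)] ?? match);
}

// Hypothetical translateConcat: join the parts, hide the tags,
// translate the whole sentence once, then restore the tags.
function translateConcat(parts: string[], translate: (text: string) => string): string {
  const { masked, tags } = maskTags(parts.join(""));
  return unmaskTags(translate(masked), tags);
}

// Toy stand-in for a translator that preserves placeholders (an assumption).
const toyTranslate = (text: string): string =>
  text === "She is ⟦0⟧looking forward⟦1⟧ to your visit."
    ? "彼女はあなたの訪問を⟦0⟧心待ちに⟦1⟧している。"
    : text;

const result = translateConcat(
  ["She is ", "<em>looking forward</em>", " to your visit."],
  toyTranslate,
);
console.log(result); // 彼女はあなたの訪問を<em>心待ちに</em>している。
```

Whether a given translation service keeps such placeholders unchanged is not guaranteed, so a real implementation would need to verify the output before restoring the tags.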

The Need to Replace Sentences by Translating at the Segment Level

As noted at https://en.wikipedia.org/wiki/Word_order, English's constituent word order is SVO, while Japanese uses SOV. Word-by-word translation in Linguist works smoothly from SVO to SVO, but when translating from SVO to SOV with an HTML tag in the middle of the sentence, the issue described previously arises.

vitonsky commented 1 year ago

Could you show an example of how to format a string so that the Google and Yandex translators translate it correctly and return a string in the same format?

My attempt with the format She is looking forward to your visit. for the Yandex translator:

As you can see, the format was broken and we can't parse the text back.

Keep in mind that the Google and Yandex translator APIs support an HTML mode that translates text with HTML tags properly. That is good, but we can't rely on this behavior in other translators. We have to invent an algorithm on our side: join a few segments into one text, translate it, and then be able to parse the segments back from the translation.
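
One possible shape for such an algorithm is sketched below in TypeScript. Everything here is an assumption for discussion: the "|||" delimiter, the translateSegments name, and the two function types are invented for the sketch, and it is synchronous for brevity. The point is the validation step: if the translator breaks the delimiters, we detect it and fall back to a batch call instead of inserting wrong segments.

```typescript
type TranslateOne = (text: string) => string;
type TranslateMany = (texts: string[]) => string[];

// Hypothetical delimiter. A real translator may alter or drop it,
// and the sketch does not handle segments that already contain it.
const DELIMITER = "|||";

function translateSegments(
  segments: string[],
  translateOne: TranslateOne,
  translateMany: TranslateMany,
): string[] {
  // Join all segments into one text so the translator sees the full sentence.
  const joined = segments.join(` ${DELIMITER} `);
  const parts = translateOne(joined)
    .split(DELIMITER)
    .map((part) => part.trim());

  // If the number of parts changed, the format was broken and the
  // mapping back to HTML elements is unreliable: fall back to batch mode.
  if (parts.length !== segments.length) {
    return translateMany(segments);
  }
  return parts;
}

// Toy translators for demonstration only.
const keepsDelimiter: TranslateOne = (text) => text.toUpperCase();
const breaksDelimiter: TranslateOne = (text) => text.replace(/\|/g, "");
const batchFallback: TranslateMany = (texts) => texts.map((t) => `batch:${t}`);

const input = ["she is", "looking forward", "to your visit."];
const joinedResult = translateSegments(input, keepsDelimiter, batchFallback);
const fallbackResult = translateSegments(input, breaksDelimiter, batchFallback);
console.log(joinedResult, fallbackResult);
```

The fallback keeps the current per-segment behavior for translators that break the format, so the feature would degrade instead of producing misplaced text.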

translate-tools / linguist