plutext / docx4j

JAXB-based Java library for Word docx, Powerpoint pptx, and Excel xlsx files
https://www.docx4java.org/

Preserving word order for Arabic, Hebrew #582

Open malthe opened 5 months ago

malthe commented 5 months ago

In Unicode text, consumers of RTL (right-to-left) text such as Arabic or Hebrew must identify the string direction, for example by observing the strong Unicode directional property of characters such as Arabic letters.

That is, if for example a paragraph begins with an Arabic letter, we should align the whole paragraph right and render the glyphs right to left as we progress logically through the string.
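A minimal sketch of that "first strong character" check, using only the JDK's `Character.getDirectionality`. The class and method names (`FirstStrong`, `startsRtl`) are hypothetical and not part of docx4j, and the sketch ignores explicit directional formatting characters:

```java
final class FirstStrong {
    /**
     * Returns true if the first strongly directional character in s is
     * right-to-left (Hebrew, Arabic, ...). Digits, spaces and punctuation
     * are skipped. Falls back to false (LTR) if no strong character exists.
     */
    static boolean startsRtl(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            switch (Character.getDirectionality(cp)) {
                case Character.DIRECTIONALITY_RIGHT_TO_LEFT:
                case Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC:
                    return true;
                case Character.DIRECTIONALITY_LEFT_TO_RIGHT:
                    return false;
                default:
                    break; // weak or neutral character: keep scanning
            }
            i += Character.charCount(cp);
        }
        return false;
    }
}
```

With this, a paragraph that begins with digits followed by Hebrew would still be classified as RTL, because digits are not strongly directional.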

In our testing, this does not seem to happen automatically in this library; bidi elements are not emitted.

While there's an ArabicScriptProcessor, we often don't know the specific language of a given paragraph.

Shouldn't this be more or less an automatic process, working out of the box?

plutext commented 5 months ago

If you are creating docx files, you need to set w:pPr/w:bidi and w:rPr/w:rtl appropriately, as well as w:rPr/w:lang.

See:

A program exporting docx then needs to be sensitive to these attributes. For example, docx4j's PDF output via FO should do this correctly.

If these attributes are not present, then the procedure recommended in your Strings on the Web reference might be a good fallback. (I wonder what Word does?)

Or are you suggesting that docx4j is the consumer and as such the methods to add text to a run at https://github.com/plutext/docx4j/blob/VERSION_11_4_12/docx4j-openxml-objects/src/main/java/org/docx4j/wml/R.java#L201 should set appropriate attributes?

malthe commented 5 months ago

We're basically adding a paragraph of text to the main document's content, providing a regular string when creating a Text object:

Text t = factory.createText();
t.setValue(string);

Now, implicitly and sometimes explicitly, Unicode text can be bidirectional, and it would be convenient if there were a way to create a content element from a string that automatically figured out whether bidi elements are necessary.

But we don't know if the string is Arabic, Hebrew, or a mix of languages, which is why something like an Arabic script processor doesn't really make sense. Ideally, this interface should simply follow the best practices for interpreting Unicode as a bidirectional language container.

plutext commented 5 months ago

Does https://docs.oracle.com/javase/7/docs/api/java/text/Bidi.html help you?
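As a rough illustration of how `java.text.Bidi` could be applied here, the sketch below derives a paragraph base direction and splits a string into directional runs. The `BidiRuns` and `Run` names are hypothetical helpers, not docx4j API. The assumption is that the caller would then set w:pPr/w:bidi when the paragraph is RTL and w:rPr/w:rtl on each run flagged as RTL:

```java
import java.text.Bidi;
import java.util.ArrayList;
import java.util.List;

final class BidiRuns {
    /** One directional run: a substring plus whether it is right-to-left. */
    static final class Run {
        final String text;
        final boolean rtl;
        Run(String text, boolean rtl) { this.text = text; this.rtl = rtl; }
    }

    /** Base direction from the first strong character, defaulting to LTR. */
    static boolean paragraphIsRtl(String s) {
        Bidi bidi = new Bidi(s, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        return !bidi.baseIsLeftToRight();
    }

    /** Splits s into runs in logical order; an odd embedding level means RTL. */
    static List<Run> split(String s) {
        List<Run> runs = new ArrayList<>();
        if (s.isEmpty()) {
            return runs;
        }
        Bidi bidi = new Bidi(s, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        for (int i = 0; i < bidi.getRunCount(); i++) {
            boolean rtl = (bidi.getRunLevel(i) & 1) == 1;
            String part = s.substring(bidi.getRunStart(i), bidi.getRunLimit(i));
            runs.add(new Run(part, rtl));
        }
        return runs;
    }
}
```

Each `Run` would map to one w:r element, so mixed-language strings need no prior knowledge of which language they contain.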