plutext / docx4j

JAXB-based Java library for Word docx, Powerpoint pptx, and Excel xlsx files
https://www.docx4java.org/

export fo Chinese pPr justification #452

Open plutext opened 3 years ago

plutext commented 3 years ago

Justification is lost. See https://github.com/plutext/docx4j/issues/451 for sample files

plutext commented 3 years ago

Should work OK for paragraphs containing Chinese characters only, since in most if not all fonts, Chinese glyphs are fixed width (so nothing needs to be done).

Punctuation is also OK in most Chinese fonts (fixed width again).

Problems occur if you add roman characters (such as English words).
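A minimal sketch of how an exporter could detect the problem case (a paragraph mixing Han and Latin characters) before deciding whether any justification work is needed. The class and method names are hypothetical and are not part of docx4j or FOP; only java.lang.Character.UnicodeScript from the JDK is used.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper, not a docx4j or FOP API.
final class ScriptMix {

    /** True if the text contains at least one Han ideograph. */
    static boolean containsHan(String text) {
        return text.codePoints()
                .anyMatch(cp -> UnicodeScript.of(cp) == UnicodeScript.HAN);
    }

    /** True if the text contains at least one Latin-script letter. */
    static boolean containsLatin(String text) {
        return text.codePoints()
                .anyMatch(cp -> UnicodeScript.of(cp) == UnicodeScript.LATIN);
    }

    /**
     * Assumption: only paragraphs mixing Han and Latin need extra
     * justification handling; Han-only paragraphs already sit on a grid.
     */
    static boolean needsMixedScriptJustification(String text) {
        return containsHan(text) && containsLatin(text);
    }
}
```

Digits and most punctuation have the Unicode script Common, so this check deliberately ignores them; a real implementation would probably want to treat European numerals as Western text as well.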

https://groups.google.com/g/chinesemac/c/1haGW9aKHis?pli=1 says

for Latin text and punctuation, the only monospaced Chinese fonts that I'm aware of are NSimSun (as opposed to the proportional SimSun) and MingLiU (as opposed to the proportional PMingLiU), which come with Windows. I'd try them. NSimSun might be a better choice.

I tried NSimSun; the English glyphs might be monospaced, but they are a different width to the Chinese ones, so more effort is required.

I also noticed that FOP seems to add character spacing only after the line breaks have been calculated.

plutext commented 3 years ago

https://www.w3.org/International/articles/typography/justification.en says:

Historically, Chinese was written as Han ideographs, with no punctuation. Under this system, justification was automatic, as the characters fit perfectly into a square grid, and lines could wrap between any two characters. However, the introduction of punctuation in recent centuries, along with its accompanying line-breaking restrictions, plus the increase in mixed-script text (such as the inclusion of European numbers and/or words, phrases, names, and trademarks) has created a need for adjustments within a line.

Punctuation introduced line-breaking restrictions such as not starting a line with a period or closing parentheses; and Latin text, while sometimes typeset in a full-width character style with Chinese-style line-breaking, is also frequently typeset with proportional fonts and line-wrapped or hyphenated according to its usual rules, breaking the Chinese grid. These newer developments thus open up space at the end of a line, which justification needs to deal with.

Chinese notably does not use word spaces, so these do not provide a justification opportunity within the lines; thus justification techniques focus on adjustments to spacing around punctuation, script-change boundaries, and inter-character spacing.

https://github.com/w3c/i18n-drafts/issues/138

Another justification mode for Han is for embedded proportional scripts to be padded with equal space on both sides (mid-sentence) or on one side (start of line) and between words, such that the total width is [n-char + (n-1)-interword space], so the character grid is preserved, with hanging punctuation.
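A rough sketch of the arithmetic behind that grid-preserving mode, under one possible reading of the quote: an embedded Western run of measured width w is padded up to the width of n whole grid cells, n * c + (n - 1) * g, where c is the Han character width and g is the spacing between adjacent Han characters. The class and method names are hypothetical, and putting the padding after the run at the start of a line is also an assumption.

```java
// Hypothetical helper illustrating one reading of the grid-padding rule.
final class GridPadding {

    /** Smallest n such that n * hanWidth + (n - 1) * gap >= runWidth. */
    static int cellsNeeded(double runWidth, double hanWidth, double gap) {
        int n = (int) Math.ceil((runWidth + gap) / (hanWidth + gap));
        return Math.max(1, n);
    }

    /** Extra space needed to stretch the run to a whole number of cells. */
    static double totalPadding(double runWidth, double hanWidth, double gap) {
        int n = cellsNeeded(runWidth, hanWidth, gap);
        return n * hanWidth + (n - 1) * gap - runWidth;
    }

    /**
     * Returns {before, after}. Mid-sentence the padding is split equally;
     * at the start of a line it all goes after the run (an assumption),
     * so the run stays flush with the line edge.
     */
    static double[] split(double padding, boolean atLineStart) {
        return atLineStart
                ? new double[] {0.0, padding}
                : new double[] {padding / 2.0, padding / 2.0};
    }
}
```

For example, a 30pt English word in 12pt Chinese text with no extra inter-character spacing needs 3 cells (36pt), so 6pt of padding: 3pt on each side mid-sentence, or 6pt after the word at the start of a line.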

plutext commented 3 years ago

See https://www.w3.org/TR/clreq/ (Requirements for Chinese Text Layout, W3C Working Draft, 01 November 2020), especially:

https://www.w3.org/TR/clreq/#mixed_text_composition_in_horizontal_writing_mode says: "there is tracking or spacing between an adjacent Han character and a Western character of up to one quarter of a Han character width, except at the line start or end."
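To make the quarter-width rule concrete, here is a small sketch that finds the Han/Western boundaries inside a single, already-broken line and computes the maximum extra width the rule would allow. Because only gaps between two characters on the same line are examined, nothing is added at the line start or end. The names are hypothetical, and treating "Western" as Latin-script letters plus ASCII digits is an assumption.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a docx4j or FOP API.
final class HanWesternSpacing {

    static boolean isHan(int cp) {
        return UnicodeScript.of(cp) == UnicodeScript.HAN;
    }

    /** Assumption: "Western" means Latin letters or ASCII digits. */
    static boolean isWestern(int cp) {
        return UnicodeScript.of(cp) == UnicodeScript.LATIN
                || (cp >= '0' && cp <= '9');
    }

    /** Gap index i means "between code point i and code point i + 1". */
    static List<Integer> boundaryGaps(String line) {
        int[] cps = line.codePoints().toArray();
        List<Integer> gaps = new ArrayList<>();
        for (int i = 0; i + 1 < cps.length; i++) {
            boolean hanThenWestern = isHan(cps[i]) && isWestern(cps[i + 1]);
            boolean westernThenHan = isWestern(cps[i]) && isHan(cps[i + 1]);
            if (hanThenWestern || westernThenHan) {
                gaps.add(i);
            }
        }
        return gaps;
    }

    /** Upper bound on added width: a quarter of a Han width per boundary. */
    static double maxExtraWidth(String line, double hanWidth) {
        return boundaryGaps(line).size() * hanWidth / 4.0;
    }
}
```

A line break calculation that ignores this extra width would overfill the line once the spacing is applied afterwards, which may be related to the FOP behaviour noted earlier in this thread.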

https://www.w3.org/TR/clreq/#handling_western_text_in_chinese_text_using_proportional_western_fonts includes: "Justified text alignment is an important feature of Chinese composition. It is harder to align text as expected when a line contains Western characters. Typically, spacing or tracking is applied equally across the line, but such adjustments are only applied between Han characters or between Han and Western letters. The spacing is not equally distributed between characters in Western words and/or European numerals."

Then there is stuff about grid alignment, together with the note: "Grid alignment is adopted more often in Traditional Chinese typesetting, whereas use in Simplified Chinese is rare."

Chrome handles CSS text-align: justify correctly, so maybe there's inspiration there for FOP.