sile-typesetter / sile

The SILE Typesetter — Simon’s Improved Layout Engine
https://sile-typesetter.org
MIT License

U+00A0 (nbsp) processing in French causes unexpected word breaks #1948

Closed Omikhleia closed 9 months ago

Omikhleia commented 9 months ago

Affects SILE 0.14.14; regression introduced by #1918.

Ouch, my bad. Of course, by "mere mischance", it only showed up in a long text after several correct pages (the SBL Bible in USX), and not in my simple test cases for it. Doh.

Extracted MWE

\begin[papersize=a8, class=book]{document}
\neverindent
\nofolios
\language[main=fr]
Après « Dieu », l'hébreu comporte les deux lettres « Aleph Tav » (la première et la dernière lettre de l'alphabet hébreu), non pas comme un mot, mais comme un marqueur grammatical.
\end{document}

The input uses U+00A0 NO-BREAK SPACE characters after each « and before each » (4 occurrences here).

Observed

"Aleph" and "marqueur" are weirdly broken:

[Screenshot: rendered output showing the incorrect breaks]

Analysis

U+00A0 is 2 bytes long in UTF-8, and the internal position (`index`) of the items submitted to the node maker turns out to be a byte offset (starting at 0), not a character count.
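To make the byte-versus-character distinction concrete, here is a small standalone Lua sketch (not SILE code). The `codepoints` helper is a hypothetical name introduced only for this illustration; the strings are written with decimal escapes so the snippet does not depend on the file's encoding or on the Lua version.

```lua
-- U+00A0 NO-BREAK SPACE in UTF-8 is the two bytes 0xC2 0xA0.
local nbsp = "\194\160"
local open_quote = "\194\171"   -- « (U+00AB), also 2 bytes in UTF-8
local close_quote = "\194\187"  -- » (U+00BB), also 2 bytes in UTF-8

-- Hypothetical helper: count code points by skipping UTF-8
-- continuation bytes (0x80..0xBF).
local function codepoints (s)
  local n = 0
  for i = 1, #s do
    local b = s:byte(i)
    if b < 0x80 or b >= 0xC0 then
      n = n + 1
    end
  end
  return n
end

local text = open_quote .. nbsp .. "Dieu" .. nbsp .. close_quote

print(#nbsp, codepoints(nbsp)) -- 2 bytes, 1 character
print(#text, codepoints(text)) -- 12 bytes, 8 characters
```

A regular U+0020 space is the only case here where the byte length and the character count coincide, which is why simple test cases did not reveal the problem.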

Suggested fix:

--- a/languages/fr.lua
+++ b/languages/fr.lua
@@ -223,9 +223,13 @@ function SILE.nodeMakers.fr:iterator (items)
   local removed = 0
   for k = 1, #items do
     if self:mustRemove(k, items) then
-      removed = removed + 1
+      -- the index is actually a character position in the byte stream.
+      -- So we need to take its actual byte length into account.
+      -- For instance, U+00A0 NBSP is 2 bytes long (0xC2 0xA0) in UTF-8.
+      removed = removed + string.len(items[k].text)
     else
-      items[k].index = items[k].index - removed -- index has changed due to removals
+      -- index has changed due to removals
+      items[k].index = items[k].index - removed
       table.insert(cleanItems, items[k])
     end
   end

The former `+ 1` is only correct for regular (1-byte) spaces. For anything longer, such as U+00A0, it introduces an offset mismatch, which may or may not eventually lead to unexpected behavior.
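The mismatch can be reproduced outside SILE with a simplified model of the loop. This is only a sketch under assumptions: the item list is hand-built, the `remove` flag stands in for the real `mustRemove` check, and `cleanup` is a hypothetical function, not the actual node maker code. Only the `text` and `index` fields come from the patch above.

```lua
local nbsp = "\194\160"         -- U+00A0, 2 bytes in UTF-8
local open_quote = "\194\171"   -- «
local close_quote = "\194\187"  -- »

-- Hand-built items with 0-based byte offsets, as described in the analysis.
local function make_items ()
  return {
    { text = open_quote,  index = 0 },
    { text = nbsp,        index = 2, remove = true },
    { text = "Dieu",      index = 4 },
    { text = nbsp,        index = 8, remove = true },
    { text = close_quote, index = 10 },
  }
end

-- Simplified stand-in for the iterator loop; `step` decides how much each
-- removed item shifts the following indices.
local function cleanup (items, step)
  local cleanItems = {}
  local removed = 0
  for k = 1, #items do
    if items[k].remove then
      removed = removed + step(items[k])
    else
      table.insert(cleanItems, { text = items[k].text, index = items[k].index - removed })
    end
  end
  return cleanItems
end

local old = cleanup(make_items(), function () return 1 end)
local new = cleanup(make_items(), function (item) return string.len(item.text) end)

-- In the cleaned string «Dieu» the true byte offsets are 0, 2 and 6.
for k = 1, #new do
  print(old[k].index, new[k].index)
end
-- old: 0, 3, 8 (drifts by one byte per removed NBSP)
-- new: 0, 2, 6 (matches the cleaned string)
```

With the old increment, each removed NBSP leaves the following indices one byte too high, so the error grows with every occurrence in the paragraph.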