Ouch, my bad. Of course, by "mere mischance", it only popped up in a long text after several correct pages (the SBL Bible in USX), and not in my simple test cases for it. Doh.
Extracted MWE
\begin[papersize=a8, class=book]{document}
\neverindent
\nofolios
\language[main=fr]
Après « Dieu », l'hébreu comporte les deux lettres « Aleph Tav » (la première et la dernière lettre de l'alphabet hébreu), non pas comme un mot, mais comme un marqueur grammatical.
\end{document}
The text has a U+00A0 no-break space after each « and before each » (4 occurrences here).
Observed
"Aleph" and "marqueur" are weirdly broken.
Analysis
U+00A0 is 2 bytes long in UTF-8, and the index of each item submitted to the node maker turns out to be a byte offset into the text (starting at 0), not a count of characters.
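To make the byte-versus-character point concrete, here is a small standalone Lua check. It is only a sketch, not SILE code: the characters are written as decimal byte escapes so it does not depend on the source file encoding, and the variable names are mine.

```lua
-- U+00A0 NO-BREAK SPACE and U+00AB « as raw UTF-8 bytes.
local nbsp = "\194\160"      -- 0xC2 0xA0
local laquo = "\194\171"     -- 0xC2 0xAB

-- string.len (and the # operator) count bytes, not characters.
print(string.len(nbsp))      -- 2
print(string.len(" "))       -- 1

-- Positions returned by string.find are byte positions too (1-based in Lua).
local text = laquo .. nbsp .. "Dieu"
local pos = string.find(text, "Dieu", 1, true)
print(pos)                   -- 5: the "D" is the 3rd character but the 5th byte
```

So an offset that is decremented by 1 per removed item is only right when every removed item is exactly one byte long.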
Suggested fix:
--- a/languages/fr.lua
+++ b/languages/fr.lua
@@ -223,9 +223,13 @@ function SILE.nodeMakers.fr:iterator (items)
   local removed = 0
   for k = 1, #items do
     if self:mustRemove(k, items) then
-      removed = removed + 1
+      -- the index is actually a character position in the byte stream.
+      -- So we need to take its actual byte length into account.
+      -- For instance, U+00A0 NBSP is 2 bytes long (0xC2 0xA0) in UTF-8.
+      removed = removed + string.len(items[k].text)
     else
-      items[k].index = items[k].index - removed -- index has changed due to removals
+      -- index has changed due to removals
+      items[k].index = items[k].index - removed
       table.insert(cleanItems, items[k])
     end
   end
The former "+ 1" is only correct when the removed item is a regular (one-byte) space. For anything longer it introduces an offset mismatch, which may or may not lead to unexpected behavior later on.
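The mismatch can be reproduced outside SILE with a toy model. The following is a hedged sketch, not the actual node maker: `makeItems`, `mustRemove` and `clean` are hypothetical stand-ins, and the only assumption carried over from the patch is that each item has a `text` field and a 0-based byte `index`.

```lua
local nbsp = "\194\160"  -- U+00A0 as UTF-8 bytes

-- Build hypothetical items with 0-based byte offsets.
local function makeItems (chunks)
  local items, offset = {}, 0
  for _, text in ipairs(chunks) do
    items[#items + 1] = { text = text, index = offset }
    offset = offset + string.len(text)
  end
  return items
end

-- Hypothetical predicate: drop every NBSP item.
local function mustRemove (item)
  return item.text == nbsp
end

-- Same loop shape as the patch; `step` says how much one removal shifts the index.
local function clean (items, step)
  local cleanItems, removed = {}, 0
  for k = 1, #items do
    if mustRemove(items[k]) then
      removed = removed + step(items[k])
    else
      cleanItems[#cleanItems + 1] =
        { text = items[k].text, index = items[k].index - removed }
    end
  end
  return cleanItems
end

-- « NBSP Dieu NBSP », with the guillemets as raw bytes.
local chunks = { "\194\171", nbsp, "Dieu", nbsp, "\194\187" }
local old = clean(makeItems(chunks), function () return 1 end)
local new = clean(makeItems(chunks), function (item) return string.len(item.text) end)

print(old[2].index, new[2].index)  -- 3   2   ("Dieu" really starts at byte 2)
print(old[3].index, new[3].index)  -- 8   6   ("»" really starts at byte 6)
```

With "+ 1" the error grows by one byte per removed NBSP, which would be consistent with the problem only showing up further into a text.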
This is a regression in SILE 0.14.14, due to #1918.