olsak / OpTeX

OpTeX - LuaTeX format with extended Plain TeX macros
http://petr.olsak.net/optex/

allow codepoints >65535 in PDF outlines #22

Closed vlasakm closed 3 years ago

vlasakm commented 3 years ago

Originally, this was supposed to be an issue. For context, here is the intended issue message:

Consider the following example:

\fontfam[lm]

\insertoutline{x∈𝕄}

\bye

The produced outline wrongly says x∈필 ((\376\377\000x\042\010\725\104) in the PDF) instead of x∈𝕄 ((\376\377\000x\042\010\330\065\335\104) in the PDF). The problem is the 𝕄 at the end: \725\104 vs. \330\065\335\104.

This is because the macro \_octalprint in pdfuni-string.opm can't encode characters that need two byte pairs (code points >=65536). Despite its name, UTF-16 does not encode every Unicode code point in a single byte pair (16 bits); higher code points require two pairs of bytes (32 bits). This is outlined in the example at https://cs.wikipedia.org/wiki/UTF-16#P%C5%99%C3%ADklad for characters with code points <256, <65536, and >=65536 (ranges that also occur in the example above).
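As a quick check of the numbers above, here is a minimal Python sketch (not part of OpTeX). `pdf_octal` and `old_style_octal` are made-up names. The rule that only ASCII letters and digits stay literal is an assumption based on the single `x` in the example, and `old_style_octal` is only a guess at how the old macro arrives at \725\104: it splits the code point into a "high" and "low" part without checking that the high part fits in one byte.

```python
import struct


def pdf_octal(text: str) -> str:
    """Octal-escaped PDF literal string for `text`, encoded as UTF-16BE with a BOM."""
    data = text.encode("utf-16-be")
    units = struct.unpack(">%dH" % (len(data) // 2), data)  # 16-bit code units
    parts = ["\\376\\377"]  # byte order mark FE FF
    for unit in units:
        if unit < 0x80 and chr(unit).isalnum():
            # Assumption: ASCII letters/digits are written as \000 plus the character.
            parts.append("\\000" + chr(unit))
        else:
            parts.append("\\%03o\\%03o" % (unit >> 8, unit & 0xFF))
    return "(" + "".join(parts) + ")"


def old_style_octal(codepoint: int) -> str:
    """Hypothetical model of the bug: no surrogate pair, high part may exceed 0xFF."""
    return "\\%03o\\%03o" % (codepoint >> 8, codepoint & 0xFF)


example = "x\u2208\U0001D544"  # x, ELEMENT OF, MATHEMATICAL DOUBLE-STRUCK CAPITAL M

print(pdf_octal(example))        # (\376\377\000x\042\010\330\065\335\104)
print(old_style_octal(0x1D544))  # \725\104
print(chr(0x1D544 & 0xFFFF))     # the Hangul syllable U+D544 seen in the broken outline
```

A PDF reader that drops the overflow of \725 to one byte ends up with the bytes D5 44, i.e. U+D544, which matches the wrong character in the outline.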

While revisiting this, I think it would be better to use hexadecimal strings (<FEFF00782208D835DD44> for the example above), which are arguably more readable and also more straightforward to implement. Another possibility is to use the raw bytes directly in (), which is even more compact (12 bytes for the above string including parentheses, compared to 22 bytes for hexadecimal and 39 for octal), but apart from being binary it also brings the trouble of escaping \, ( and ).
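The three sizes can be reproduced with a small Python sketch (again not part of OpTeX; `raw_literal` is a made-up helper). It escapes only \, ( and ), as mentioned above, and ignores other details a real PDF writer would have to consider, such as end-of-line bytes inside literal strings. The octal string is copied verbatim from the example rather than generated.

```python
def raw_literal(payload: bytes) -> bytes:
    """Wrap raw bytes in (...) and backslash-escape the bytes \\, ( and )."""
    out = bytearray(b"(")
    for byte in payload:
        if byte in b"\\()":  # these three bytes would otherwise break the string
            out.append(0x5C)  # backslash
        out.append(byte)
    out.append(0x29)  # closing parenthesis
    return bytes(out)


text = "x\u2208\U0001D544"  # the string from the example above
payload = b"\xfe\xff" + text.encode("utf-16-be")  # BOM + UTF-16BE

hex_string = "<" + payload.hex().upper() + ">"
octal_string = r"(\376\377\000x\042\010\330\065\335\104)"  # copied from the issue text

print(hex_string, len(hex_string))  # <FEFF00782208D835DD44> 22
print(len(raw_literal(payload)))    # 12
print(len(octal_string))            # 39

# Escaping in action: "(" is U+0028, whose low byte 0x28 must be escaped.
print(raw_literal(b"\xfe\xff" + "(".encode("utf-16-be")))
```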

The readability or compactness of the resulting PDF string probably doesn't matter that much, since it is usually compressed by the engine anyway.

Here is the commit message:

Previous implementation didn't handle codepoints above 65535, which in UTF-16BE require two pairs of bytes.

This implementation uses hexadecimal PDF strings instead of literal strings with escaping. This allows easy translation of code points.

_hexprint could be more efficient. As it stands, for each and every character the _directlua contents are converted to a TeX token list (with catcode and macro handling, etc.), that information is then discarded, and the result is converted to a string, passed to Lua, compiled to bytecode and executed. A more efficient scheme would require changes to how OpTeX handles Lua at the core, so not this time. At least this is more efficient than the replaced code.

As I say in the commit message, this is by no means efficient, although hopefully there aren't that many outlines / PDF strings.

My first idea was to implement and define the _hexprint command in Lua (essentially as a \luafunction, and then \luadef it). But this would be better with an allocator for more of these, which means more Lua code that also needs to be loaded in the correct order and preloaded into the format (also in the right order).

A pure Lua string-to-string conversion is also possible, although I would have to learn how to do it efficiently because of string interning.

Pure TeX conversion character by character would also be possible, so feel free to reject this version (or replace the Lua code in additional commits to this pull request).

If you decide to merge (with additional commits or without), consider "squash and merge" or "rebase and merge". (https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/about-pull-request-merges)

vlasakm commented 3 years ago

Slight correction: even though white space in hexadecimal strings should be ignored, it is safer to document it as forbidden because of how LuaTeX handles PDF strings. (The implementation, however, doesn't emit any white space.)

olsak commented 3 years ago

Thank you, I'll do more tests with it.

olsak commented 3 years ago

I noticed that pdfinfo data (pdfinfo ctustyle-doc.pdf) is corrupted after this commit.

olsak commented 3 years ago

The problem was that the original \pdfunidef didn't generate the outer (...), whereas the proposed \pdfunidef does generate the outer <...>. This doesn't matter when the \pdfoutline primitive is used, but \pdfinfo needs the correct syntax. \pdfinfo is used only in ctustyle3.tex (not in the OpTeX sources). I corrected ctustyle3.tex (see the new commit) and fully accepted this pull request. Thank you.

vlasakm commented 3 years ago

There is one catch that I forgot to mention. I used the bitwise operators & and >>, which require Lua 5.3. Lua 5.3 has been the default since LuaTeX 1.08 (2018-08-28), but in TeX Live only since TeX Live 2019 (LuaTeX 1.10.0, 2019-03-15).

If backwards compatibility with TeX Live 2018 and earlier is required, the built-in bit32 library can be used instead (although it would require more care to make sure it works on both Lua 5.3 and Lua 5.2).

Sorry for the trouble.

vlasakm commented 3 years ago

The following should fix a stupid bug I introduced and also allow old LuaTeX versions with Lua 5.2 (I tested with Version 1.07.0 (TeX Live 2018), Compiled with lua version 5.2.4).

--- a/optex/base/pdfuni-string.opm
+++ b/optex/base/pdfuni-string.opm
@@ -8,15 +8,15 @@
    \_cod -----------------------------

 \bgroup
-\_catcode`\&=12 \_catcode`\%=12
+\_catcode`\%=12
 \_gdef\_hexprint{\_directlua{
    local num = token.scan_int()
-   if num <= 0x10000 then
+   if num < 0x10000 then
       tex.print(string.format("%04X", num))
    else
       num = num - 0x10000
-      local high = (num >> 10) + 0xD800
-      local low = (num & 0x3FF) + 0xDC00
+      local high = bit32.rshift(num, 10) + 0xD800
+      local low = bit32.band(num, 0x3FF) + 0xDC00
       tex.print(string.format("%04X%04X", high, low))
    end
 }}

Once again, sorry for the trouble.
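
The boundary change in the diff (<= to <) matters only for U+10000 itself, which is the first code point that needs a surrogate pair. The following Python sketch (a port for illustration, not the OpTeX code; the name `utf16be_hex` is made up) mirrors the patched logic and compares it against Python's built-in UTF-16BE codec at the edges of both ranges.

```python
def hexprint(num: int) -> str:
    """Python port of the patched Lua logic: UTF-16BE code units as uppercase hex."""
    if num < 0x10000:
        return "%04X" % num
    num -= 0x10000
    high = (num >> 10) + 0xD800   # top 10 bits -> high surrogate
    low = (num & 0x3FF) + 0xDC00  # bottom 10 bits -> low surrogate
    return "%04X%04X" % (high, low)


def utf16be_hex(codepoint: int) -> str:
    """Reference result from Python's own codec (illustrative helper)."""
    return chr(codepoint).encode("utf-16-be").hex().upper()


for cp in (0x78, 0x2208, 0xFFFF, 0x10000, 0x1D544, 0x10FFFF):
    assert hexprint(cp) == utf16be_hex(cp)
    print("U+%04X -> %s" % (cp, hexprint(cp)))

# With the old "<=" comparison, U+10000 would take the first branch:
print("%04X" % 0x10000)  # 10000 (five hex digits, not a valid 16-bit unit)
```

With the corrected comparison U+10000 comes out as D800DC00, and U+1D544 as D835DD44, as in the example at the top of this thread.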

olsak commented 3 years ago

Thanks for noticing. I corrected this in my commit https://github.com/olsak/OpTeX/commit/8414eadf4f568cef2670601a30e71b266c88c326. I have now tested it with old LuaTeX.

There is another issue: I am not sure how to process the character code U+10000. Your Lua code processes it as 16-bit, but maybe that is wrong?

vlasakm commented 3 years ago

You probably missed my last comment. Note that 8414eadf4f568cef2670601a30e71b266c88c326 doesn't work in Lua 5.3.

$ lua5.2
Lua 5.2.4  Copyright (C) 1994-2015 Lua.org, PUC-Rio
> print(string.format("%04X", 0x1D544 / 1024))
0075
$ lua5.3
Lua 5.3.6  Copyright (C) 1994-2020 Lua.org, PUC-Rio
> print(string.format("%04X", 0x1D544 / 1024))
stdin:1: bad argument #2 to 'format' (number has no integer representation)
stack traceback:
        [C]: in function 'string.format'
        stdin:1: in main chunk
        [C]: in ?

olsak commented 3 years ago

OK, I hope that this issue is closed by my last commit. Obscure Lua.