Unicode support in .ott files

PeterSewell commented 3 years ago

Some users would like to be able to use unicode in their sources, e.g. with emacs set-input-method TeX. I imagine that this wouldn't be too hard (though not a trivial change), as most of the Ott internals are just operating on byte sequences. One would have to adapt to unicode-friendly top-level lexing of the .ott source files (with https://github.com/ocaml-community/sedlex instead of ocamllex?), perhaps map the obvious unicode subset (e.g. whatever is supported by that input method) to actual TeX in the TeX output (unless tex distros all support well enough with standard packages), make sure the generated GLR parser is happy with unicode (perhaps a no-op, as it presumably just works on byte sequences and can continue to do so), and fix a few functions that check whether identifiers are alphanumeric etc.

PeterSewell commented 3 years ago

From a very casual look at LaTeX support for unicode math, there seem to be two basic approaches for the output to that: either use some version of LaTeX and packages that natively understand unicode, or have Ott translate. For the former, @ianstark points to https://ctan.org/pkg/unicode-math, but it says it needs XeTeX or LuaTeX and is restricted to a relatively small set of fonts, so perhaps the latter would be more generally useful. For that, one could probably reuse the unicode-to-latex mapping from that package http://mirrors.ctan.org/macros/unicodetex/latex/unicode-math/unicode-math-table.tex, or from http://milde.users.sourceforge.net/LUCR/Math/, or (perhaps more usable in practice?) from whatever the emacs TeX inputmethod uses, or the union of the last and whatever the vim analogue is (taking due care with licencing, obviously). One would have to decide how well to support combining characters. For prover backends, one should check what unicode they each support and provide a translation - probably based on the TeX names - where they don't.

PeterSewell commented 3 years ago

I'm hoping that someone else will take this on at some point. @neel-krishnaswami suggested that it would be useful.

goldfirere commented 1 year ago

I would find this useful. Though I think it's possible to break the task into two pieces:

Get the ott lexer to understand unicode input
Get the TeX formatter to output correctly

My use-case requires only (1), in that I'm happy to provide TeX homs for all my unicode inputs.

ott-lang / ott

Unicode support in .ott files #85