tlaplus / rfcs

RFCs for changes to the TLA+ specification language
MIT License
13 stars 0 forks source link

Proposal to add Unicode support to the TLA+ language standard and tools #5

Closed ahelwer closed 1 month ago

ahelwer commented 2 years ago

TLA+ Unicode Support Proposal

Motivation

TLA+ specifications can be translated into a "pretty-printed" form with LaTeX, but this is not how developers experience them when writing a spec. Within the past decade, UTF-8 has become so widely supported that any program limited to ASCII can be seen as deficient. Supporting Unicode in TLA+ provides two main benefits:

  1. Greater inclusivity of cultures where English is not the dominant language
  2. Improved readability while writing a spec

Proposed Changes

  1. Allowing a broad but restricted set of Unicode codepoints in identifiers, as in id == ...
  2. Supporting with a finite set of LaTeX-like \name symbols in identifiers, for example to indicate Greek alphabet characters
  3. Allowing arbitrary Unicode codepoints in strings
  4. Allowing specified alternative Unicode symbols for various keywords and operators

Challenges

  1. SANY does not currently support Unicode, and previous attempts to add it were met with difficulty
  2. It is difficult (although not impossible) to switch between ASCII and Unicode spec representations while maintaining the vertical alignment of conjunction & disjunction lists (henceforth called "jlists")
  3. As opposed to ASCII which basically looks good in any monospace font, not all commonly-used monospace fonts have aesthetically appealing Unicode codepoint renderings; nor do they render most Unicode codepoints with a fixed width
  4. Calculation of column position is important when parsing jlists; however, calculating the column position of a character is significantly more difficult in UTF-8 than ASCII, since Unicode codepoints have variable byte length and some codepoints (such as accent modifiers) have zero displayed width

Required Work

  1. Decide on canonical Unicode equivalents of ASCII keywords & operators (proposal here)
  2. Decide whether ASCII symbols should have a single canonical Unicode equivalent, or accept multiple possible codepoints (example: accept both and for -+->)
  3. Decide which subset of Unicode codepoints should be admitted in identifiers (ex. CJK character sets)
  4. Decide on the finite set of LaTeX-like symbols admitted in identifiers, and their Unicode equivalents (ex. \sigma and σ)
  5. Create a tool to convert specs between ASCII and Unicode while maintaining jlist structure (implementation using tree-sitter grammar complete)
  6. Create tools for various popular editors to rewrite ASCII symbols to their Unicode equivalents as the user types them (implemented for Neovim)
  7. Update syntax highlighting tools to accept Unicode symbol alternatives (implemented in tree-sitter-tlaplus)
  8. Update the SANY and TLAPM parsers to accept Unicode symbols

Prior Work & Discussion

  1. Recent discussion associated with tree-sitter grammar [link]
  2. Original thread by Ron Pressler several years ago [link]
  3. Toolbox beta release with Unicode support [link]
  4. Issue tracking prior work on the tlaplus tools [link]
  5. Vim plugin which displays symbols in Unicode instead of rewriting them [link]
xxyzzn commented 2 years ago

There are terabytes of program code posted on the web. It would be useful to do an unbiased sampling of it and see what fraction uses non-ascii characters.

Leslie

zwergziege commented 2 years ago

For the sake of completeness: Adding Unicode support in strings requires minor changes in the fingerprinting code of strings. See https://github.com/tlaplus/tlaplus/pull/685#discussion_r742118597

ahelwer commented 2 years ago

Spoke with some Chinese and Japanese language users who said they don't often use their language's characters in function/variable names for whatever reason, although it would be nice to support unicode strings for printing error messages with assert, for example.

Shreyas4991 commented 2 years ago

There are terabytes of program code posted on the web. It would be useful to do an unbiased sampling of it and see what fraction uses non-ascii characters. Leslie

In modern theorem provers and dependently typed languages, unicode math and symbol characters are extensively used. It is much cleaner to read.

ahelwer commented 1 month ago

This has been completed, although extending the set of valid identifiers to Greek characters was not done. Symbols for the Nat, Int, and Real sets were added however.