unicode-org / message-format-wg

Developing a standard for localizable message strings

Consider adding a token grammar #729

Open catamorphism opened 5 months ago

catamorphism commented 5 months ago

Programming languages typically have both a lexical grammar, which describes the structure of tokens, and a token grammar, which describes the syntax of the language and in which terminals are tokens.

The MessageFormat grammar is a lexical grammar; there is no token grammar. Or, perhaps, a single grammar serves as both (with tokens as single characters), depending on how you look at it.

In the future, I think it would be worth refactoring the grammar to create a lexical grammar that describes tokens (for most languages this can be done with regexps), and then building a token grammar on top of it. This separates the description of the syntax from the details of where required and optional whitespace go, for example.
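To make the split concrete, here is a minimal sketch in Python. It is not the MessageFormat grammar: the token kinds, the regexps, and the rule `expression ::= LBRACE VARIABLE [FUNCTION option*] RBRACE` with `option ::= NAME EQUALS NAME` are all hypothetical simplifications invented for illustration. The lexer is the only place that knows about whitespace; the parser sees tokens only.

```python
import re
from typing import Iterator, List, NamedTuple, Optional


class Token(NamedTuple):
    kind: str
    text: str


# Hypothetical lexical grammar: one named regex alternative per token kind.
TOKEN_RE = re.compile(r"""
    (?P<WS>\s+)
  | (?P<LBRACE>\{)
  | (?P<RBRACE>\})
  | (?P<VARIABLE>\$[A-Za-z_][A-Za-z0-9_-]*)
  | (?P<FUNCTION>:[A-Za-z_][A-Za-z0-9_-]*)
  | (?P<EQUALS>=)
  | (?P<NAME>[A-Za-z_][A-Za-z0-9_-]*)
""", re.VERBOSE)


def tokenize(source: str) -> Iterator[Token]:
    """Lexer: split the input into tokens and drop whitespace."""
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        pos = m.end()
        if m.lastgroup != "WS":
            yield Token(m.lastgroup, m.group())


def parse_expression(tokens: List[Token]) -> dict:
    """Parser for the hypothetical token grammar; terminals are token kinds."""
    pos = 0

    def peek() -> Optional[str]:
        return tokens[pos].kind if pos < len(tokens) else None

    def expect(kind: str) -> str:
        nonlocal pos
        if peek() != kind:
            raise SyntaxError(f"expected {kind}, found {peek() or 'end of input'}")
        text = tokens[pos].text
        pos += 1
        return text

    expect("LBRACE")
    variable = expect("VARIABLE")
    function = None
    options = {}
    if peek() == "FUNCTION":
        function = expect("FUNCTION")
        while peek() == "NAME":
            key = expect("NAME")
            expect("EQUALS")
            options[key] = expect("NAME")
    expect("RBRACE")
    if pos != len(tokens):
        raise SyntaxError("unexpected tokens after expression")
    return {"variable": variable, "function": function, "options": options}
```

With this sketch, `parse_expression(list(tokenize("{ $count :number style=percent }")))` and the same call on `"{$count :number   style = percent}"` produce the same result, because the whitespace differences never reach the parser. One limitation of dropping whitespace in the lexer is that a rule requiring whitespace between two particular tokens can no longer be expressed; a real token grammar would need to decide how to handle that.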

For example (not the simplest one, unfortunately), see the JavaScript lexical grammar and the token grammar for expressions (the entire token grammar is split across a few different chapters of the JS spec).

Probably some of the other implementations already use a separate lexer and parser, but I chose to write a combined lexer and parser for MessageFormat so that I could tell if I was following the spec exactly (and because I already had to hand-write the parser, parser generators not being a good option in ICU4C). Without a token grammar as part of the spec, it's hard to do that (writing a separate lexer and parser effectively introduces an ad hoc token grammar).

The trouble with the approach I chose is that there are many apparent syntactic ambiguities involving whitespace, which would probably be much easier to handle in an implementation that tokenizes the input before parsing it. Having a separate token grammar would both make it easier for implementors to verify that their front-ends conform to the spec, and make it easier for everyone to understand the syntax.
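As an illustration of the kind of whitespace decision a combined lexer and parser has to make, here is a minimal character-level sketch in Python. The rule it parses, `*(s option) [s] "}"`, is a simplified, hypothetical stand-in and not the MessageFormat ABNF, and the function names are invented. After consuming whitespace, the parser cannot tell whether that whitespace was the required separator before another option or the optional padding before the closing brace until it looks at the next character.

```python
import re

# Hypothetical name syntax, used for both option names and option values.
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_-]*")


def skip_ws(src: str, pos: int) -> int:
    """Return the offset of the first non-whitespace character at or after pos."""
    while pos < len(src) and src[pos].isspace():
        pos += 1
    return pos


def parse_options_and_close(src: str, pos: int):
    """Character-level parse of  *(s option) [s] "}"  starting at pos.

    Returns (options, offset just past the closing brace).
    """
    options = {}
    while True:
        after_ws = skip_ws(src, pos)
        saw_ws = after_ws > pos
        # Only now can the whitespace be classified: optional padding
        # before "}" or the required separator before another option.
        if after_ws < len(src) and src[after_ws] == "}":
            return options, after_ws + 1
        if not saw_ws:
            raise SyntaxError(f"expected whitespace or '}}' at offset {pos}")
        key = NAME.match(src, after_ws)
        if key is None:
            raise SyntaxError(f"expected option name at offset {after_ws}")
        pos = skip_ws(src, key.end())
        if pos >= len(src) or src[pos] != "=":
            raise SyntaxError(f"expected '=' at offset {pos}")
        pos = skip_ws(src, pos + 1)
        value = NAME.match(src, pos)
        if value is None:
            raise SyntaxError(f"expected option value at offset {pos}")
        options[key.group()] = value.group()
        pos = value.end()
```

In a tokenizing front-end the same decision reduces to checking whether the next token is a name or a closing brace, since whitespace has already been dealt with by the lexer.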

eemeli commented 5 months ago

This was closed by mistake.