Choose BNF syntax for describing the grammar

stasm commented 1 year ago

Currently, spec/message.ebnf is written using the so-called W3C EBNF. It's one of a few BNF variations, commonly used in W3C and Unicode (I think?).

One of the nice-to-have reasons for picking it was that it's also supported by REx, an online parser generator, as well as the Railroad Diagram Generator, both by Gunther Rademacher. Having good tool support and being able to immediately test grammar ideas was beneficial to the rapid development-style of the syntax design process that we went through last year.

That said, I haven't been able to find other tools that supports this variant of BNF out of the box. This makes me a bit uncomfortable. Ideally, we would be able to define the grammar in a way that allows an arbitrary text snippet to be validated as MF2 syntax, parse it into a concrete syntax tree (CST), visualize both the rules and such tree, and even generate random strings that match the grammar, for the purpose of fuzz testing. I'm a bit disappointed by the state of the tooling in this regard.

The original BNF which doesn't have some quality-of-life improvements of its successors, like syntax for alternatives or repetitions.
ABNF (Augmented BNF) which adds alternatives (foo / bar), optional symbols ([foo]), and repetitions (n*m(foo)).
- ABNF has been standardized as RFC 5234.
- This variant is also defined at https://www.w3.org/Notation.html.
Despite the above, some W3C standards (notably XML) use a different notation, which is commonly referred to as the W3C EBNF.
- It uses | for alternatives, and Kleene operators for optionals and repetitions (?, *, +).
- This variant is also used in some of Unicode's documents.
There's also another thing called EBNF defined by ISO/IEC 14997, sometimes also called the Wirth syntax notation. It uses | for alternatives, brackets for optional symbols ([foo]), and curly braces for repetitions ({foo}).
- Trivia: This one is based on Niklaus Wirth's 1977 paper titled What can we do about the unnecessary diversity of notation for syntactic definitions? I sympathize with the question!
There exist other variations, too, each with a different set of "minor modifications". E.g. a legacy search server in Windows used to have its own EBNF. Interestingly, it defined a special syntax for denoting a requiring space, which is something I wish was easily possible in other EBNFs (see #340).
Lastly, there are BNF variants oriented towards parser generation, like the ones used by Yacc and Bison. They include snippets of C code ("actions") which are used to control the shape of the parse tree.

Outside the realm of context-free grammars (CFGs), it looks like parser expression grammars (PEGs) are also a popular choice for defining grammars of programming languages. E.g. Python switched to a PEG in PEP 617.

There's also Tree-sitter developed by GitHub, which uses parser combinators written in JavaScript to define grammars, from which it can then generate parsers.

I'm opening this issue to discuss what requirements we have for the formal grammar of MessageFormat and to choose one of the available formats.

gibson042 commented 1 year ago

Possibly related: UTS #35 is not consistent about "…BNF"

RFC 5234 + RFC 7405 ABNF or W3C XML Notation are both reasonable choices, although it's worth noting that the latter has no mechanism for bounded repetition such as the current alphanum{3,8} or equivalent RFC ABNF 3*8 alphanum.

mihnita commented 1 year ago

supported by REx, an online parser generator

Note that it is also available as (Java) source, to run offline: https://www.bottlecaps.de/rex/REx.java

The main page says

Command line client Use REx.java instead of this form for invoking REx from a command shell.

stasm commented 1 year ago

Unfortunately, that file is just a CLI tool which makes an HTTP request to the remote server. There's no logic in the file itself.

stasm commented 1 year ago

I'm currently leaning towards choosing ABNF. It's very well defined thanks to RFC 5234 and RFC 7405, and it looks like there are multiple tools in Java, Python, C, and JavaScript which claim to follow the RFC, which is promising.

I also found a fuzzing tool for it: https://www.quut.com/abnfgen/.

alerque commented 1 year ago

Just a couple links to throw in because this issue has a good summary of {,A,B}BNF variants and their usages.

You linked the Railroad Diagram generator, but that same domain has a useful grammar conversion tool: https://www.bottlecaps.de/convert/
There are even more EBNF derivatives out there, for example Lark. The motivation for Lark and many others seems to be downstream tooling trying to automatically do things with grammars often find the {,A,B}BNF deficient for what they want to accomplish and so they clean them up or extend them a little bit to work for their particular target use case.

stasm commented 1 year ago

You linked the Railroad Diagram generator, but that same domain has a useful grammar conversion tool: https://www.bottlecaps.de/convert/

This is great, thanks for the pointer. The converter is one-way, to W3C EBNF, and one of the supported input grammars is ABNF. I was just able to test it with my WIP of the ABNF rewrite and it worked well. This means we can use ABNF and, when neeeded, convert to W3C EBNF and still use REx and the Railroad Diagram generator.

unicode-org / message-format-wg

Choose BNF syntax for describing the grammar #342