Closed stasm closed 1 year ago
Possibly related: UTS #35 is not consistent about "…BNF"
RFC 5234 + RFC 7405 ABNF or W3C XML Notation are both reasonable choices, although it's worth noting that the latter has no mechanism for bounded repetition such as the current `alphanum{3,8}` or the equivalent RFC ABNF `3*8alphanum`.
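To make the equivalence concrete, here is a minimal Python sketch of how an RFC 5234 repetition prefix maps onto a brace-style quantifier. It is only an illustration and not taken from any tool mentioned in this thread: the helper name `abnf_repeat_to_quantifier` is hypothetical, and `alphanum` is assumed to mean ASCII letters and digits.

```python
import re

# Matches the "n*m", "n*", "*m" and "*" forms of an ABNF repetition prefix.
_RANGE = re.compile(r"([0-9]*)\*([0-9]*)")


def abnf_repeat_to_quantifier(repeat: str) -> str:
    """Translate an RFC 5234 repetition prefix into a regex-style quantifier.

    Hypothetical helper, written for illustration only.
    """
    # A bare number such as "5" means "exactly 5 times".
    if re.fullmatch(r"[0-9]+", repeat):
        return "{%s}" % repeat
    match = _RANGE.fullmatch(repeat)
    if match is None:
        raise ValueError(f"not an ABNF repetition prefix: {repeat!r}")
    low = match.group(1) or "0"  # a missing lower bound defaults to 0
    high = match.group(2)        # a missing upper bound means "unbounded"
    if high == "":
        if low == "0":
            return "*"
        if low == "1":
            return "+"
        return "{%s,}" % low
    return "{%s,%s}" % (low, high)


# Assumption: alphanum is ASCII letters and digits.
ALPHANUM = "[A-Za-z0-9]"
pattern = re.compile(ALPHANUM + abnf_repeat_to_quantifier("3*8"))
```

With these definitions, `pattern` accepts strings of three to eight alphanumeric characters, which is what both `alphanum{3,8}` and `3*8alphanum` describe. A notation with only `?`, `*`, and `+` would have to spell the same rule out as three mandatory symbols followed by five optional ones.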
> supported by REx, an online parser generator
Note that it is also available as (Java) source, to run offline: https://www.bottlecaps.de/rex/REx.java
The main page says:
> Command line client: Use REx.java instead of this form for invoking REx from a command shell.
Unfortunately, that file is just a CLI tool which makes an HTTP request to the remote server. There's no logic in the file itself.
I'm currently leaning towards choosing ABNF. It's very well defined thanks to RFC 5234 and RFC 7405, and it looks like there are multiple tools in Java, Python, C, and JavaScript which claim to follow the RFC, which is promising.
I also found a fuzzing tool for it: https://www.quut.com/abnfgen/.
Just a couple of links to throw in because this issue has a good summary of {,A,B}BNF variants and their usages.
You linked the Railroad Diagram generator, but that same domain has a useful grammar conversion tool: https://www.bottlecaps.de/convert/
There are even more EBNF derivatives out there, for example Lark. The motivation for Lark and many others seems to be that downstream tools trying to do things with grammars automatically often find the {,A,B}BNF variants deficient for what they want to accomplish, so they clean them up or extend them a little to work for their particular use case.
> - You linked the Railroad Diagram generator, but that same domain has a useful grammar conversion tool: https://www.bottlecaps.de/convert/
This is great, thanks for the pointer. The converter is one-way, to W3C EBNF, and one of the supported input grammars is ABNF. I was just able to test it with my WIP of the ABNF rewrite and it worked well. This means we can use ABNF and, when needed, convert to W3C EBNF and still use REx and the Railroad Diagram generator.
Currently, `spec/message.ebnf` is written using the so-called W3C EBNF. It's one of a few BNF variations, commonly used in W3C and Unicode (I think?). One of the nice-to-have reasons for picking it was that it's also supported by REx, an online parser generator, as well as the Railroad Diagram Generator, both by Gunther Rademacher. Having good tool support and being able to immediately test grammar ideas was beneficial to the rapid-development style of the syntax design process that we went through last year.
That said, I haven't been able to find other tools that support this variant of BNF out of the box. This makes me a bit uncomfortable. Ideally, we would be able to define the grammar in a way that allows us to validate an arbitrary text snippet as MF2 syntax, parse it into a concrete syntax tree (CST), visualize both the rules and the resulting tree, and even generate random strings that match the grammar, for the purpose of fuzz testing. I'm a bit disappointed by the state of the tooling in this regard.
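As a rough illustration of the fuzzing requirement, the sketch below generates random strings from a grammar held as a plain Python data structure. Everything in it is an assumption made for the example: the toy grammar is not the MessageFormat grammar, and the node shapes (`lit`, `ref`, `seq`, `alt`, `rep`) and the `generate` function are hypothetical, not the API of any tool named in this issue.

```python
import random
import re

# Toy grammar, loosely ABNF-shaped. It is NOT the MessageFormat grammar.
#   greeting = "Hello, " name *1"!"
#   name     = 3*8alpha
#   alpha    = "a" / "b" / "c"
GRAMMAR = {
    "greeting": ("seq", [("lit", "Hello, "),
                         ("ref", "name"),
                         ("rep", 0, 1, ("lit", "!"))]),
    "name": ("rep", 3, 8, ("ref", "alpha")),
    "alpha": ("alt", [("lit", "a"), ("lit", "b"), ("lit", "c")]),
}


def generate(node, rng):
    """Return one random string derived from the given grammar node."""
    kind = node[0]
    if kind == "lit":    # literal text
        return node[1]
    if kind == "ref":    # reference to a named rule
        return generate(GRAMMAR[node[1]], rng)
    if kind == "seq":    # concatenation of all children
        return "".join(generate(child, rng) for child in node[1])
    if kind == "alt":    # pick one alternative uniformly at random
        return generate(rng.choice(node[1]), rng)
    if kind == "rep":    # bounded repetition, both bounds inclusive
        _, low, high, child = node
        count = rng.randint(low, high)
        return "".join(generate(child, rng) for _ in range(count))
    raise ValueError(f"unknown node kind: {kind!r}")


# A fixed seed keeps the run reproducible.
rng = random.Random(0)
samples = [generate(GRAMMAR["greeting"], rng) for _ in range(5)]

# Hand-written regex equivalent of the toy grammar, used as a sanity check.
CHECK = re.compile(r"Hello, [abc]{3,8}!?")
```

Each entry in `samples` should be accepted by `CHECK`. A real fuzzer would also need a depth limit for recursive rules, which this toy grammar avoids by not having any.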
- ABNF: alternatives (`foo / bar`), optional symbols (`[foo]`), and repetitions (`n*m(foo)`).
- W3C EBNF: `|` for alternatives, and Kleene operators for optionals and repetitions (`?`, `*`, `+`).
- ISO EBNF: `|` for alternatives, brackets for optional symbols (`[foo]`), and curly braces for repetitions (`{foo}`).

Outside the realm of context-free grammars (CFGs), it looks like parsing expression grammars (PEGs) are also a popular choice for defining grammars of programming languages. E.g. Python switched to a PEG in PEP 617.
There's also Tree-sitter, developed by GitHub, which uses parser combinators written in JavaScript to define grammars, from which it can then generate parsers.
I'm opening this issue to discuss what requirements we have for the formal grammar of MessageFormat and to choose one of the available formats.