unicode-org / message-format-wg

Developing a standard for localizable message strings

Choose BNF syntax for describing the grammar #342

Closed stasm closed 1 year ago

stasm commented 1 year ago

Currently, spec/message.ebnf is written using the so-called W3C EBNF. It's one of a few BNF variations, commonly used in W3C and Unicode (I think?).

One of the nice-to-have reasons for picking it was that it's also supported by REx, an online parser generator, as well as the Railroad Diagram Generator, both by Gunther Rademacher. Having good tool support and being able to immediately test grammar ideas suited the rapid, iterative style of the syntax design process that we went through last year.

That said, I haven't been able to find other tools that support this variant of BNF out of the box. This makes me a bit uncomfortable. Ideally, we would be able to define the grammar in a way that allows us to validate an arbitrary text snippet as MF2 syntax, parse it into a concrete syntax tree (CST), visualize both the rules and the resulting tree, and even generate random strings that match the grammar, for the purpose of fuzz testing. I'm a bit disappointed by the state of the tooling in this regard.
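To make the fuzz-testing requirement concrete, here is a minimal sketch in Python of generating random strings from a grammar. The `GRAMMAR` table and the `generate` function are hypothetical and exist only for illustration: the grammar is a toy, not the MessageFormat 2 grammar, and a real tool would read the rules from a grammar file instead of a hand-written dict.

```python
import random

# Toy grammar for illustration only; this is NOT the MF2 grammar.
# Each nonterminal maps to a list of alternatives; each alternative
# is a sequence of symbols. Symbols without an entry are terminals.
GRAMMAR = {
    "message": [["text"], ["text", "placeholder", "message"]],
    "placeholder": [["{", "name", "}"]],
    "name": [["letter"], ["letter", "name"]],
    "text": [[], ["letter", "text"]],
    "letter": [["a"], ["b"], ["c"]],
}


def generate(symbol, rng, depth=0, max_depth=12):
    """Expand `symbol` into a random string that matches the grammar."""
    if symbol not in GRAMMAR:
        return symbol  # terminal: emit as-is
    alternatives = GRAMMAR[symbol]
    if depth >= max_depth:
        # Past the depth limit, always take the shortest alternative
        # so that recursive rules are forced to terminate.
        alternatives = [min(alternatives, key=len)]
    chosen = rng.choice(alternatives)
    return "".join(generate(s, rng, depth + 1, max_depth) for s in chosen)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are reproducible
    for _ in range(5):
        print(repr(generate("message", rng)))
```

Each generated string could then be fed to a parser under test; a string that the parser rejects would point at a mismatch between the grammar and the implementation.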

Outside the realm of context-free grammars (CFGs), it looks like parsing expression grammars (PEGs) are also a popular choice for defining grammars of programming languages. E.g. Python switched to a PEG in PEP 617.

There's also Tree-sitter, developed by GitHub, which uses parser combinators written in JavaScript to define grammars, from which it can then generate parsers.

I'm opening this issue to discuss what requirements we have for the formal grammar of MessageFormat and to choose one of the available formats.

gibson042 commented 1 year ago

Possibly related: UTS #35 is not consistent about "…BNF"

RFC 5234 + RFC 7405 ABNF or W3C XML Notation are both reasonable choices, although it's worth noting that the latter has no mechanism for bounded repetition such as the current alphanum{3,8} or equivalent RFC ABNF 3*8 alphanum.
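To illustrate the practical cost of that gap, here is a small Python sketch that spells out an ABNF bounded repetition using only sequencing and the `?` operator, which is one way a notation without bounded repetition could express the same thing. The helper name `expand_bounded` is made up for this example and is not part of any tool mentioned in this thread.

```python
def expand_bounded(rule: str, lo: int, hi: int) -> str:
    """Spell out ABNF `lo*hi rule` without bounded repetition.

    The result is `lo` required occurrences followed by `hi - lo`
    optional ones, written with the `?` operator.
    """
    if not (0 <= lo <= hi and hi >= 1):
        raise ValueError("expected 0 <= lo <= hi and hi >= 1")
    required = [rule] * lo
    optional = [f"{rule}?"] * (hi - lo)
    return " ".join(required + optional)


if __name__ == "__main__":
    # ABNF: 3*8 alphanum
    print(expand_bounded("alphanum", 3, 8))
```

For `3*8 alphanum` this prints `alphanum alphanum alphanum alphanum? alphanum? alphanum? alphanum? alphanum?`, which is correct but noticeably harder to read than the ABNF form.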

mihnita commented 1 year ago

"supported by REx, an online parser generator"

Note that it is also available as (Java) source, to run offline: https://www.bottlecaps.de/rex/REx.java

The main page says, under "Command line client": "Use REx.java instead of this form for invoking REx from a command shell."

stasm commented 1 year ago

Unfortunately, that file is just a CLI tool which makes an HTTP request to the remote server. There's no logic in the file itself.

stasm commented 1 year ago

I'm currently leaning towards choosing ABNF. It's very well defined thanks to RFC 5234 and RFC 7405, and it looks like there are multiple tools in Java, Python, C, and JavaScript which claim to follow the RFC, which is promising.

I also found a fuzzing tool for it: https://www.quut.com/abnfgen/.

alerque commented 1 year ago

Just a couple links to throw in because this issue has a good summary of {,A,B}BNF variants and their usages.

stasm commented 1 year ago

This is great, thanks for the pointer. The converter is one-way, to W3C EBNF, and one of the supported input grammars is ABNF. I was just able to test it with my WIP of the ABNF rewrite and it worked well. This means we can use ABNF and, when needed, convert to W3C EBNF and still use REx and the Railroad Diagram Generator.