messageformat / parser

A PEG.js parser for ICU MessageFormat strings
https://messageformat.github.io/
MIT License

Make the parser conform to ICU MessageFormat #3

Closed. nkovacs closed this 7 years ago

nkovacs commented 7 years ago

Apostrophes are now handled correctly, emulating ICU's default DOUBLE_OPTIONAL behavior.

The octothorpe (#) is now handled correctly and can be escaped using apostrophes, but only inside a plural (depending on strictNumberSign).

A single apostrophe only starts quoted literal text if it immediately precedes a curly brace ({ or }), or, if inside a plural, an octothorpe (#). The parser now supports the strictNumberSign option, since that determines whether a quoted octothorpe is parsed as '#' or just #.

Since choice format isn't supported, the pipe symbol never causes an apostrophe to start quoted literal text.
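
To illustrate the apostrophe rules above, here is a minimal sketch in plain JavaScript. It is not the PEG.js grammar from this PR: the function name `unquote` and its `inPlural` flag are hypothetical, it only processes apostrophes in a run of literal text, and it does not model strictNumberSign.

```javascript
// Sketch of DOUBLE_OPTIONAL-style apostrophe handling for literal text.
// `unquote` and `inPlural` are illustrative names, not part of the parser's API.
function unquote(text, inPlural) {
  // Characters that let a single apostrophe start quoted literal text.
  // The pipe is deliberately absent because choice format is not supported.
  const special = inPlural ? '{}#' : '{}';
  let out = '';
  let i = 0;
  while (i < text.length) {
    const ch = text[i];
    if (ch !== "'") {
      out += ch;
      i += 1;
      continue;
    }
    const next = text[i + 1];
    if (next === "'") {
      // A doubled apostrophe is always one literal apostrophe.
      out += "'";
      i += 2;
      continue;
    }
    if (next === undefined || !special.includes(next)) {
      // A lone apostrophe not followed by a special character is literal.
      out += "'";
      i += 1;
      continue;
    }
    // Quoted literal text: copy until the next single apostrophe
    // (or the end of the string if the quote is never closed).
    i += 1;
    while (i < text.length) {
      if (text[i] === "'") {
        if (text[i + 1] === "'") {
          out += "'";
          i += 2;
          continue;
        }
        i += 1;
        break;
      }
      out += text[i];
      i += 1;
    }
  }
  return out;
}
```

With this sketch, `unquote("'{x}'", false)` yields `{x}`, while `unquote("'#'", false)` leaves the apostrophes in place because # is only special inside a plural.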

Parameters to functions may contain whitespace and quoted special characters, but argStyle is still trimmed and split into multiple parameters. A new option, strictFunctionParams, activates ICU-compatible parsing, which parses everything from the second comma to the closing curly brace as a single "argStyleText" parameter.
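
The difference between the two modes can be sketched roughly as follows. This is a simplified illustration, not the grammar itself: `splitFunctionArg` and the `arg`/`key`/`params` field names are assumptions made for the example, and quoted special characters are not handled here.

```javascript
// Sketch: split the inside of "{arg, key, style...}" in default vs strict mode.
// `splitFunctionArg` is a hypothetical helper; quoting is ignored for brevity.
function splitFunctionArg(body, strict) {
  const first = body.indexOf(',');
  if (first === -1) {
    return { arg: body.trim(), key: null, params: [] };
  }
  const arg = body.slice(0, first).trim();
  const second = body.indexOf(',', first + 1);
  if (second === -1) {
    return { arg, key: body.slice(first + 1).trim(), params: [] };
  }
  const key = body.slice(first + 1, second).trim();
  // Everything from the second comma to the closing brace.
  const rest = body.slice(second + 1);
  const params = strict
    ? [rest] // strict: one untouched "argStyleText" parameter
    : rest.split(',').map((p) => p.trim()); // default: trimmed and split
  return { arg, key, params };
}
```

For the body `d, date, yyyy, MMM d`, the default mode in this sketch produces the params `yyyy` and `MMM d`, while strict mode produces the single param ` yyyy, MMM d` with its whitespace preserved.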

Fixes #1, fixes #2

jsf-clabot commented 7 years ago

CLA assistant check
All committers have signed the CLA.

nkovacs commented 7 years ago

I've split the function parameter stuff into https://github.com/messageformat/parser/pull/4. It should be backwards-compatible.

nkovacs commented 7 years ago

\s doesn't work in peg.js. It has to be done manually: https://github.com/nkovacs/icu-messageformat-parser/commit/a0be6aadc0141dcde1735a41df09725edfef5cbd

Also, the identifier syntax in the Java implementation is [^[[:Pattern_Syntax:][:Pattern_White_Space:]]]+, where Pattern_Syntax is listed at http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B:Pattern_Syntax:%5D&abb=on&g= and Pattern_White_Space is listed at http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3APattern_White_Space%3A%5D&abb=on&g=&i=

See http://icu-project.org/apiref/icu4j/com/ibm/icu/text/MessageFormat.html, http://icu-project.org/apiref/icu4j/com/ibm/icu/text/SelectFormat.html and http://icu-project.org/apiref/icu4j/com/ibm/icu/text/PluralFormat.html

The id definition in messageformat-parser is a bit inconsistent. The first character has to be an ASCII alphanumeric character, $ or _, but the rest of the characters can be almost anything.
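
For comparison, the ICU identifier rule quoted above can be approximated in JavaScript with Unicode property escapes. This is only a sketch: `isIcuIdentifier` is a hypothetical helper, it assumes a runtime with ES2018 `\p{...}` support, and it could not be dropped into the PEG.js grammar as-is.

```javascript
// Sketch: one or more characters that are neither Pattern_Syntax nor
// Pattern_White_Space, mirroring the Java-side definition quoted above.
// `isIcuIdentifier` is an illustrative name, not part of messageformat-parser.
const ICU_IDENTIFIER = /^[^\p{Pattern_Syntax}\p{Pattern_White_Space}]+$/u;

function isIcuIdentifier(id) {
  return ICU_IDENTIFIER.test(id);
}
```

Under this rule `name` and `_x` are accepted, while `my name` and `a{b` are rejected because they contain a whitespace or syntax character.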