messageformat / parser

A PEG.js parser for ICU MessageFormat strings
https://messageformat.github.io/
MIT License
7 stars 5 forks source link

Definition of ID #5

Closed nkovacs closed 7 years ago

nkovacs commented 7 years ago

The definition of id requires the first character to be ASCII alphanumeric or $ or _, but then allows almost any character (except \t\n\r,.+={}). This is not consistent with ICU, which allows anything that's not Pattern_Syntax or Pattern_White_Space, and it's also not internally consistent, e.g. there's a testcase that says "shouldn't allow characters in variables that aren't valid JavaScript identifiers", and it tests {☺}, but {a☺} would be allowed.

SlexAxton commented 7 years ago

I think the actual specs didn't exist (or I couldn't find them) when I first wrote the parser. And I think a lot of those choices have just followed us since then. I was mostly just trying to reverse engineer the api via the c++ and java docs.

That said, I'm happy to see any of this getting updated to be in line with the real life spec, all with a major version bump.

nkovacs commented 7 years ago

I'm not sure what to do about whitespace. The definition of \s in javascript is different, it doesn't contain \u0085, but it contains a few more unicode characters.

Pattern_White_Space is [\u0009-\u000d \u0085\u200e\u200f\u2028\u2029], \s is [ \f\n\r\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]

I just did a test in php, and \u3000 is accepted as a variable there (it uses the C port of ICU).

SlexAxton commented 7 years ago

If it's not spec'd, I think if we just pick a set and doc it, it's fine. I'm not terribly concerned about all the edgecase whitespace characters, to be totally honest. Happy to trust your opinion, with slight preference to interop with other langs if you need a tie-breaker.

nkovacs commented 7 years ago

Interop it is then. I wrote a php script that checks all the unicode characters as variable names and exports the results to a json, then I tested it with the parser to make sure the results are the same: https://github.com/nkovacs/icu-messageformat-parser/commit/ed7688d36e490ecfc46e1fd2b79095632a264cd2