wassemgtk / pseudolocalization-tool

Fork: Automatically exported from code.google.com/p/pseudolocalization-tool
Apache License 2.0

Tool mangles more than desired with ICU plural patterns #8

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Suppose you have a string like this:

    duplicatesRemovedFragment={0,plural,one{{0} duplicate removed}other{{0} duplicates removed}}

In version 0.2 it gets mangled to this:

    duplicatesRemovedFragment={0,plural,one{{0} \u202Eduplicate\u202C \u202Eremoved\u202C}\u202Eother\u202C{{0} \u202Eduplicates\u202C \u202Eremoved\u202C}}

Oddly, the "one" keyword remains untouched (suggesting that the tool does 
somehow understand that it's a special keyword) yet the "other" keyword has 
been mangled, so at runtime, you get this error:

    Missing 'other' keyword in plural pattern in "{0,plural,one{{0} du ..."

Original issue reported on code.google.com by trejkaz on 30 Jul 2014 at 4:28

GoogleCodeExporter commented 9 years ago
So is this using the Pseudolocalizer command-line tool?  If so, what arguments 
are you passing?  I assume this is in a .properties file?

Without the details to reproduce it, my guess is that the hack for parsing 
MessageFormat patterns is insufficient; see 
http://code.google.com/p/pseudolocalization-tool/source/browse/trunk/java/com/google/i18n/pseudolocalization/format/JavaProperties.java#142

Original comment by jat@jaet.org on 7 Aug 2014 at 2:29

GoogleCodeExporter commented 9 years ago
Yeah, we're using .properties files, and yeah, that regex looks like it would 
stop at the first }, which explains why it modified the next word after it. It 
should probably permit matched pairs of {}, but then there is the other issue 
that the text within the innermost {} still needs to be mangled, which seems 
like it could get rather complex.
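
To illustrate the "matched pairs of {}" idea, here is a minimal sketch that finds the end of a placeholder by counting brace depth instead of stopping at the first }. The class and method names (`BraceScanner`, `findMatchingBrace`, `topLevelPlaceholders`) are hypothetical and not part of pseudolocalization-tool, and the sketch assumes the pattern contains no apostrophe-quoted braces, which a real MessageFormat parser would have to handle.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of pseudolocalization-tool. */
final class BraceScanner {

    /**
     * Returns the index of the '}' that closes the '{' at openIndex,
     * or -1 if the braces are unbalanced.
     * Assumption: no MessageFormat apostrophe quoting is present.
     */
    static int findMatchingBrace(String pattern, int openIndex) {
        int depth = 0;
        for (int i = openIndex; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth == 0) {
                    return i;
                }
            }
        }
        return -1;
    }

    /** Collects each top-level placeholder, including its nested braces. */
    static List<String> topLevelPlaceholders(String pattern) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < pattern.length()) {
            if (pattern.charAt(i) == '{') {
                int close = findMatchingBrace(pattern, i);
                if (close < 0) {
                    break; // unbalanced: stop rather than guess
                }
                result.add(pattern.substring(i, close + 1));
                i = close + 1;
            } else {
                i++;
            }
        }
        return result;
    }
}
```

With the string from the original report, the whole `{0,plural,...}` expression comes back as a single placeholder, so the "other" keyword is no longer exposed as plain text. That alone does not make the nested text localizable, which is the second issue mentioned above.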

I ended up making my own tool anyway, due to this and other issues, and ended 
up using ANTLR to parse the formats, because there were a whole host of weird 
edge cases which I found hard to do with regexes.

Original comment by trejkaz on 7 Aug 2014 at 10:36

GoogleCodeExporter commented 9 years ago
It's sad that it is easier to write your own tool instead of patching this one.

This example is a bit tricky, because even with proper parsing, normally 
nothing in a placeholder would be localizable at all.  However, in the 
case of ICU4J plural/choice formats, localizable text occurs within the 
placeholder itself.  So, you really have to know that this is an ICU4J 
plural/choice format in order to make that text localizable.  I suppose we can 
allow for generic MessageFormat-type messages and, if a placeholder appears to 
match plural/choice, treat it specially: generate a VariantFragment tree 
instead of a Placeholder.
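
As a rough sketch of the "treat it specially" step, the code below checks whether a placeholder looks like an ICU-style plural and, if so, splits it into keyword and sub-message pairs. The sub-messages are the localizable text and the keywords must stay untouched. `PluralVariants` and `parse` are hypothetical names; this does not use the tool's actual VariantFragment or Placeholder classes, and it assumes no apostrophe quoting and no `offset:` clause. Choice formats use a different syntax and are not covered.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; not the tool's VariantFragment implementation. */
final class PluralVariants {

    /**
     * Returns keyword -> sub-message for a placeholder shaped like
     * {arg,plural,kw{text}kw{text}}, or null if it is not a plural.
     * Assumptions: no apostrophe quoting, no "offset:" clause.
     */
    static Map<String, String> parse(String placeholder) {
        if (placeholder.length() < 2
                || placeholder.charAt(0) != '{'
                || !placeholder.endsWith("}")) {
            return null;
        }
        String body = placeholder.substring(1, placeholder.length() - 1);
        // Split into argument name, format type, and the variant list.
        String[] parts = body.split(",", 3);
        if (parts.length < 3 || !parts[1].trim().equals("plural")) {
            return null;
        }
        String variants = parts[2];
        Map<String, String> result = new LinkedHashMap<>();
        int i = 0;
        while (i < variants.length()) {
            int open = variants.indexOf('{', i);
            if (open < 0) {
                break; // only trailing text remains
            }
            String keyword = variants.substring(i, open).trim();
            // Find the brace that closes this variant, honoring nesting.
            int depth = 0;
            int close = -1;
            for (int j = open; j < variants.length(); j++) {
                char c = variants.charAt(j);
                if (c == '{') {
                    depth++;
                } else if (c == '}' && --depth == 0) {
                    close = j;
                    break;
                }
            }
            if (close < 0) {
                return null; // unbalanced braces
            }
            result.put(keyword, variants.substring(open + 1, close));
            i = close + 1;
        }
        return result;
    }
}
```

A pseudolocalizer built on this would mangle only the map values (while still protecting inner placeholders such as `{0}`) and then reassemble the pattern with the original keywords, so "other" survives intact.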

Original comment by jat@jaet.org on 7 Aug 2014 at 3:28