globalizejs / globalize

A JavaScript library for internationalization and localization that leverages the official Unicode CLDR JSON data
https://globalizejs.com
MIT License

Non-breaking space and breaking space #288

Closed: pergardebrink closed this 10 years ago

pergardebrink commented 10 years ago

I'm new to both version 0.1.1 and 1.0.0-alpha and have never used jQuery Globalize before, so I might have misunderstood something, but I have a small issue with version 1.0.0-alpha5 (as I plan to move to it from 0.1.1 when it's stable).

In the previous version (0.1.1), if I run the following code:

Globalize.culture('sv-SE'); // globalize.culture.sv-SE.js is loaded
var number = Globalize.parseFloat("123 456,78"); // Gives 123456.78 as expected
var number2 = Globalize.parseFloat("123" + String.fromCharCode(160) + "456,78"); // Gives 123456.78 also as expected

But if I run the following code in 1.0.0-alpha5:

Globalize.locale('sv'); // numbers module, cldr loaded and main/sv/numbers.json is also loaded
var number = Globalize.parseNumber("123 456,78"); // NaN
var number2 = Globalize.parseNumber("123" + String.fromCharCode(160) + "456,78"); // Gives 123456.78

Since an end user probably (definitely) won't type the space as a non-breaking space, any conversion will fail. If I change the group property in numbers.json to a breaking space instead, the parse will of course work, but then values provided from the server will fail, since I use .NET to format my numbers with the Swedish culture (C#):

var number = 123456.78;
var culture = CultureInfo.CreateSpecificCulture("sv-SE");
number.ToString("N", culture); // Gives 123 456,78 with a non breaking space as a thousand separator
rxaviers commented 10 years ago

Hi @pergardebrink, thanks for your clear description.

As you have pointed out, Globalize deduces the grouping separator symbol from the CLDR content. Therefore, all it "knows" comes from that data set. If the sv grouping separator is defined as character 160 (non-breaking space), that's what it's going to use.

In Globalize, we make sure this will always be true:

var sv = Globalize("sv");
sv.parseNumber(sv.formatNumber(123456.78)) === 123456.78; // true

We don't have any specific rules/conditions in the parser code like "if the grouping separator is 160, also try 32". As of now, I think the current behavior is correct.

@scottgonzalez, @jzaefferer, @srl295 any ideas?

Anyway, if you want to allow users to input 32 (breaking space) as an alternative grouping separator, which I agree makes sense in your case, something like this could be used:

sanitizedInput = "123 456,78".replace( /\x20/g, "\xa0" ); // \x20 is hex for 32 (space), \xa0 is hex for 160 (no-break space); the /g flag replaces every occurrence, not just the first
sv.parseNumber( sanitizedInput );

TR35 defines this:

For the sign, decimal separator, percent, and per mille, use a set of
all possible characters that can serve those functions. For example, the
decimal separator set could include all of [.,']. (The actual set of
characters can be derived from the number symbols in the By-Type charts
[ByType], which list all of the values in CLDR.) To disambiguate, the
decimal separator for the locale must be removed from the "ignore" set,
and the grouping separator for the locale must be removed from the
decimal separator set. The same principle applies to all sets and
symbols: any symbol must appear in at most one set.

Although we don't fully implement these heuristics in Globalize (it doesn't parse the number string using all possible grouping separators, only the locale's one), note that even implementing that would not solve your problem, because no language defines 32 (breaking space) as its grouping separator.
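
To make that concrete, here is a minimal sketch of the set idea, with a hypothetical hardcoded set for sv (a real implementation would derive the sets from the CLDR By-Type data). Note how a plain space still falls through:

// Sketch only, not Globalize API: strip any character from a TR35-style set of
// grouping separators seen across CLDR locales, then map the sv decimal comma
// to a dot so the result is machine-parseable.
function setBasedParseSv( value ) {
    var groupingSet = /[.\u00a0\u202f']/g; // dot, no-break space, narrow no-break space, apostrophe; minus sv's decimal comma
    var decimalSet = /,/; // reduced to sv's comma so each symbol is in at most one set
    return parseFloat( value.replace( groupingSet, "" ).replace( decimalSet, "." ) );
}
setBasedParseSv( "123\u00a0456,78" ); // 123456.78
setBasedParseSv( "123 456,78" ); // 123: U+0020 is in no locale's grouping set, so parseFloat stops at it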

pergardebrink commented 10 years ago

Yes, I'll probably have to use some sort of sanitization as you suggest. My application will use the culture that the user specifies (from a list of all .NET-supported cultures), so there are probably cultures other than Swedish that specify a non-breaking space as the grouping character.

(The reason I opened this issue is that version 0.1.1 did allow me to use both non-breaking and breaking space, and I was curious whether that was a bug or not.)

Thanks for the quick reply!

rxaviers commented 10 years ago

Let's wait for input from the people cc'ed above. I'm open to suggestions, but I very much dislike including if/else specifics/exceptions with hardcoded content.


pergardebrink commented 10 years ago

I've been thinking and reading since yesterday, and I think Globalize really should accept both non-breaking and breaking space, even if CLDR specifies a non-breaking space as the grouping character.

I think that most developers not familiar with cultures that use a space as a grouping separator probably won't know this until they are hit by the first bug report from a Swedish or French end user (or any other locale that uses it).

I've found some info on unicode.org suggesting a more "lenient parsing": if the grouping character is a non-breaking space, all whitespace characters should match (see the sketch after the quote below). http://unicode.org/reports/tr35/#Loose_Matching

  • Normalize to NFKC; thus no-break space will map to space; half-width katakana will map to full-width.
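
If Globalize adopted that, the effect would be roughly as in this sketch: normalize both the input and the CLDR symbol to NFKC before matching, since NFKC maps U+00A0 (no-break space) to U+0020 (space). This assumes ES6 String.prototype.normalize and is not what Globalize currently does:

var input = "123 456,78"; // typed with a plain space, U+0020
var groupSymbol = "\u00a0"; // the sv grouping separator from CLDR, U+00A0
input.indexOf( groupSymbol ); // -1: strict matching fails
input.normalize( "NFKC" ).indexOf( groupSymbol.normalize( "NFKC" ) ); // 3: after NFKC both sides use U+0020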
rxaviers commented 10 years ago

Excellent. So, let's do it.

rxaviers commented 10 years ago

The documentation led me to the questions below. I have sent them to the CLDR mailing list and will update here as I get replies.

If anyone knows the answers, please just let me know.

7.2 Loose Matching

Loose matching ignores attributes of the strings being compared that are not important to matching. It involves the following steps:

  • Remove "." from currency symbols and other fields used for matching, and also from the input string unless:
    • "." is in the decimal set, and
    • its position in the input string is immediately before a decimal digit
  • Ignore all format characters: in particular, ignore the RLM and LRM used to control BIDI formatting.

Where do I find a list of all format characters?

  • Ignore all characters in [:Zs:] unless they occur between letters. (In the heuristics below, even those between letters are ignored except to delimit fields)

Where do I find a list of all [:Zs:] characters?

  • Map all characters in [:Dash:] to U+002D HYPHEN-MINUS

Where do I find a list of all [:Dash:] characters?
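
(Aside: all three lists asked about above come from the Unicode Character Database (UAX #44): format characters are general category Cf, [:Zs:] is the space-separator category, and [:Dash:] is a binary property. In today's JavaScript, ES2018 Unicode property escapes expose them directly, though they did not exist when this thread was written:)

/\p{Cf}/u.test( "\u200e" ); // true: LRM, a format character
/\p{Zs}/u.test( "\u00a0" ); // true: no-break space is a space separator
/\p{Dash}/u.test( "\u2013" ); // true: en dash has the Dash property
"3\u20134".replace( /\p{Dash}/gu, "\u002d" ); // "3-4": the [:Dash:] to HYPHEN-MINUS mapping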

  • Use the data in the <characterFallback> element to map equivalent characters (for example, curly to straight apostrophes). Other apostrophe-like characters should also be treated as equivalent, especially if the character actually used in a format may be unavailable on some keyboards. For example:
    • U+02BB MODIFIER LETTER TURNED COMMA (ʻ) might be typed instead as U+2018 LEFT SINGLE QUOTATION MARK (‘).
    • U+02BC MODIFIER LETTER APOSTROPHE (ʼ) might be typed instead as U+2019 RIGHT SINGLE QUOTATION MARK (’), U+0027 APOSTROPHE, etc.
    • U+05F3 HEBREW PUNCTUATION GERESH (‎׳) might be typed instead as U+0027 APOSTROPHE.

Except for the U+05F3 example, the other two cannot be found in http://www.unicode.org/repos/cldr-aux/json/25/supplemental/characterFallbacks.json. Are those two the "other apostrophe-like characters"? Where do I find a complete list of the apostrophe-like characters? Do the mappings follow an algorithm, an algebraic formula, or a lookup table?

On http://unicode.org/reports/tr35/tr35-info.html#Supplemental_Character_Fallback_Data, there's:

There is more than one possible fallback: the recommended usage is that when a character value is not in the desired repertoire the following process is used, whereby the first value that is wholly in the desired repertoire is used.

  • toNFC(value)
  • other canonically equivalent sequences, if there are any
  • the explicit substitutes value (in order)
  • toNFKC(value)

Does this mean that when the character being looked up is not found, the above process should be followed? Where do I find the definitions of toNFC(), toNFKC(), canonical equivalence, and explicit substitutes?
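
(For what it's worth: toNFC()/toNFKC() are the UAX #15 normalization forms, i.e. ES6 String.prototype.normalize("NFC")/normalize("NFKC"); "canonically equivalent" means sequences that NFC/NFD fold together; and the "explicit substitutes" are presumably the entries in the characterFallbacks.json cited above. A quick illustration:)

"e\u0301".normalize( "NFC" ) === "\u00e9"; // true: "e" + combining acute is canonically equivalent to "é"
"\u212b".normalize( "NFC" ); // "\u00c5": the Angstrom sign is canonically equivalent to Å
"\u2460".normalize( "NFKC" ); // "1": circled digit one, a compatibility (NFKC-only) mapping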

  • Apply mappings particular to the domain (i.e., for dates or for numbers, discussed in more detail below)

Where?

  • Apply case folding (possibly including language-specific mappings such as Turkish i)

Where do I find more information about it?

  • Normalize to NFKC; thus no-break space will map to space; half-width katakana will map to full-width.

Are those two mappings (no-break space and half-width katakana) all there is to it, or are there other NFKC normalizations that should be done? Where do I find a complete list of what should be done? Do the mappings follow an algorithm, an algebraic formula, or a lookup table?
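
(Those two are just examples; NFKC is a general, data-driven normalization whose mappings come from the decomposition tables in the Unicode Character Database, not from an ad hoc list. For instance:)

"\u00a0".normalize( "NFKC" ); // " ": no-break space maps to space
"ｶﾀｶﾅ".normalize( "NFKC" ); // "カタカナ": half-width katakana maps to full-width
"\uff11\uff12\uff13".normalize( "NFKC" ); // "123": full-width digits map to ASCII digits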

Loose matching involves (logically) applying the above transform to both the input text and to each of the field elements used in matching, before applying the specific heuristics below. For example, if the input number text is " - NA f. 1,000.00", then it is mapped to "-naf1,000.00" before processing. The currency signs are also transformed, so "NA f." is converted to "naf" for purposes of matching. As with other Unicode algorithms, this is a logical statement of the process; actual implementations can optimize, such as by applying the transform incrementally during matching.

"NA f." is the currency symbol for ANG (Netherlands Antillean guilder, aka Netherlands Antilles Florin according to wikipedia). nl-CW and nl-SX defines ANG symbol as NAf.. All other locales define it as ANG.

Following the above recommendation (mapping "NA f." to "naf"), how is an implementation supposed to know that "naf" is ANG? Where do I find a mapping between "naf" and ANG?

arschmitz commented 10 years ago

@rxaviers Ping me about this. I have some experience with NFC and NFKC, as well as JS implementations of them.

rxaviers commented 10 years ago

For the record, @arschmitz has worked with Unicode normalization in his arschmitz/jquery-pr project, where he used walling/unorm/.../unorm.js for NFC and the other normalizations.

rxaviers commented 10 years ago

Also, I have received answers from CLDR mailing list: https://gist.github.com/rxaviers/76762da0ea8d3335f263

rxaviers commented 10 years ago

ES6 String.prototype.normalize seems to be the way to go (for NFC and NFKC), using the unorm.js shim as a polyfill in the meantime.

// Comparing 160 (no-break space, "\u00a0") with 32 (space, "\u0020"):
"\u00a0" === "\u0020"; // false
"\u00a0".normalize( "NFKC" ) === "\u0020".normalize( "NFKC" ); // true

The problem is that unorm.js currently embeds the normalization lookup data, making it 36.6KB (minified+gzipped), which is 10x bigger than Globalize and its number module together. While that's not a problem for backend applications, it may be way too much for the frontend. Out of curiosity, stripping the embedded data out of unorm.js makes it 2.0KB (minified+gzipped).
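
A minimal feature-detection sketch, assuming a CommonJS environment and the walling/unorm package, so the heavy shim is only loaded where the native method is missing:

// Prefer the native ES6 method; fall back to the unorm shim otherwise.
var nfkc;
if ( typeof String.prototype.normalize === "function" ) {
    nfkc = function( s ) { return s.normalize( "NFKC" ); };
} else {
    var unorm = require( "unorm" );
    nfkc = function( s ) { return unorm.nfkc( s ); };
}
nfkc( "\u00a0" ) === " "; // true either way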

arschmitz commented 10 years ago

@rxaviers Ah cool, ES6 String.prototype.normalize has actually landed in Chrome and Firefox now; it had not yet when I wrote arschmitz/jquery-pr. That means I can actually remove unorm.js now, since jquery-pr is a Chrome extension :)

rxaviers commented 10 years ago

Closed in favor of the broader-scoped #292 (Loose Matching).