globalizejs / globalize

A JavaScript library for internationalization and localization that leverages the official Unicode CLDR JSON data
https://globalizejs.com
MIT License

Bug: Globalize number formatter is incorrect for numeric digits in supplemental plane #922

Open greghuc opened 3 years ago

greghuc commented 3 years ago

Hi there

globalize (v1.7.0) number formatting is incorrect for cldr-data (v36.0.0) when the CLDR numeric digits come from the Unicode supplementary planes (U+10000 to U+10FFFF), which UTF-16 encodes as surrogate pairs.

Short example, discussed below: 44.56 formatted in ccp locale

Based on the formatted value returned by globalize, I initially suspected that individual characters are represented in globalize as surrogate pairs (two 16-bit code units each), but that only the first code unit of each pair is returned. There's a worked example below, but I now have some doubts about this theory: of the 4 numeric digits involved, 3 of the values returned by globalize look like the first half of a surrogate pair, but one doesn't.

Example (no code)

For the "ccp" locale, the digits 0-9 are "𑄶𑄷𑄸𑄹𑄺𑄻𑄼𑄽𑄾𑄿", which have Unicode hex code points of ["11136", "11137", "11138", "11139", "1113a", "1113b", "1113c", "1113d", "1113e", "1113f"].

So the number 44.56 formatted in ccp should be "𑄺𑄺.𑄻𑄼" = ["1113a", "1113a", "2e", "1113b", "1113c"]

What is actually returned from globalize is "��.��" = [ 'd804', 'd804', '2e', 'dd38', 'd804' ]

Using the Surrogate Pair Calculator, the individual characters in "𑄺𑄺.𑄻𑄼" = ["1113a", "1113a", "2e", "1113b", "1113c"] map to the UTF-16 code units [ 'd804 dd3a', 'd804 dd3a', '2e', 'd804 dd3b', 'd804 dd3c' ].

So maybe globalize is returning the first code unit from each surrogate pair? But for 1113b, dd38 is returned, not d804.
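
For reference, the surrogate pair arithmetic can be reproduced without an online calculator. The following is a minimal sketch of the standard UTF-16 encoding rule; the helper name `toUtf16Units` is made up for illustration and is not part of globalize.

```javascript
// Split a Unicode code point into its UTF-16 code units (lowercase hex strings).
// Hypothetical helper for illustration only; not a globalize API.
function toUtf16Units(codePoint) {
  if (codePoint < 0x10000) {
    // Basic Multilingual Plane: a single code unit
    return [codePoint.toString(16)];
  }
  var offset = codePoint - 0x10000;
  var high = 0xd800 + (offset >> 10);  // top 10 bits -> high (lead) surrogate
  var low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits -> low (trail) surrogate
  return [high.toString(16), low.toString(16)];
}

console.log(toUtf16Units(0x1113a)); // [ 'd804', 'dd3a' ]
console.log(toUtf16Units(0x1113b)); // [ 'd804', 'dd3b' ]
console.log(toUtf16Units(0x11138)); // [ 'd804', 'dd38' ]
```

Note that dd38 is the low surrogate of 11138 (the ccp digit 2), not of 1113b, so the unexpected value is a trailing half of a different digit rather than a leading half.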

Example (code)

// Assumes Globalize and the required CLDR data are already loaded.

// Output hex code points for the characters of a JavaScript string
var asUnicodePoints = function(value) {
  return Array.from(value).map(function(character) {
    return character.codePointAt(0).toString(16);
  });
};

// For us locale, works fine
var result = Globalize('us').numberFormatter()(44.56);
console.log(result);
// => 44.56
console.log(asUnicodePoints(result));
// => [ '34', '34', '2e', '35', '36' ]

// For ccp locale, wrongly returns first hex value from each surrogate pair?
var result = Globalize('ccp').numberFormatter()(44.56);
console.log(result);
// => ��.��
console.log(asUnicodePoints(result));
// => [ 'd804', 'd804', '2e', 'dd38', 'd804' ]

// For ccp locale, the true hex values for formatted 44.56 should be..
console.log(asUnicodePoints("𑄺𑄺.𑄻𑄼"));
// => [ '1113a', '1113a', '2e', '1113b', '1113c' ]

rxaviers commented 3 years ago

Thanks for filing the issue and for your detailed debugging. I'm open to accepting a fix. Thanks!

greghuc commented 3 years ago

@rxaviers I'll see what I can do. Any guidance on roughly where in the code I should be looking?

rxaviers commented 3 years ago

Awesome. Numbering system digits are set at https://github.com/globalizejs/globalize/blob/master/src/number/numbering-system-digits-map.js, stored as formatter properties at https://github.com/globalizejs/globalize/blob/master/src/number/format-properties.js#L63, then used at https://github.com/globalizejs/globalize/blob/master/src/number/format.js#L96. Their respective unit tests can be found at https://github.com/globalizejs/globalize/blob/master/test/unit/number/format-properties.js and https://github.com/globalizejs/globalize/blob/master/test/unit/number/format.js.
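
One hypothesis that matches the reported output exactly is that the digit string is indexed by UTF-16 code unit rather than by code point. The sketch below is not taken from the globalize source and has not been checked against the files linked above; `replaceByCodeUnit` and `replaceByCodePoint` are hypothetical helpers that only illustrate the difference between the two ways of indexing.

```javascript
// Build the ccp digits 0-9 (U+11136..U+1113F) as one string, as in the issue.
var ccpDigits = "";
for (var cp = 0x11136; cp <= 0x1113f; cp++) {
  ccpDigits += String.fromCodePoint(cp);
}

// Hex code points of a string (lone surrogates are reported as-is).
function asHex(value) {
  return Array.from(value).map(function(character) {
    return character.codePointAt(0).toString(16);
  });
}

// Hypothetical: index the digit string by UTF-16 code unit.
// digits[4] is then the high surrogate of the digit 2, not the digit 4.
function replaceByCodeUnit(formatted, digits) {
  return formatted.replace(/[0-9]/g, function(digit) {
    return digits[+digit];
  });
}

// Hypothetical: split the digit string by code point first.
function replaceByCodePoint(formatted, digits) {
  var digitList = Array.from(digits); // one entry per code point
  return formatted.replace(/[0-9]/g, function(digit) {
    return digitList[+digit];
  });
}

console.log(asHex(replaceByCodeUnit("44.56", ccpDigits)));
// [ 'd804', 'd804', '2e', 'dd38', 'd804' ]
console.log(asHex(replaceByCodePoint("44.56", ccpDigits)));
// [ '1113a', '1113a', '2e', '1113b', '1113c' ]
```

The first call reproduces the values reported in this issue, including the odd dd38, which is the code unit at index 5 of the digit string. If the real cause is similar, splitting the digits by code point (for example with `Array.from`) before indexing would be one possible direction for a fix.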

greghuc commented 3 years ago

OK, this issue isn't going to be my highest priority, though I will hopefully get round to it at some point. I believe the issue only affects 4 locales, all related to the base ccp locale: ccp, ccp-u-nu-native, ccp-IN and ccp-IN-u-nu-native.