Closed ChristianGruen closed 8 months ago
Yes, the current mechanism is very clumsy. I think the original intent in XSLT 1.0 was probably to define presentation "at arm's length" so that the logic didn't need to change if the output format changed, but that can be achieved perfectly well by putting the options in a global variable.
I think that language-specific default settings would be sensible:
format-number(123.45, '#.##0,00', 'de')
As known from the other functions for formatting numbers and dates, it could be up to the implementation to decide which languages are supported. The defaults could be overwritten by custom decimal-format declarations in the prolog to ensure that a setting is applied, even if an implementation does not support it.
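The lookup order described here (user declarations win, implementation-provided language defaults are the fallback) can be sketched with the JDK's `java.text` classes. This is only an illustration under stated assumptions: the class name, the `SUPPORTED` set and the `declared` map are hypothetical, and Java pattern syntax is similar to, but not the same as, XPath picture strings. `applyLocalizedPattern` is used because, like a picture string, it is written with the format's own separator characters.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class DecimalFormatLookup {
    // Hypothetical: languages for which this "implementation" ships defaults.
    static final Set<String> SUPPORTED = Set.of("de", "en", "fr");

    // Stand-in for decimal-format declarations made by the user in the prolog.
    final Map<String, DecimalFormatSymbols> declared = new HashMap<>();

    // User declarations take precedence over built-in language defaults.
    DecimalFormatSymbols resolve(String name) {
        DecimalFormatSymbols custom = declared.get(name);
        if (custom != null) {
            return custom;
        }
        if (SUPPORTED.contains(name)) {
            return DecimalFormatSymbols.getInstance(Locale.forLanguageTag(name));
        }
        throw new IllegalArgumentException("Unknown decimal format: " + name);
    }

    String formatNumber(double value, String picture, String name) {
        DecimalFormat format = new DecimalFormat();
        format.setDecimalFormatSymbols(resolve(name));
        // The pattern is read using the separators of the resolved symbols.
        format.applyLocalizedPattern(picture);
        return format.format(value);
    }

    public static void main(String... args) {
        DecimalFormatLookup lookup = new DecimalFormatLookup();
        // Built-in German default: prints 1.234,50
        System.out.println(lookup.formatNumber(1234.5, "#.##0,00", "de"));

        // A language the implementation does not support, declared by the user.
        DecimalFormatSymbols swiss = new DecimalFormatSymbols(Locale.ROOT);
        swiss.setGroupingSeparator('\u2019');
        swiss.setDecimalSeparator('.');
        lookup.declared.put("de-CH", swiss);
        System.out.println(lookup.formatNumber(1234.5, "#\u2019##0.00", "de-CH"));
    }
}
```

With this ordering, a query that declares its own format behaves the same on every processor, while a query that relies on a built-in default only works where that language is supported.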
I'm not convinced this would give good interoperability. Consider Arabic for example: should it default to using western or eastern decimal digits? Both are in widespread use, and the idea that everyone with a particular (country, language) combination uses the same conventions is fundamentally misguided. This doesn't matter too much if it merely affects the format of the output, but it does matter if it makes a picture string valid in one implementation and invalid in another.
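To make the Arabic example concrete, the following sketch formats the same number with western and with Arabic-Indic digits by setting the digit family explicitly rather than deriving it from a language code. It uses the JDK's `DecimalFormatSymbols`; the class and method names are hypothetical and chosen only for illustration.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

final class ArabicDigits {
    // Formats an integer with an explicitly chosen zero digit and grouping separator.
    static String format(long value, char zeroDigit, char groupingSeparator) {
        DecimalFormatSymbols symbols = new DecimalFormatSymbols(Locale.ROOT);
        symbols.setZeroDigit(zeroDigit);
        symbols.setGroupingSeparator(groupingSeparator);
        return new DecimalFormat("#,##0", symbols).format(value);
    }

    public static void main(String... args) {
        // Western digits with a comma: 1,234,567
        System.out.println(format(1234567, '0', ','));
        // Arabic-Indic digits (U+0660..U+0669) with the Arabic thousands separator (U+066C)
        System.out.println(format(1234567, '\u0660', '\u066C'));
    }
}
```

Both outputs are plausible for an Arabic-speaking audience, which is why a language code alone does not determine the result.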
I agree: some cases are easy to handle, and others are more sophisticated. I think the same is true for formatting integers and dates: The rules are rich and sophisticated, but for more advanced use cases (such as spelling out correct hiragana for numbers with Japanese counter words, or considering declension of numerals in Russian), you’ll be lost without writing custom code.
With ICU and Java, it’s fairly straightforward to choose language-specific formatting rules. I haven’t checked if there are flags to e.g. control formatting for Arabic numbers, and it could be that ICU has really taken the wrong path. From a German perspective, though, it’s restrictive that an implementation cannot provide sane defaults for non-English users.
This is how ICU formats integers with different locales:
Result | Locales |
---|---|
1,234,567 | ak, ak_GH, am, am_ET, ar_AE, ar_EH, asa, asa_TZ, bem, bem_ZM, bez, bez_TZ, bm, bm_ML, bo, bo_CN, bo_IN, ce, ce_RU, ceb, ceb_PH, cgg, cgg_UG, chr, chr_US, cy, cy_GB, dav, dav_KE, doi, doi_IN, ebu, ebu_KE, ee, ee_GH, ee_TG, en, en_001, en_150, en_AE, en_AG, en_AI, en_AS, en_AU, en_BB, en_BI, en_BM, en_BS, en_BW, en_BZ, en_CA, en_CC, en_CK, en_CM, en_CX, en_CY, en_DG, en_DM, en_ER, en_FJ, en_FK, en_FM, en_GB, en_GD, en_GG, en_GH, en_GI, en_GM, en_GU, en_GY, en_HK, en_IE, en_IL, en_IM, en_IO, en_JE, en_JM, en_KE, en_KI, en_KN, en_KY, en_LC, en_LR, en_LS, en_MG, en_MH, en_MO, en_MP, en_MS, en_MT, en_MU, en_MV, en_MW, en_MY, en_NA, en_NF, en_NG, en_NR, en_NU, en_NZ, en_PG, en_PH, en_PK, en_PN, en_PR, en_PW, en_RW, en_SB, en_SC, en_SD, en_SG, en_SH, en_SL, en_SS, en_SX, en_SZ, en_TC, en_TK, en_TO, en_TT, en_TV, en_TZ, en_UG, en_UM, en_US, en_VC, en_VG, en_VI, en_VU, en_WS, en_ZM, en_ZW, es_419, es_BR, es_BZ, es_CU, es_DO, es_GT, es_HN, es_MX, es_NI, es_PA, es_PE, es_PR, es_SV, es_US, fil, fil_PH, ga, ga_GB, ga_IE, gd, gd_GB, guz, guz_KE, gv, gv_IM, ha, ha_GH, ha_NE, ha_NG, haw, haw_US, he, he_IL, ig, ig_NG, ii, ii_CN, ja, ja_JP, jmc, jmc_TZ, kam, kam_KE, kde, kde_TZ, ki, ki_KE, kln, kln_KE, kn, kn_IN, ko, ko_KP, ko_KR, kok, kok_IN, ks_Deva, ks_Deva_IN, ksb, ksb_TZ, kw, kw_GB, lag, lag_TZ, lg, lg_UG, lkt, lkt_US, luo, luo_KE, luy, luy_KE, mai, mai_IN, mas, mas_KE, mas_TZ, mer, mer_KE, mg, mg_MG, mgo, mgo_CM, mi, mi_NZ, mn, mn_MN, ms, ms_MY, ms_SG, mt, mt_MT, naq, naq_NA, nd, nd_ZW, nus, nus_SS, nyn, nyn_UG, om, om_ET, om_KE, pcm, pcm_NG, qu, qu_EC, qu_PE, rof, rof_TZ, rwk, rwk_TZ, saq, saq_KE, sbp, sbp_TZ, sd_Deva, sd_Deva_IN, si, si_LK, sn, sn_ZW, so, so_DJ, so_ET, so_KE, so_SO, sw, sw_KE, sw_TZ, sw_UG, ta_MY, ta_SG, teo, teo_KE, teo_UG, th, th_TH, ti, ti_ER, ti_ET, to, to_TO, ug, ug_CN, ur, ur_PK, vai, vai_Latn, vai_Latn_LR, vai_Vaii, vai_Vaii_LR, vun, vun_TZ, xog, xog_UG, yi, yi_001, yo, yo_BJ, yo_NG, yue, yue_Hans, yue_Hans_CN, yue_Hant, yue_Hant_HK, zh, zh_Hans, zh_Hans_CN, zh_Hans_HK, zh_Hans_MO, zh_Hans_SG, zh_Hant, zh_Hant_HK, zh_Hant_MO, zh_Hant_TW, zu, zu_ZA |
1.234.567 | ar_DZ, ar_LY, ar_MA, ar_TN, ast, ast_ES, az, az_Cyrl, az_Cyrl_AZ, az_Latn, az_Latn_AZ, bs, bs_Cyrl, bs_Cyrl_BA, bs_Latn, bs_Latn_BA, ca, ca_AD, ca_ES, ca_FR, ca_IT, da, da_DK, da_GL, de, de_BE, de_DE, de_IT, de_LU, dsb, dsb_DE, el, el_CY, el_GR, en_AT, en_BE, en_DE, en_DK, en_NL, en_SI, es, es_AR, es_BO, es_CL, es_CO, es_EA, es_EC, es_ES, es_GQ, es_IC, es_PH, es_PY, es_UY, es_VE, eu, eu_ES, fo, fo_DK, fo_FO, fr_LU, fr_MA, fur, fur_IT, fy, fy_NL, gl, gl_ES, hr, hr_BA, hr_HR, hsb, hsb_DE, ia, ia_001, id, id_ID, is, is_IS, it, it_IT, it_SM, it_VA, jgo, jgo_CM, jv, jv_ID, kgp, kgp_BR, kkj, kkj_CM, kl, kl_GL, km, km_KH, ku, ku_TR, lb, lb_LU, ln, ln_AO, ln_CD, ln_CF, ln_CG, lo, lo_LA, lu, lu_CD, mgh, mgh_MZ, mk, mk_MK, ms_BN, ms_ID, mua, mua_CM, nl, nl_AW, nl_BE, nl_BQ, nl_CW, nl_NL, nl_SR, nl_SX, nnh, nnh_CM, pt, pt_BR, qu_BO, rn, rn_BI, ro, ro_MD, ro_RO, rw, rw_RW, sc, sc_IT, seh, seh_MZ, sg, sg_CF, sl, sl_SI, sr, sr_Cyrl, sr_Cyrl_BA, sr_Cyrl_ME, sr_Cyrl_RS, sr_Cyrl_XK, sr_Latn, sr_Latn_BA, sr_Latn_ME, sr_Latn_RS, sr_Latn_XK, su, su_Latn, su_Latn_ID, sw_CD, tr, tr_CY, tr_TR, vi, vi_VN, wo, wo_SN, yrl, yrl_BR, yrl_CO, yrl_VE |
12,34,567 | brx, brx_IN, en_IN, gu, gu_IN, hi, hi_IN, hi_Latn, hi_Latn_IN, ml, ml_IN, or, or_IN, pa, pa_Guru, pa_Guru_IN, ta, ta_IN, ta_LK, te, te_IN |
1234567 | en_US_POSIX |
1 234 567 | af, af_NA, af_ZA, agq, agq_CM, bas, bas_CM, be, be_BY, bg, bg_BG, br, br_FR, cs, cs_CZ, cv, cv_RU, de_AT, dje, dje_NE, dua, dua_CM, dyo, dyo_SN, en_FI, en_SE, en_ZA, eo, eo_001, es_CR, et, et_EE, ewo, ewo_CM, ff, ff_Latn, ff_Latn_BF, ff_Latn_CM, ff_Latn_GH, ff_Latn_GM, ff_Latn_GN, ff_Latn_GW, ff_Latn_LR, ff_Latn_MR, ff_Latn_NE, ff_Latn_NG, ff_Latn_SL, ff_Latn_SN, fi, fi_FI, fr_CA, hu, hu_HU, hy, hy_AM, ka, ka_GE, kab, kab_DZ, kea, kea_CV, khq, khq_ML, kk, kk_KZ, ksf, ksf_CM, ksh, ksh_DE, ky, ky_KG, lt, lt_LT, lv, lv_LV, mfe, mfe_MU, nb, nb_NO, nb_SJ, nmg, nmg_CM, nn, nn_NO, no, os, os_GE, os_RU, pl, pl_PL, pt_AO, pt_CH, pt_CV, pt_GQ, pt_GW, pt_LU, pt_MO, pt_MZ, pt_PT, pt_ST, pt_TL, ru, ru_BY, ru_KG, ru_KZ, ru_MD, ru_RU, ru_UA, sah, sah_RU, se, se_FI, se_NO, se_SE, ses, ses_ML, shi, shi_Latn, shi_Latn_MA, shi_Tfng, shi_Tfng_MA, sk, sk_SK, smn, smn_FI, sq, sq_AL, sq_MK, sq_XK, sv, sv_AX, sv_FI, sv_SE, tg, tg_TJ, tk, tk_TM, tt, tt_RU, twq, twq_NE, tzm, tzm_MA, uk, uk_UA, uz, uz_Cyrl, uz_Cyrl_UZ, uz_Latn, uz_Latn_UZ, xh, xh_ZA, yav, yav_CM, zgh, zgh_MA |
1’234’567 | de_CH, de_LI, en_CH, gsw, gsw_CH, gsw_FR, gsw_LI, it_CH, rm, rm_CH, wae, wae_CH |
1 234 567 | fr, fr_BE, fr_BF, fr_BI, fr_BJ, fr_BL, fr_CD, fr_CF, fr_CG, fr_CH, fr_CI, fr_CM, fr_DJ, fr_DZ, fr_FR, fr_GA, fr_GF, fr_GN, fr_GP, fr_GQ, fr_HT, fr_KM, fr_MC, fr_MF, fr_MG, fr_ML, fr_MQ, fr_MR, fr_MU, fr_NC, fr_NE, fr_PF, fr_PM, fr_RE, fr_RW, fr_SC, fr_SN, fr_SY, fr_TD, fr_TG, fr_TN, fr_VU, fr_WF, fr_YT |
١٬٢٣٤٬٥٦٧ | ar, ar_001, ar_BH, ar_DJ, ar_EG, ar_ER, ar_IL, ar_IQ, ar_JO, ar_KM, ar_KW, ar_LB, ar_MR, ar_OM, ar_PS, ar_QA, ar_SA, ar_SD, ar_SO, ar_SS, ar_SY, ar_TD, ar_YE, ckb, ckb_IQ, ckb_IR, sd, sd_Arab, sd_Arab_PK |
۱٬۲۳۴٬۵۶۷ | fa, fa_AF, fa_IR, ks, ks_Arab, ks_Arab_IN, lrc, lrc_IQ, lrc_IR, mzn, mzn_IR, pa_Arab, pa_Arab_PK, ps, ps_AF, ps_PK, ur_IN, uz_Arab, uz_Arab_AF |
१,२३४,५६७ | bgc, bgc_IN, bho, bho_IN, raj, raj_IN |
१२,३४,५६७ | mr, mr_IN, ne, ne_IN, ne_NP, sa, sa_IN |
১,২৩৪,৫৬৭ | mni, mni_Beng, mni_Beng_IN |
১২,৩৪,৫৬৭ | as, as_IN, bn, bn_BD, bn_IN |
༡༢,༣༤,༥༦༧ | dz, dz_BT |
၁,၂၃၄,၅၆၇ | my, my_MM |
᱑,᱒᱓᱔,᱕᱖᱗ | sat, sat_Olck, sat_Olck_IN |
𑄷𑄸,𑄹𑄺,𑄻𑄼𑄽 | ccp, ccp_BD, ccp_IN |
𞥑⹁𞥒𞥓𞥔⹁𞥕𞥖𞥗 | ff_Adlm, ff_Adlm_BF, ff_Adlm_CM, ff_Adlm_GH, ff_Adlm_GM, ff_Adlm_GN, ff_Adlm_GW, ff_Adlm_LR, ff_Adlm_MR, ff_Adlm_NE, ff_Adlm_NG, ff_Adlm_SL, ff_Adlm_SN |
Code:
import java.util.*;
import java.util.Map.*;
import java.util.stream.*;
import com.ibm.icu.number.*;
import com.ibm.icu.util.*;

public class IcuSpellout {
  public static void main(String... args) {
    // Group all locales known to ICU by the string they produce for the same integer
    Map<String, TreeSet<ULocale>> numbers = new TreeMap<>();
    for(ULocale l : ULocale.getAvailableLocales()) {
      String string = NumberFormatter.withLocale(l).format(1234567).toString();
      numbers.computeIfAbsent(string, k -> new TreeSet<>()).add(l);
    }
    // Print one table row per distinct result: "result | locale, locale, ..."
    for(Entry<String, TreeSet<ULocale>> entry : numbers.entrySet()) {
      System.out.println(entry.getKey() + " | " +
        entry.getValue().stream().map(ULocale::toString).collect(Collectors.joining(", ")));
    }
  }
}
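For readers without ICU on the classpath, a similar grouping can be sketched with the JDK's own `java.text.NumberFormat`. This is not the code that produced the table above: the class name and the choice of language tags are assumptions, and the results depend on the locale data shipped with the JDK, so they may differ from ICU's for some locales.

```java
import java.text.NumberFormat;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

final class JdkLocaleGrouping {
    // Groups the given BCP 47 language tags by how the JDK formats the value for them.
    static Map<String, TreeSet<String>> group(long value, String... languageTags) {
        Map<String, TreeSet<String>> results = new TreeMap<>();
        for (String tag : languageTags) {
            Locale locale = Locale.forLanguageTag(tag);
            String formatted = NumberFormat.getIntegerInstance(locale).format(value);
            results.computeIfAbsent(formatted, k -> new TreeSet<>()).add(tag);
        }
        return results;
    }

    public static void main(String... args) {
        Map<String, TreeSet<String>> results =
            group(1234567, "de", "en-US", "en-IN", "fr", "de-CH", "ar", "hi");
        // One line per distinct result, in the same "result | locales" shape as the table
        results.forEach((formatted, tags) ->
            System.out.println(formatted + " | " + String.join(", ", tags)));
    }
}
```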
I am not sure if I understand how the spec defines decimal formats in the static context. Is it currently allowed for an implementation to provide default formats (other than the unnamed one) that have not been specified by the user? For example, would it currently be legal for a processor to return a result for format-number(1, '0', 'de')?
In XQFO 4.0, 4.7.1 Defining a decimal format says:
Decimal formats are defined in the static context, and the way they are defined is therefore outside the scope of this specification. XSLT and XQuery both provide custom syntax for creating a decimal format.
In XQuery 4.0, statically known decimal formats are defined as follows:
This is a mapping from QNames to decimal formats, with one default format that has no visible name, referred to as the unnamed decimal format. Each format is available for use when formatting numbers using the fn:format-number function. […]
5.10 Decimal Format Declaration says:
A decimal format declaration adds a decimal format to the statically known decimal formats, which define the properties used to format numbers using the fn:format-number() function, as described in XQuery and XPath Functions and Operators 4.0. […]
I’m closing this issue, as the PR was accepted, and the last question has been answered in today’s meeting (https://qt4cg.org/meeting/minutes/2024/03-05.html):
format-number(1, '0', 'de').
It would be nice if the decimal format for fn:format-number could also be supplied via an additional argument. The current syntax is:
The syntax could be enhanced as follows:
If both decimal-format-name and format are supplied, an error should be raised.
Edit 2023-05-02, adopted from a comment further below:
Next, language-specific default settings would be sensible. The existing syntax could be used:
format-number(123.45, '#.##0,00', 'de')
As known from the other functions for formatting numbers and dates, it could be up to the implementation to decide which languages are supported. The defaults could be overwritten by custom decimal-format declarations in the prolog to ensure that a setting is applied, even if an implementation does not support it.
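The rule from the proposal that supplying both decimal-format-name and format raises an error can be sketched as follows. The proposed signature itself is not reproduced in this text, so everything here is hypothetical: the method shape, the `NAMED` registry standing in for the statically known decimal formats, and the use of a string map for the inline format. Only the decimal-separator and grouping-separator properties are handled, and Java pattern syntax merely approximates XPath picture strings.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;
import java.util.Map;

final class InlineDecimalFormat {
    // Hypothetical registry standing in for the statically known decimal formats.
    static final Map<String, Map<String, String>> NAMED = Map.of(
        "de", Map.of("decimal-separator", ",", "grouping-separator", "."));

    // Either argument may be null; supplying both is rejected.
    static String formatNumber(double value, String picture,
                               String decimalFormatName, Map<String, String> format) {
        if (decimalFormatName != null && format != null) {
            throw new IllegalArgumentException(
                "Supply either decimal-format-name or format, not both");
        }
        Map<String, String> properties = Map.of();
        if (format != null) {
            properties = format;
        } else if (decimalFormatName != null) {
            properties = NAMED.get(decimalFormatName);
            if (properties == null) {
                throw new IllegalArgumentException(
                    "Unknown decimal format: " + decimalFormatName);
            }
        }
        DecimalFormatSymbols symbols = new DecimalFormatSymbols(Locale.ROOT);
        symbols.setDecimalSeparator(properties.getOrDefault("decimal-separator", ".").charAt(0));
        symbols.setGroupingSeparator(properties.getOrDefault("grouping-separator", ",").charAt(0));
        DecimalFormat formatter = new DecimalFormat("0", symbols);
        // The picture is written with the separators of the selected format.
        formatter.applyLocalizedPattern(picture);
        return formatter.format(value);
    }

    public static void main(String... args) {
        // Named format: prints 1.234,50
        System.out.println(formatNumber(1234.5, "#.##0,00", "de", null));
        // Same properties supplied inline instead of by name
        System.out.println(formatNumber(1234.5, "#.##0,00", null,
            Map.of("decimal-separator", ",", "grouping-separator", ".")));
    }
}
```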