qt4cg / qtspecs

QT4 specifications
https://qt4cg.org/

fn:format-number: Specifying decimal format #340

Closed ChristianGruen closed 8 months ago

ChristianGruen commented 1 year ago

It would be nice if the decimal format for fn:format-number could also be supplied via an additional argument. The current syntax is:

(: result: 12.345,67 :)
declare decimal-format de decimal-separator = ',' grouping-separator = '.';

format-number(
  value := 12345.67,
  picture := '#.##0,00',
  decimal-format-name := 'de'
)

The syntax could be enhanced as follows:

format-number(
  value := 12345.67,
  picture := '#.##0,00',
  format := map { 'decimal-separator': ',', 'grouping-separator': '.' }
)

If both decimal-format-name and format are supplied, an error should be raised.
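
For readers more at home in Java than in XQuery, the proposal can be approximated with the JDK's java.text classes. This is only a sketch under assumptions: the names FormatMapSketch and formatNumber are hypothetical, only the two properties from the example are handled, Java's pattern syntax is not identical to an XPath picture string, and the "both supplied" case raises a plain IllegalArgumentException rather than any error code the spec might assign.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; not taken from any QT4 implementation.
class FormatMapSketch {

  // Mimics format-number(value, picture, decimal-format-name, format):
  // at most one of formatName and format may be supplied.
  static String formatNumber(double value, String picture,
      String formatName, Map<String, String> format) {
    if (formatName != null && format != null) {
      throw new IllegalArgumentException(
          "decimal-format-name and format are mutually exclusive");
    }
    if (formatName != null) {
      // Named formats would come from the static context; not modelled here.
      throw new UnsupportedOperationException("named formats are not modelled");
    }
    // Start from locale-neutral symbols ('.' as decimal, ',' as grouping separator).
    DecimalFormatSymbols symbols = DecimalFormatSymbols.getInstance(Locale.ROOT);
    if (format != null) {
      if (format.containsKey("decimal-separator")) {
        symbols.setDecimalSeparator(format.get("decimal-separator").charAt(0));
      }
      if (format.containsKey("grouping-separator")) {
        symbols.setGroupingSeparator(format.get("grouping-separator").charAt(0));
      }
    }
    DecimalFormat formatter = new DecimalFormat("0", symbols);
    // A localized pattern is written with the custom separators, much like a picture string.
    formatter.applyLocalizedPattern(picture);
    return formatter.format(value);
  }

  public static void main(String... args) {
    System.out.println(formatNumber(12345.67, "#.##0,00", null,
        Map.of("decimal-separator", ",", "grouping-separator", ".")));
  }
}
```

Passing the settings as a plain map keeps the call self-contained: nothing has to be declared elsewhere before the function can be used.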

Edit 2023-05-02, adopted from a comment further below:

Next, language-specific default settings would be sensible. The existing syntax could be used:

format-number(12345.67, '#.##0,00', 'de')

As with the other functions for formatting numbers and dates, it could be up to the implementation to decide which languages are supported. The defaults could be overridden by custom decimal-format declarations in the prolog, ensuring that a setting is applied even if an implementation does not support the language.

michaelhkay commented 1 year ago

Yes, the current mechanism is very clumsy. I think the original intent in XSLT 1.0 was probably to define presentation "at arm's length" so that the logic didn't need to change if the output format changed, but that can be achieved perfectly well by putting the options in a global variable.

ChristianGruen commented 1 year ago

I think that language-specific default settings would be sensible:

format-number(123.45, '#.##0,00', 'de')

As with the other functions for formatting numbers and dates, it could be up to the implementation to decide which languages are supported. The defaults could be overridden by custom decimal-format declarations in the prolog, ensuring that a setting is applied even if an implementation does not support the language.
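
The lookup order described here (declared formats win, language defaults are only a fallback) might look roughly like this in Java. The class LanguageDefaultSketch, the DECLARED map and the resolve method are hypothetical stand-ins for the static context; the pattern uses Java's fixed notation ('.' for decimal, ',' for grouping) rather than an XPath picture string, and the fallback result depends on the locale data the runtime ships with.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of "declared format first, language default second".
class LanguageDefaultSketch {

  // Stands in for decimal formats declared in the query prolog.
  static final Map<String, DecimalFormatSymbols> DECLARED = new HashMap<>();

  static DecimalFormatSymbols resolve(String name) {
    DecimalFormatSymbols declared = DECLARED.get(name);
    if (declared != null) {
      return declared;
    }
    // Fall back to whatever the platform knows about this language tag.
    return DecimalFormatSymbols.getInstance(Locale.forLanguageTag(name));
  }

  static String formatNumber(double value, String javaPattern, String name) {
    return new DecimalFormat(javaPattern, resolve(name)).format(value);
  }

  public static void main(String... args) {
    // Implementation default for German; the output depends on the installed locale data.
    System.out.println(formatNumber(12345.67, "#,##0.00", "de"));

    // A user-declared format with the same name takes precedence.
    DecimalFormatSymbols custom = DecimalFormatSymbols.getInstance(Locale.ROOT);
    custom.setDecimalSeparator(',');
    custom.setGroupingSeparator('\'');
    DECLARED.put("de", custom);
    System.out.println(formatNumber(12345.67, "#,##0.00", "de"));
  }
}
```

Only the second call has a result that is fixed by the query itself; the first one is exactly the implementation-dependent part under discussion.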

michaelhkay commented 1 year ago

I'm not convinced this would give good interoperability. Consider Arabic for example: should it default to using western or eastern decimal digits? Both are in widespread use, and the idea that everyone with a particular (country, language) combination uses the same conventions is fundamentally misguided. This doesn't matter too much if it merely affects the format of the output, but it does matter if it makes a picture string valid in one implementation and invalid in another.
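
To make the concern concrete, here is a small Java sketch in which two decimal formats differ only in their zero digit. The class DigitFamilySketch and both methods are hypothetical. The hasDigitSign check is a deliberate simplification of the picture-string rules, based on the assumption that a picture needs at least one optional-digit sign ('#') or a digit from the format's own digit family.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

// Hypothetical sketch; the validity check is a simplification, not the spec's full rule set.
class DigitFamilySketch {

  // Formats with the given zero digit; all other symbols are locale-neutral.
  static String format(long value, char zeroDigit) {
    DecimalFormatSymbols symbols = DecimalFormatSymbols.getInstance(Locale.ROOT);
    symbols.setZeroDigit(zeroDigit);
    return new DecimalFormat("#,##0", symbols).format(value);
  }

  // True if the picture contains '#' or a digit from the family starting at zeroDigit.
  static boolean hasDigitSign(String picture, char zeroDigit) {
    for (char c : picture.toCharArray()) {
      if (c == '#' || (c >= zeroDigit && c <= zeroDigit + 9)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String... args) {
    char western = '0';
    char eastern = '\u0660'; // ARABIC-INDIC DIGIT ZERO
    System.out.println(format(1234567, western));
    System.out.println(format(1234567, eastern));
    // The same picture is usable with one digit family and not with the other.
    System.out.println(hasDigitSign("0", western)); // true
    System.out.println(hasDigitSign("0", eastern)); // false
  }
}
```

If an implementation silently chose the Eastern Arabic digit family as the default for 'ar', a picture written with ASCII digits would stop being recognised as containing digit signs, which is the interoperability problem raised above.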

ChristianGruen commented 1 year ago

I agree: some cases are easy to handle, and others are more sophisticated. I think the same is true for formatting integers and dates: the rules are rich and sophisticated, but for more advanced use cases (such as spelling out correct hiragana for numbers with Japanese counter words, or considering declension of numerals in Russian), you’ll be lost without writing custom code.

With ICU and Java, it’s fairly straightforward to choose language-specific formatting rules. I haven’t checked whether there are flags to control, for example, the formatting of Arabic numbers, and it could be that ICU has really taken the wrong path. From a German perspective, though, it’s restrictive that an implementation cannot provide sane defaults for non-English users.

This is how ICU formats the integer 1234567 with different locales:

Result Locales
1,234,567 ak, ak_GH, am, am_ET, ar_AE, ar_EH, asa, asa_TZ, bem, bem_ZM, bez, bez_TZ, bm, bm_ML, bo, bo_CN, bo_IN, ce, ce_RU, ceb, ceb_PH, cgg, cgg_UG, chr, chr_US, cy, cy_GB, dav, dav_KE, doi, doi_IN, ebu, ebu_KE, ee, ee_GH, ee_TG, en, en_001, en_150, en_AE, en_AG, en_AI, en_AS, en_AU, en_BB, en_BI, en_BM, en_BS, en_BW, en_BZ, en_CA, en_CC, en_CK, en_CM, en_CX, en_CY, en_DG, en_DM, en_ER, en_FJ, en_FK, en_FM, en_GB, en_GD, en_GG, en_GH, en_GI, en_GM, en_GU, en_GY, en_HK, en_IE, en_IL, en_IM, en_IO, en_JE, en_JM, en_KE, en_KI, en_KN, en_KY, en_LC, en_LR, en_LS, en_MG, en_MH, en_MO, en_MP, en_MS, en_MT, en_MU, en_MV, en_MW, en_MY, en_NA, en_NF, en_NG, en_NR, en_NU, en_NZ, en_PG, en_PH, en_PK, en_PN, en_PR, en_PW, en_RW, en_SB, en_SC, en_SD, en_SG, en_SH, en_SL, en_SS, en_SX, en_SZ, en_TC, en_TK, en_TO, en_TT, en_TV, en_TZ, en_UG, en_UM, en_US, en_VC, en_VG, en_VI, en_VU, en_WS, en_ZM, en_ZW, es_419, es_BR, es_BZ, es_CU, es_DO, es_GT, es_HN, es_MX, es_NI, es_PA, es_PE, es_PR, es_SV, es_US, fil, fil_PH, ga, ga_GB, ga_IE, gd, gd_GB, guz, guz_KE, gv, gv_IM, ha, ha_GH, ha_NE, ha_NG, haw, haw_US, he, he_IL, ig, ig_NG, ii, ii_CN, ja, ja_JP, jmc, jmc_TZ, kam, kam_KE, kde, kde_TZ, ki, ki_KE, kln, kln_KE, kn, kn_IN, ko, ko_KP, ko_KR, kok, kok_IN, ks_Deva, ks_Deva_IN, ksb, ksb_TZ, kw, kw_GB, lag, lag_TZ, lg, lg_UG, lkt, lkt_US, luo, luo_KE, luy, luy_KE, mai, mai_IN, mas, mas_KE, mas_TZ, mer, mer_KE, mg, mg_MG, mgo, mgo_CM, mi, mi_NZ, mn, mn_MN, ms, ms_MY, ms_SG, mt, mt_MT, naq, naq_NA, nd, nd_ZW, nus, nus_SS, nyn, nyn_UG, om, om_ET, om_KE, pcm, pcm_NG, qu, qu_EC, qu_PE, rof, rof_TZ, rwk, rwk_TZ, saq, saq_KE, sbp, sbp_TZ, sd_Deva, sd_Deva_IN, si, si_LK, sn, sn_ZW, so, so_DJ, so_ET, so_KE, so_SO, sw, sw_KE, sw_TZ, sw_UG, ta_MY, ta_SG, teo, teo_KE, teo_UG, th, th_TH, ti, ti_ER, ti_ET, to, to_TO, ug, ug_CN, ur, ur_PK, vai, vai_Latn, vai_Latn_LR, vai_Vaii, vai_Vaii_LR, vun, vun_TZ, xog, xog_UG, yi, yi_001, yo, yo_BJ, yo_NG, yue, yue_Hans, yue_Hans_CN, yue_Hant, yue_Hant_HK, zh, 
zh_Hans, zh_Hans_CN, zh_Hans_HK, zh_Hans_MO, zh_Hans_SG, zh_Hant, zh_Hant_HK, zh_Hant_MO, zh_Hant_TW, zu, zu_ZA
1.234.567 ar_DZ, ar_LY, ar_MA, ar_TN, ast, ast_ES, az, az_Cyrl, az_Cyrl_AZ, az_Latn, az_Latn_AZ, bs, bs_Cyrl, bs_Cyrl_BA, bs_Latn, bs_Latn_BA, ca, ca_AD, ca_ES, ca_FR, ca_IT, da, da_DK, da_GL, de, de_BE, de_DE, de_IT, de_LU, dsb, dsb_DE, el, el_CY, el_GR, en_AT, en_BE, en_DE, en_DK, en_NL, en_SI, es, es_AR, es_BO, es_CL, es_CO, es_EA, es_EC, es_ES, es_GQ, es_IC, es_PH, es_PY, es_UY, es_VE, eu, eu_ES, fo, fo_DK, fo_FO, fr_LU, fr_MA, fur, fur_IT, fy, fy_NL, gl, gl_ES, hr, hr_BA, hr_HR, hsb, hsb_DE, ia, ia_001, id, id_ID, is, is_IS, it, it_IT, it_SM, it_VA, jgo, jgo_CM, jv, jv_ID, kgp, kgp_BR, kkj, kkj_CM, kl, kl_GL, km, km_KH, ku, ku_TR, lb, lb_LU, ln, ln_AO, ln_CD, ln_CF, ln_CG, lo, lo_LA, lu, lu_CD, mgh, mgh_MZ, mk, mk_MK, ms_BN, ms_ID, mua, mua_CM, nl, nl_AW, nl_BE, nl_BQ, nl_CW, nl_NL, nl_SR, nl_SX, nnh, nnh_CM, pt, pt_BR, qu_BO, rn, rn_BI, ro, ro_MD, ro_RO, rw, rw_RW, sc, sc_IT, seh, seh_MZ, sg, sg_CF, sl, sl_SI, sr, sr_Cyrl, sr_Cyrl_BA, sr_Cyrl_ME, sr_Cyrl_RS, sr_Cyrl_XK, sr_Latn, sr_Latn_BA, sr_Latn_ME, sr_Latn_RS, sr_Latn_XK, su, su_Latn, su_Latn_ID, sw_CD, tr, tr_CY, tr_TR, vi, vi_VN, wo, wo_SN, yrl, yrl_BR, yrl_CO, yrl_VE
12,34,567 brx, brx_IN, en_IN, gu, gu_IN, hi, hi_IN, hi_Latn, hi_Latn_IN, ml, ml_IN, or, or_IN, pa, pa_Guru, pa_Guru_IN, ta, ta_IN, ta_LK, te, te_IN
1234567 en_US_POSIX
1 234 567 af, af_NA, af_ZA, agq, agq_CM, bas, bas_CM, be, be_BY, bg, bg_BG, br, br_FR, cs, cs_CZ, cv, cv_RU, de_AT, dje, dje_NE, dua, dua_CM, dyo, dyo_SN, en_FI, en_SE, en_ZA, eo, eo_001, es_CR, et, et_EE, ewo, ewo_CM, ff, ff_Latn, ff_Latn_BF, ff_Latn_CM, ff_Latn_GH, ff_Latn_GM, ff_Latn_GN, ff_Latn_GW, ff_Latn_LR, ff_Latn_MR, ff_Latn_NE, ff_Latn_NG, ff_Latn_SL, ff_Latn_SN, fi, fi_FI, fr_CA, hu, hu_HU, hy, hy_AM, ka, ka_GE, kab, kab_DZ, kea, kea_CV, khq, khq_ML, kk, kk_KZ, ksf, ksf_CM, ksh, ksh_DE, ky, ky_KG, lt, lt_LT, lv, lv_LV, mfe, mfe_MU, nb, nb_NO, nb_SJ, nmg, nmg_CM, nn, nn_NO, no, os, os_GE, os_RU, pl, pl_PL, pt_AO, pt_CH, pt_CV, pt_GQ, pt_GW, pt_LU, pt_MO, pt_MZ, pt_PT, pt_ST, pt_TL, ru, ru_BY, ru_KG, ru_KZ, ru_MD, ru_RU, ru_UA, sah, sah_RU, se, se_FI, se_NO, se_SE, ses, ses_ML, shi, shi_Latn, shi_Latn_MA, shi_Tfng, shi_Tfng_MA, sk, sk_SK, smn, smn_FI, sq, sq_AL, sq_MK, sq_XK, sv, sv_AX, sv_FI, sv_SE, tg, tg_TJ, tk, tk_TM, tt, tt_RU, twq, twq_NE, tzm, tzm_MA, uk, uk_UA, uz, uz_Cyrl, uz_Cyrl_UZ, uz_Latn, uz_Latn_UZ, xh, xh_ZA, yav, yav_CM, zgh, zgh_MA
1’234’567 de_CH, de_LI, en_CH, gsw, gsw_CH, gsw_FR, gsw_LI, it_CH, rm, rm_CH, wae, wae_CH
1 234 567 fr, fr_BE, fr_BF, fr_BI, fr_BJ, fr_BL, fr_CD, fr_CF, fr_CG, fr_CH, fr_CI, fr_CM, fr_DJ, fr_DZ, fr_FR, fr_GA, fr_GF, fr_GN, fr_GP, fr_GQ, fr_HT, fr_KM, fr_MC, fr_MF, fr_MG, fr_ML, fr_MQ, fr_MR, fr_MU, fr_NC, fr_NE, fr_PF, fr_PM, fr_RE, fr_RW, fr_SC, fr_SN, fr_SY, fr_TD, fr_TG, fr_TN, fr_VU, fr_WF, fr_YT
١٬٢٣٤٬٥٦٧ ar, ar_001, ar_BH, ar_DJ, ar_EG, ar_ER, ar_IL, ar_IQ, ar_JO, ar_KM, ar_KW, ar_LB, ar_MR, ar_OM, ar_PS, ar_QA, ar_SA, ar_SD, ar_SO, ar_SS, ar_SY, ar_TD, ar_YE, ckb, ckb_IQ, ckb_IR, sd, sd_Arab, sd_Arab_PK
۱٬۲۳۴٬۵۶۷ fa, fa_AF, fa_IR, ks, ks_Arab, ks_Arab_IN, lrc, lrc_IQ, lrc_IR, mzn, mzn_IR, pa_Arab, pa_Arab_PK, ps, ps_AF, ps_PK, ur_IN, uz_Arab, uz_Arab_AF
१,२३४,५६७ bgc, bgc_IN, bho, bho_IN, raj, raj_IN
१२,३४,५६७ mr, mr_IN, ne, ne_IN, ne_NP, sa, sa_IN
১,২৩৪,৫৬৭ mni, mni_Beng, mni_Beng_IN
১২,৩৪,৫৬৭ as, as_IN, bn, bn_BD, bn_IN
༡༢,༣༤,༥༦༧ dz, dz_BT
၁,၂၃၄,၅၆၇ my, my_MM
᱑,᱒᱓᱔,᱕᱖᱗ sat, sat_Olck, sat_Olck_IN
𑄷𑄸,𑄹𑄺,𑄻𑄼𑄽 ccp, ccp_BD, ccp_IN
𞥑⹁𞥒𞥓𞥔⹁𞥕𞥖𞥗 ff_Adlm, ff_Adlm_BF, ff_Adlm_CM, ff_Adlm_GH, ff_Adlm_GM, ff_Adlm_GN, ff_Adlm_GW, ff_Adlm_LR, ff_Adlm_MR, ff_Adlm_NE, ff_Adlm_NG, ff_Adlm_SL, ff_Adlm_SN

Code:

import java.util.*;
import java.util.Map.*;
import java.util.stream.*;
import com.ibm.icu.number.*;
import com.ibm.icu.util.*;

public class IcuSpellout {
  public static void main(String... args) {
    Map<String, TreeSet<ULocale>> numbers = new TreeMap<>();

    for(ULocale l : ULocale.getAvailableLocales()) {
      String string = NumberFormatter.withLocale(l).format(1234567).toString();
      numbers.computeIfAbsent(string, k -> new TreeSet<>()).add(l);
    }
    for(Entry<String, TreeSet<ULocale>> entry : numbers.entrySet()) {
      System.out.println(entry.getKey() + " | " +
          entry.getValue().stream().map(ULocale::toString).collect(Collectors.joining(", ")));
    }
  }
}

ChristianGruen commented 8 months ago

I am not sure if I understand how the spec defines decimal formats in the static context. Is it currently allowed for an implementation to provide default formats (other than the unnamed one) that have not been specified by the user? For example, would it currently be legal for a processor to return a result for format-number(1, '0', 'de')?

In XQFO 4.0, section 4.7.1 (Defining a decimal format) says:

Decimal formats are defined in the static context, and the way they are defined is therefore outside the scope of this specification. XSLT and XQuery both provide custom syntax for creating a decimal format.

In XQuery 4.0, statically known decimal formats are defined as follows:

This is a mapping from QNames to decimal formats, with one default format that has no visible name, referred to as the unnamed decimal format. Each format is available for use when formatting numbers using the fn:format-number function. […]

Section 5.10 (Decimal Format Declaration) says:

A decimal format declaration adds a decimal format to the statically known decimal formats, which define the properties used to format numbers using the fn:format-number() function, as described in XQuery and XPath Functions and Operators 4.0. […]

ChristianGruen commented 8 months ago

I’m closing this issue, as the PR was accepted, and the last question has been answered in today’s meeting (https://qt4cg.org/meeting/minutes/2024/03-05.html):