unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org

`NumberFormatter` & co. scope (unified vs modular) #275

Open zbraniecki opened 4 years ago

zbraniecki commented 4 years ago

In ICU (and ECMA-402), NumberFormat becomes the jack of all trades, formatting numbers, currencies, measurement units, and so on. There's even a drive from Shane to incorporate pluralization as a feature of NumberFormat.

Shane justified it by saying that all number formatters take similar options and similar operations to tailor the data.

The cost of such an approach is that it becomes trickier to modularize such crates. NumberFormat becomes a pretty large codebase at a very fundamental level, required by basically everything, and fragile DCE (dead code elimination) is the only hope of keeping the overhead low.

It is my impression that in ICU4X context we can bring that modularity back, and I believe Shane is intending to separate the part that operates on the number (like rounding, tailoring etc.) into FixedDecimal and similar helper structs and effectively removing the benefit of clustering all types of numerical operations in a single formatter. @sffc can you share your take on this if I misunderstood you?

If that hypothesis is correct, we have a way to get modular and lean CurrencyFormatter, MeasureUnitFormatter, RuleBasedNumberFormatter, RelativeTimeFormatter, DurationFormatter and so on into their own components, keeping each one small without paying an overhead when all of them are in use.

This issue has been filed to discuss that and verify that we're all on the same page about how we want to tackle this topic.

sffc commented 4 years ago

I am of the opinion that crates are not the most effective way to go about modularization. I have written in wrapper-layer.md that I believe we can use dead code elimination to achieve modularization in a much more effective way.

sffc commented 4 years ago

> I believe Shane is intending to separate the part that operates on the number (like rounding, tailoring etc.) into FixedDecimal and similar helper structs and effectively removing the benefit of clustering all types of numerical operations in a single formatter. @sffc can you share your take on this if I misunderstood you?

FixedDecimal is intended as a type that preserves leading and trailing zeros on input and output of NumberFormat, which is an important feature we largely lack in 402. Rounding operations cannot be split from NumberFormat because rounding depends on locale data for currencies, compact decimals, and measurement units.
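To illustrate what "preserves leading and trailing zeros" means in practice, here is a minimal sketch. `DigitString` is a hypothetical stand-in invented for this example, not the actual FixedDecimal API; it only shows why a digit-based representation keeps "1.50" and "1.5" distinct where a binary float does not.

```rust
use std::fmt;

/// Hypothetical stand-in for a FixedDecimal-like type (not the real API).
/// Digits are stored explicitly, so zeros in the input are never dropped.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DigitString {
    negative: bool,
    /// Integer digits, most significant first; leading zeros are kept.
    integer: Vec<u8>,
    /// Fraction digits, most significant first; trailing zeros are kept.
    fraction: Vec<u8>,
}

impl DigitString {
    /// Parses ASCII input such as "-012.500". Returns None for anything else.
    pub fn parse(s: &str) -> Option<Self> {
        let (negative, rest) = match s.strip_prefix('-') {
            Some(r) => (true, r),
            None => (false, s),
        };
        let (int_part, frac_part) = match rest.split_once('.') {
            Some((i, f)) if !f.is_empty() => (i, f),
            Some(_) => return None, // a dot with nothing after it
            None => (rest, ""),
        };
        if int_part.is_empty() {
            return None;
        }
        // Collecting Option<u8> items yields None if any char is not 0-9.
        let to_digits = |p: &str| -> Option<Vec<u8>> {
            p.chars().map(|c| c.to_digit(10).map(|d| d as u8)).collect()
        };
        Some(Self {
            negative,
            integer: to_digits(int_part)?,
            fraction: to_digits(frac_part)?,
        })
    }
}

impl fmt::Display for DigitString {
    /// Writes the digits back exactly as they were stored.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        if self.negative {
            f.write_str("-")?;
        }
        for d in &self.integer {
            write!(f, "{}", d)?;
        }
        if !self.fraction.is_empty() {
            f.write_str(".")?;
            for d in &self.fraction {
                write!(f, "{}", d)?;
            }
        }
        Ok(())
    }
}
```

With this sketch, `DigitString::parse("1.50")` prints back as `1.50`, whereas parsing the same string as an `f64` and printing it gives `1.5`.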

sffc commented 3 years ago

2020-12-04 discussion:

sffc commented 3 years ago

More specifically, here is how I see the breakdown of features going into FixedDecimalFormat (lower level) versus KitchenSinkNumberFormat (higher level):

FixedDecimalFormat

What: Pass-through formatter for FixedDecimal, applying localized symbols but no arithmetic.

Features:

* Sign display is slightly more complex, due to the requirement that we add affixes to the number. The code may be slightly smaller if FixedDecimalFormat were "positive only", not capable of outputting a sign.

** Depends on the chosen design of #228
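A pass-through formatter of the kind described above could look roughly like the sketch below. Everything here is an assumption made for illustration: the `DecimalSymbols` struct, the `format_pass_through` function, and the fixed groups of three digits (real locale data allows other grouping sizes). The point is only that the function substitutes symbols and never does arithmetic on the value.

```rust
/// Hypothetical symbol set; in a real formatter this would come from locale data.
pub struct DecimalSymbols {
    pub digits: [char; 10],
    pub decimal_separator: &'static str,
    pub grouping_separator: &'static str,
    pub minus_sign: &'static str,
}

/// Formats an ASCII digit string such as "-1234.50" using the given symbols.
/// No rounding or other arithmetic is performed: digits are mapped one to one.
/// Assumes groups of three integer digits, which is a simplification.
pub fn format_pass_through(input: &str, sym: &DecimalSymbols) -> Option<String> {
    let (negative, rest) = match input.strip_prefix('-') {
        Some(r) => (true, r),
        None => (false, input),
    };
    let (int_part, frac_part) = rest.split_once('.').unwrap_or((rest, ""));
    let n = int_part.chars().count();
    if n == 0 {
        return None;
    }

    let mut out = String::new();
    if negative {
        // Sign handling is reduced to a prefix here; real affixes are more involved.
        out.push_str(sym.minus_sign);
    }
    for (i, c) in int_part.chars().enumerate() {
        let d = c.to_digit(10)? as usize;
        // Insert a separator when the remaining digit count is a multiple of three.
        if i > 0 && (n - i) % 3 == 0 {
            out.push_str(sym.grouping_separator);
        }
        out.push(sym.digits[d]);
    }
    if !frac_part.is_empty() {
        out.push_str(sym.decimal_separator);
        for c in frac_part.chars() {
            out.push(sym.digits[c.to_digit(10)? as usize]);
        }
    }
    Some(out)
}
```

Note that the trailing zero in an input like "1234567.50" survives, because the fraction digits are copied rather than recomputed.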

KitchenSinkNumberFormat

What: A larger, data-driven formatter supporting a broader set of UTS 35 features.

Features:

* "Currency" encompasses currency spacing rules, currency rounding, symbol resolution, etc.

Note on Rounding

Rounding is a big chunk of the logic in ICU NumberFormatter. Unfortunately, it needs to be coupled with at least KitchenSinkNumberFormat, because selecting a compact form and applying a currency both require rounding the number based on locale data.
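As a small illustration of why rounding depends on data, the sketch below rounds the same value differently depending on the currency. The function names and the hard-coded table are hypothetical; a real formatter would read the fraction digits from currency data, and half-even rounding is used here only as an assumed default. Overflow is not handled.

```rust
/// Hypothetical lookup of the number of fraction digits for a currency.
/// The values are hard-coded for this sketch; real code would use locale data.
pub fn currency_fraction_digits(code: &str) -> Option<u32> {
    match code {
        "USD" | "EUR" => Some(2),
        "JPY" => Some(0),
        "KWD" => Some(3),
        _ => None,
    }
}

/// Rounds the value `mantissa * 10^-scale` to `digits` fraction digits using
/// half-even rounding. Returns the new mantissa, whose scale is `digits`.
pub fn round_half_even(mantissa: i64, scale: u32, digits: u32) -> i64 {
    if digits >= scale {
        // Nothing to discard; pad with zeros instead.
        return mantissa * 10_i64.pow(digits - scale);
    }
    let divisor = 10_i64.pow(scale - digits);
    // Floor division, so the remainder is always in 0..divisor.
    let quotient = mantissa.div_euclid(divisor);
    let remainder = mantissa.rem_euclid(divisor);
    let twice = remainder * 2;
    if twice > divisor || (twice == divisor && quotient % 2 != 0) {
        quotient + 1
    } else {
        quotient
    }
}

/// Rounds a value for display in the given currency.
/// Returns (mantissa, scale), or None if the currency is not in the table.
pub fn round_for_currency(mantissa: i64, scale: u32, code: &str) -> Option<(i64, u32)> {
    let digits = currency_fraction_digits(code)?;
    Some((round_half_even(mantissa, scale, digits), digits))
}
```

For the input 1234.567 (mantissa 1234567, scale 3), this sketch yields 1234.57 for USD, 1235 for JPY, and 1234.567 for KWD, so the rounding step cannot be performed before the currency is known.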

sffc commented 2 years ago

I filed #1441 to track currency formatting.

In terms of class structure / modularity, there are two main dimensions:

  1. Notation
    • Decimal
    • Compact Decimal
    • Scientific
    • Spellout (RBNF) -- not yet supported in ECMA-402, but we want to get here
  2. Unit
    • No Unit
    • Currency
    • Percentage
    • Measurement

The challenge is that these two dimensions can be combined freely, and when doing so, we may need to load different data or use different code paths.

For example:

| Unit \ Notation | Decimal | Compact | Scientific | Spellout |
| --- | --- | --- | --- | --- |
| None | 1000 | 1K | 1E3 | one thousand |
| Currency | $1000.00 | $1K | $1.00E3 | one thousand dollars |
| Percent | 1000% | 1K% | 1E3% | one thousand percent |
| Measure | 1000 m | 1K m | 1E3 m | one thousand meters |

Within each box, there may be multiple display options as well, most often long/short/narrow.

Clearly, some formats in this table make more sense than others, but we need to think about how to scale up to support this grid.
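One way to picture the grid is as two independent enums whose combination determines which data must be loaded. The sketch below is purely illustrative: the enum names, the `DataKey` variants, and the mapping in `required_data` are assumptions made for this example, not an existing or proposed ICU4X API.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Notation {
    Decimal,
    Compact,
    Scientific,
    Spellout,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Unit {
    None,
    Currency,
    Percent,
    Measure,
}

/// Hypothetical names for the pieces of locale data a formatter might load.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DataKey {
    DecimalSymbols,
    CompactPatterns,
    ScientificPatterns,
    SpelloutRules,
    CurrencyData,
    PercentPattern,
    MeasureUnitPatterns,
}

impl Notation {
    pub const ALL: [Notation; 4] = [
        Notation::Decimal,
        Notation::Compact,
        Notation::Scientific,
        Notation::Spellout,
    ];
}

impl Unit {
    pub const ALL: [Unit; 4] = [Unit::None, Unit::Currency, Unit::Percent, Unit::Measure];
}

/// Lists the data one cell of the grid would need, under the assumption that
/// each dimension contributes its own keys independently of the other.
pub fn required_data(notation: Notation, unit: Unit) -> Vec<DataKey> {
    let mut keys = Vec::new();
    match notation {
        Notation::Decimal => keys.push(DataKey::DecimalSymbols),
        Notation::Compact => {
            keys.push(DataKey::DecimalSymbols);
            keys.push(DataKey::CompactPatterns);
        }
        Notation::Scientific => {
            keys.push(DataKey::DecimalSymbols);
            keys.push(DataKey::ScientificPatterns);
        }
        Notation::Spellout => keys.push(DataKey::SpelloutRules),
    }
    match unit {
        Unit::None => {}
        Unit::Currency => keys.push(DataKey::CurrencyData),
        Unit::Percent => keys.push(DataKey::PercentPattern),
        Unit::Measure => keys.push(DataKey::MeasureUnitPatterns),
    }
    keys
}
```

In this framing, a plain decimal formatter needs only one key, while compact currency needs three. In practice the dimensions are probably not this independent (for example, the long/short/narrow display options within each cell are not modeled), which is part of the scaling question.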

CC @robertbastian