python-babel / babel

The official repository for Babel, the Python Internationalization Library
http://babel.pocoo.org/
BSD 3-Clause "New" or "Revised" License
1.29k stars 433 forks source link

Add support of local numbering systems for number symbols #1036

Closed kajte closed 7 months ago

kajte commented 8 months ago

Changes:

Example: Before:

>>> numbers.format_decimal(123456789.01234, locale="fa_IR")
'123,456,789.012'

Now:

numbers.format_decimal(123456789.01234, locale="fa_IR")
'123٬456٬789٫012'

Fixes partially issue:https://github.com/python-babel/babel/issues/446 I took some inspiration from the closed PR: https://github.com/python-babel/babel/pull/470

Couple questions:

I would eventually like to fix other # TODO: Support other number systems TODOs from the code and take default numbering system to use in percentage, currency etc. formatting and add a function for formatting digits to the local numbering system but I decided to create a pull request at this point to check the direction I'm taking is ok for maintainers. I feel even features added in this PR are an enhancement as such.

codecov[bot] commented 8 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (5dff607) 91.34% compared to head (abe6bda) 91.43%. Report is 1 commits behind head on master.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #1036 +/- ## ========================================== + Coverage 91.34% 91.43% +0.08% ========================================== Files 26 26 Lines 4415 4436 +21 ========================================== + Hits 4033 4056 +23 + Misses 382 380 -2 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

akx commented 8 months ago

Should it also convert the digits like this numbers.format_decimal(1.2, locale="ar", numbering_system='default') == '٢٫٣'?

I mean, if that's a sensible formatting, I think so, yeah. I do wonder if we should do one version where the high-level formatting functions still use "latn" and emit e.g. a FutureWarning when the default for the locale (e.g. "ar") would not be Latin, like

FutureWarning: the default numbering system for the `"ar"` locale is ...; to keep Latin formatting, pass `numbering_system='latn'`

? I'm just kind of worried about putting people through the trouble of their formatting suddenly changing with a "routine" Babel upgrade.

kajte commented 8 months ago

Stats about numbering systems regarding symbols I did some statistics regarding availability of number symbols (I'll paste code used for calculations to the end of message):

Conclusion: There's no need to have traditional alias. Also I feel native alias isn't that useful and we could skip it if not having it makes code simpler but it's your call.


Next steps on this PR

Could we preliminarily agree on sensible signature for functions so I can't start implementing it so there's less risk I make a lot of code we don't agree on?

One option could be to add something like numbering_system_config: Literal["default"] | str = "latn" to all relevant functions defaulting to latn. On later versions the FutureWarning can be added as you suggested.

It would be nicer if it don't have to call _determine_numbering_system on every stage and for lower level functions the argument could be only numbering_system: str = "latn". However, as far as I understand the library structure, most of functions in numbers module (and all on the example below) are public and documented so someone might be using them directly and then the ability to pass "default" is probably useful. Correct me if I'm wrong on this.

What do you think? Do you have more suitable suggestions?

# new function
def _determine_numbering_system(locale: Locale, numbering_system: Literal["default"] | str = "latn") -> str:
    if numbering_system == "default":
        return locale.default_numbering_system
    else:
        return numbering_system

#existing but numbering_system_config argument added
def format_decimal(
    number: float | decimal.Decimal | str,
    format: str | NumberPattern | None = None,
    locale: Locale | str | None = LC_NUMERIC,
    decimal_quantization: bool = True,
    group_separator: bool = True,
    *,
    numbering_system_config: Literal["default"] | str = "latn"
) -> str:
    locale = Locale.parse(locale)
    numbering_system = _determine_numbering_system(locale, numbering_system_config)
    ...
    NumberPattern.apply(..., numbering_system_config=numbering_system)

NumberPattern():
#existing but numbering_system_config argument added
     def apply(
        self,
        value: float | decimal.Decimal | str,
        locale: Locale | str | None,
        currency: str | None = None,
        currency_digits: bool = True,
        decimal_quantization: bool = True,
        force_frac: tuple[int, int] | None = None,
        group_separator: bool = True,
        *,
        numbering_system_config: Literal["default"] | str = "latn"
    ):
         numbering_system = _determine_numbering_system(locale, numbering_system_config)
         ...
          symbol = get_decimal_symbol(locale, numbering_system_config=numbering_system)
          ....

#existing but numbering_system_config argument and modified to use number system for getting correct symbol
def get_decimal_symbol(
        locale: Locale | str | None = LC_NUMERIC,
        *,
        numbering_system_config: Literal["default"] | str = "latn"
) -> str:
    parsed_locale = Locale.parse(locale)
    numbering_system = _determine_numbering_system(locale, numbering_system_config)
    return parsed_locale.number_symbols[numbering_system].get('decimal', '.')

Other thoughts I was chatting with my friend who is native speaker of bn. For that language it seems eg. decimal delimiters are same for bo latn and beng numering systems so it didn't help for the question whether mixing local symbols and latn numbers looks weird. However, I learned from him that converting digits from latn to beng is just doing 1-to-1 mapping from 0123456789 to ০১২৩৪৫৬৭৮৯ (https://github.com/unicode-org/cldr/blob/main/common/supplemental/numberingSystems.xml#L18). Also, he said they use quite often also latn numbers.

I think it would be quite easy to validate digit formatting against eg. the java/kotlin number formatter from the standard library to get more confidence (or to findout there's more nuances) that converting digits is as simple as 1-to-1 mapping for all number systems having 10 digits.


Here's code that I used to get stats on the beginning of post. Basically calculating number of failures or successes depending on the case. Just for the reference.

def test_latn_available(locale):
    assert Locale.parse(locale).number_symbols["latn"]

@pytest.mark.all_locales
def test_default_available(locale):
    locale = Locale.parse(locale)
    assert locale.number_symbols[locale.default_numbering_system]

@pytest.mark.all_locales
def test_locale_have_native(locale):
    assert Locale.parse(locale).other_numbering_systems["native"]

@pytest.mark.all_locales
def test_native_locale_have_symbols(locale):
    locale = Locale.parse(locale)
    assert locale.number_symbols[locale.other_numbering_systems["native"]]

@pytest.mark.all_locales
def test_locale_have_traditional(locale):
    assert Locale.parse(locale).other_numbering_systems["traditional"]

@pytest.mark.all_locales
def test_traditional_locale_have_symbols(locale):
    locale = Locale.parse(locale)
    assert locale.number_symbols[locale.other_numbering_systems["traditional"]]

@pytest.mark.all_locales
def test_native_and_default_same_if_native_isnt_latin(locale):
    locale = Locale.parse(locale)
    native = locale.other_numbering_systems["native"]
    default = locale.default_numbering_system

    if native != "latn":
        assert native == default, f"{default} {native} {locale}"
akx commented 8 months ago

Sorry for the delay, had a bunch of other things to attend to...

One option could be to add something like numbering_system_config: Literal["default"] | str = "latn" to all relevant functions defaulting to latn. On later versions the FutureWarning can be added as you suggested.

Sounds good, but let's call the argument numbering_system instead, since it's kwarg-only so numbering_system_config= would be littering user code with an extra _config word :)

It would be nicer if it don't have to call _determine_numbering_system on every stage and for lower level functions the argument could be only numbering_system: str = "latn". However, as far as I understand the library structure, most of functions in numbers module (and all on the example below) are public and documented so someone might be using them directly and then the ability to pass "default" is probably useful.

That is the case, yep. That's somewhat analogous to how you can pass in a locale identifier to almost anything, and we call Locale.parse on it.

The API surface looks good to me at first glance aside from that naming change :)

akx commented 7 months ago

@kajte I'm planning on trying to get a new Babel out maybe this week or early next week, and I think we could at least include the numbering-system loading code from this PR, even if there'd be no surface changes that'd allow to use them just yet, if you have the time to crop this PR to only that? IOW, pretty much just change the get_*_symbol() code to always use "latn" for the time being.

kajte commented 7 months ago

@kajte I'm planning on trying to get a new Babel out maybe this week or early next week, and I think we could at least include the numbering-system loading code from this PR, even if there'd be no surface changes that'd allow to use them just yet, if you have the time to crop this PR to only that? IOW, pretty much just change the get_*_symbol() code to always use "latn" for the time being.

I wrote some code last week that I didn't have time to complete yet. I can check today or tomorrow whether it's close to done. If not, your suggestions sounds good.

kajte commented 7 months ago

Okay, I got something done now.

  1. Loads number symbols for available numbering systems from cldr data
  2. All number formatting functions using number symbol data should now take argument numbering_system: Literal["default"] | str = "latn" which is used to select from which numbering system symbols are used. I hope I didn't miss any function or forgot to test any function where I added the numbering_system argument

How does it look? If needed I can also change it what you suggested on Wednesday so that we only want to include that part to the next release.

akx commented 7 months ago

@kajte Thanks! Can you rebase this on master and take care of the lint errors? Beyond the ones currently flagged by Ruff, there will be new ones since #1045 was just merged. I'll give the rest a review too now.

kajte commented 7 months ago

@kajte Thanks! Can you rebase this on master and take care of the lint errors? Beyond the ones currently flagged by Ruff, there will be new ones since #1045 was just merged. I'll give the rest a review too now.

Rebased and at least pre-commit hook succeed locally now.

akx commented 7 months ago

@kajte Thanks! 🎉

kajte commented 7 months ago

How should I run linters locally to catch these? I installed the pre-commit hook and it passed for my commit.

kajte commented 7 months ago

@kajte Thanks! 🎉

Thanks for your support! I have couple busy weeks ahead but I'll try continue with formatting actual numerals after that.

akx commented 7 months ago

How should I run linters locally to catch these? I installed the pre-commit hook and it passed for my commit.

I think you hadn't rebased on the master that had the augmented Ruff configuration :)

Thanks for your support! I have couple busy weeks ahead but I'll try continue with formatting actual numerals after that.

Thank you for the contribution! Same here TBH, and then it'll be kinkunpaistoaika and – – 😁