Number, date, time, duration formatters

rsheptolut commented 6 years ago

Hi!

How about other formatters supported by messageformat.net and defined by the spec? Especially number, because otherwise there's no way for plural to work as it is supposed to, since now it just spits out the number "as is" in the Invariant Culture, without respect to the regional decimal separator and without any thousand separators.

Although I've looked at https://github.com/andyearnshaw/Intl.js/blob/master/src/11.numberformat.js and it looks pretty hopeless to faithfully translate that into C#.

jeffijoe commented 6 years ago

Would it help to be able to pass in a culture?

rsheptolut commented 6 years ago

@jeffijoe we're already passing the culture when constructing MessageFormatter, this should be enough, shouldn't it? I'm thinking more like, implement the number formatter as a simple proxy to C# facilities to format numbers. This thing would just use the culture already in MessageFormatter instance to create a CultureInfo and format a number according to the rules, whatever they are. Wouldn't be very up to spec of ICU MessageFormat, probably, but the easiest way to get the job done.

rsheptolut commented 6 years ago

And same thing with date, time and duration formatters, that are defined in messageformat.js but not implemented here. Just do a simple translator to call .NET Framework methods, basically what messageformat.js does, calling Intl for the most part.

jeffijoe commented 6 years ago

Oh, I forgot about that! That was added in #14.

Would you be willing to submit a PR?

rsheptolut commented 6 years ago

@jeffijoe I will if I decide to commit to messageformat.net for my project. The other thing that scares me is that I'll also have to port PluralRules from messageformat.js for other locales. That's a lot of locales!

jeffijoe commented 6 years ago

Locales are busywork so that's why I didn't bother. The most important part for me was the pluralisation constructs

glen-84 commented 1 year ago

Are there really no formatters for numbers, dates, etc.? 😞

Basically, it should be able to do anything that MessageFormat.js can do.

Or not? 🙃

Argument formatting is a core part of the standard, and it's supported by JavaScript libraries, PHP, Java, etc.

jeffijoe commented 1 year ago

> Are there really no formatters for numbers, dates, etc.? 😞
>
> Basically, it should be able to do anything that MessageFormat.js can do.

That was written at a time when plural and select were everything the JS version supported. But most of this is already something .NET supports natively.

No promises, but I can look into it if I get some time.

jeffijoe commented 1 year ago

@glen-84 from the page you linked, they even recommend pre-formatting arguments.

glen-84 commented 1 year ago

@jeffijoe

It's the last of three recommended methods for the argument style, not the type.

{0, number, integer}
    ^ type  ^ style

The predefined styles are the most important, IMO.

jeffijoe commented 1 year ago

@glen-84 looking at the skeletons, too. It looks like ICU4C and ICU4J use different formatting codes, right? Like j in C for hour vs h in Java? If we support skeletons, we would just forward it to the Dotnet datetime formatter.

What would short, medium, long, full be equivalent to for numbers and date/time? Are date and time different formatters? Is percent just adding a % at the end of the number, or is there more to it?

glen-84 commented 1 year ago

> @glen-84 looking at the skeletons, too. It looks like ICU4C and ICU4J use different formatting codes, right? Like j in C for hour vs h in Java? If we support skeletons, we would just forward it to the Dotnet datetime formatter.

I think the j is part of ICU: http://www.unicode.org/reports/tr35/tr35-dates.html#Date_Field_Symbol_Table

From that page:

> Input skeleton symbol. It must not occur in pattern or skeleton data. Instead, it is reserved for use in skeletons passed to APIs doing flexible date pattern generation. In such a context, it requests the preferred hour format for the locale (h, H, K, or k), as determined by the preferred attribute of the hours element in supplemental data. In the implementation of such an API, 'j' must be replaced by h, H, K, or k before beginning a match against availableFormats data. Note that use of 'j' in a skeleton passed to an API is the only way to have a skeleton request a locale's preferred time cycle type (12-hour or 24-hour).

I would assume that the symbols should be consistent across languages, for portability of the messages.

If that's true, there may need to be some form of translation between symbols, unless it's somehow possible to make use of the ICU data built into .NET.

It looks like skeletons can get quite complicated, so it may make sense to support predefined styles first, and then look at how skeletons could be supported.

It would likely also be okay to only support a subset of symbols, at least initially. FormatJS does this, due to limitations in ECMA402's Intl API.

> What would short, medium, long, full be equivalent to for numbers and date/time?

Examples when using FormatJS, and the en locale.

Dates: (Date.Now)

- {value, date} -> 12/23/2022
- {value, date, short} -> 12/23/22
- {value, date, medium} -> Dec 23, 2022
- {value, date, long} -> December 23, 2022
- {value, date, full} -> Friday, December 23, 2022

Numbers: (1234567)

- {value, number} -> 1,234,567
- {value, number, integer} -> 1,234,567
- {value, number, ::currency/USD} -> $1,234,567.00
- {value, number, percent} -> 123,456,700%

For numbers, string formatting may get close (N0, C, P0), but for dates it more likely requires ICU data.
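To make the "string formatting may get close" idea concrete, here is a minimal sketch of a style-to-specifier mapping. The class and method names (`NumberStyleSketch`, `ToDotNetFormat`, `Format`) are hypothetical and are not part of messageformat.net; only the standard .NET format specifiers N, N0, C and P0 are real.

```csharp
using System;
using System.Globalization;

public static class NumberStyleSketch
{
    // Hypothetical mapping from ICU predefined number styles to .NET
    // standard numeric format strings. Illustrative only.
    public static string ToDotNetFormat(string icuStyle) => icuStyle switch
    {
        "integer"  => "N0", // grouping separators, no fraction digits
        "currency" => "C",  // currency symbol and digits of the culture
        "percent"  => "P0", // multiplies by 100, no fraction digits
        _          => "N",  // no style: default number format of the culture
    };

    // Formats the value with the culture passed in by the caller.
    public static string Format(double value, string icuStyle, CultureInfo culture) =>
        value.ToString(ToDotNetFormat(icuStyle), culture);
}
```

For instance, `NumberStyleSketch.Format(1234567, "integer", CultureInfo.GetCultureInfo("en-US"))` gives `1,234,567`. Note that "N" always prints a fixed number of decimal places, so the unstyled case will not match ICU exactly.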

> Are date and time different formatters?

They are different types, yes.

> Is percent just adding a % at the end of the number, or is there more to it?

That depends on the locale.

I think the next step would be to see what ICU-related APIs are available in .NET.

NightOwl888 commented 1 year ago

Just FYI, I am currently in the process of porting RuleBasedNumberFormat line-by-line from Java in ICU4N. It is still very much a work in progress and there are no plans at present to build it up to the point where MessageFormat is fully supported, but it is being factored in to how the pieces fit together. Given the complexity of requirements for this, rather than taking it all back to requirements I am hoping to port it line-by-line and gain enough understanding of the technical details and how much of the spec .NET actually supports in order to refactor the DecimalFormat and RuleBasedNumberFormat into something more .NET-like.

Last year, I also worked on building Java parsing/formatting functionality into .NET in J2N.

This goes beyond business requirements. In .NET, the parsers and formatters are done in a way where the state is marshalled over to the current thread/async task so there doesn't need to be any thread synchronization. They also use a lot of low-level optimizations like pointers, Span<char>, and a dedicated buffer on the current thread to make a subset of the spec really fast. In Java, there are tons of allocations and there is some thread synchronization added because each object manages its own data rather than keeping the data aligned with the current thread. I haven't run any benchmarks yet, but I suspect the .NET approach is at least 3x faster and will scale far better than how it was done in ICU4J due to the extra overhead of each object managing its own data and settings instead of sharing these across threads.

.NET only supports ASCII digits, where ICU4J supports digits for all cultures on certain parts of the string, such as exponent.

.NET only supports a subset of the features in the spec, but it is generally enough for most applications and it is done in a way that will work at scale.

I think there is plenty of room in the .NET ecosystem for both messageformat.net and ICU4N to exist. IMO, it would be best if ICU4N takes care of the more advanced features and messageformat.net is kept as a lightweight alternative for those who only want to extend formatting in .NET. Although I don't object if you want to pull data out of the CLDR to provide additional features, to me it would make more sense from both a performance and maintainability perspective if you simply stick with the features that .NET already supports, where possible. That being said, it does make sense to pool our knowledge in order to see how much of the spec we have and how much we are missing as well as how to map feature per feature from ICU to .NET.

I have just completed porting the Currency class from ICU4J, which is a dependency of ICU4J's DecimalFormat class. This is what supplies the JPY, USD, etc. currency codes to the formatter. .NET doesn't expose this data, but there is a wrinkle with it that I didn't expect: the currency codes depend on the date, and there can also be more than one currency code in use in a given culture. They are supplied in order of precedence. And this is just one small part of currency parsing/formatting, which is itself just a small part of formatting numbers into strings.

The settings are also complicated by the fact that they are non-orthogonal so it is sometimes difficult to understand which settings apply in which circumstances. For example, specifying the NumberFormatInfo.NumberDecimalDigits option only works in combination with the F or N formats. ICU's solution was to build a fluent API to channel users through the settings, taking away settings that are no longer valid. While I agree with this in principle, there really ought to be an API to use like the one in .NET for those who don't want to deal with extra allocations associated with a fluent API or at least make it so the settings derived from the fluent API can be cached and passed onto the formatter at runtime.
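A small sketch of that non-orthogonality, using a writable clone of the invariant NumberFormatInfo so the results don't depend on installed culture data. The class name `DecimalDigitsSketch` is made up for illustration.

```csharp
using System.Globalization;

public static class DecimalDigitsSketch
{
    public static string[] Demo()
    {
        // Clone() returns a writable copy of the read-only invariant settings.
        var nfi = (NumberFormatInfo)CultureInfo.InvariantCulture.NumberFormat.Clone();
        nfi.NumberDecimalDigits = 4;

        return new[]
        {
            1.5.ToString("N", nfi), // "1.5000": N picks up NumberDecimalDigits
            1.5.ToString("F", nfi), // "1.5000": F picks up NumberDecimalDigits
            1.5.ToString("G", nfi), // "1.5": G ignores the setting
            1.5.ToString("P", nfi), // P uses PercentDecimalDigits instead
        };
    }
}
```

So the same property silently applies or doesn't apply depending on the format specifier, which is the kind of thing a fluent API is meant to guard against.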

Duration format - do note this exists in .NET on the TimeSpan class.

I haven't yet looked at it to see whether there are gaps that .NET doesn't support that ICU does, as this isn't the primary focus at present.

Date format mapping - we have done a simple mapping in NumberDateFormat. However, it is probably wrong to assume that all dates should go before times separated by a space.

You may also find this document helpful: https://unicode.org/reports/tr35/tr35-numbers.html.

glen-84 commented 1 year ago

Thanks for your input @NightOwl888.

It's really unfortunate that ICU APIs are not part of the framework.

It doesn't seem practical to implement this "from scratch", so maybe this library should just add basic support using .NET format strings, and perhaps clarify in the README that the behaviour is not 1-to-1 with other ICU libraries.

If ICU4N ever reaches a point where number, date, and other formatting is fully supported, messageformat.net could consider making use of that functionality to better align with ICU standards.

NightOwl888 commented 1 year ago

> It's really unfortunate that ICU APIs are not part of the framework.

Or fortunate, depending on how you look at it. The ICU code isn't always the most efficient and the fact that Microsoft made a high-performance formatter instead of trying to pipe stuff back to the C++ version (and deal with the threading issues) is something we all benefit from.

> It doesn't seem practical to implement this "from scratch", so maybe this library should just add basic support using .NET format strings, and perhaps clarify in the README that the behaviour is not 1-to-1 with other ICU libraries.

Well, messageformat.net is a bit more "from scratch" than the direction that ICU4N is heading, being that messageformat.net uses the pluralization data from the CLDR. I wasn't trying to discourage anyone from adding the date, time, and duration features, but was just trying to be helpful on how much work is involved in doing such a thing.

> If ICU4N ever reaches a point where number, date, and other formatting is fully supported, messageformat.net could consider making use of that functionality to better align with ICU standards.

Seems a bit strange to do it that way, since messageformat.net is tiny and ICU4N has ~20MB of resource files (the next release will put them into satellite assemblies). IMO, adding the extra features to messageformat.net would be good since they can be combined more easily to build the formatted output.

As for ICU4N, I am still debating how to deal with formatting. .NET provides zero support for extending parsers, and custom formatters don't let you format any date or number types (they are hard coded to ask for a NumberFormatInfo or DateTimeFormatInfo, which are both sealed). And I have learned by porting the documentation comments that the design of MessageFormat is over 20 years old. It really could have been a lot better if it had been designed after generics were a thing.

The way forward in ICU is apparently using fluent APIs all the way (there is a preview of MessageFormatter in the current version which does just that).

Of course, this design still breaks a core design principle of .NET - never store CultureInfo in a field! Otherwise, if you specify CultureInfo.CurrentCulture you will be in for a surprise when your message is formatted in the current culture when the formatter was created, not the current culture now.
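Here is a rough sketch of that surprise. Both formatter classes are hypothetical stand-ins, not the ICU4N or messageformat.net types; the point is only the difference between capturing a CultureInfo in a field and resolving it at format time.

```csharp
using System;
using System.Globalization;

// Hypothetical formatter that captures the culture once, at construction.
public sealed class CapturingFormatter
{
    private readonly CultureInfo _culture;
    public CapturingFormatter(CultureInfo culture) => _culture = culture;
    public string Format(double value) => value.ToString("N1", _culture);
}

// Hypothetical alternative that asks for the culture each time it formats.
public sealed class DeferredFormatter
{
    private readonly Func<CultureInfo> _getCulture;
    public DeferredFormatter(Func<CultureInfo> getCulture) => _getCulture = getCulture;
    public string Format(double value) => value.ToString("N1", _getCulture());
}

public static class CultureCaptureDemo
{
    public static (string Captured, string Deferred) Run()
    {
        var original = CultureInfo.CurrentCulture;
        try
        {
            CultureInfo.CurrentCulture = CultureInfo.GetCultureInfo("en-US");
            var capturing = new CapturingFormatter(CultureInfo.CurrentCulture);
            var deferred = new DeferredFormatter(() => CultureInfo.CurrentCulture);

            // The "current" culture changes after the formatters were created.
            CultureInfo.CurrentCulture = CultureInfo.GetCultureInfo("de-DE");

            // With standard culture data: ("1,234.5", "1.234,5")
            return (capturing.Format(1234.5), deferred.Format(1234.5));
        }
        finally
        {
            CultureInfo.CurrentCulture = original; // restore for the caller
        }
    }
}
```

The capturing formatter keeps producing en-US output even though the thread has moved on to de-DE.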

For .NET, I am envisioning a static API like the ones Microsoft made that can be used directly by advanced users (accepting a lump of settings and culture data as parameters) with an ICU-like fluent API for novice users or those who want the formatting settings to be "in plain English" in the code.

The ideal solution would extend .NET so message format works with string interpolation and other parts of the framework, but I think there needs to be a discussion with Microsoft to be able to pull that off. It wouldn't be very practical for Microsoft to marry the ICU functionality with the .NET formatters using the underlying ICU library, especially being that there is still an option to "opt out" of ICU in .NET Core.

glen-84 commented 1 year ago

> Or fortunate, depending on how you look at it.

Well, I didn't necessarily suggest that they'd just call into the C version. They have more resources to write a custom implementation if they really wanted to.

> I wasn't trying to discourage anyone from adding the date, time, and duration features, but was just trying to be helpful on how much work is involved in doing such a thing.

I appreciate that. It's very clear that nothing localization-related is trivial.

> Seems a bit strange to do it that way, since messageformat.net is tiny and ICU4N has ~20MB of resource files (the next release will put them into satellite assemblies). IMO, adding the extra features to messageformat.net would be good since they can be combined more easily to build the formatted output.

I'm not sure what you mean – are you suggesting just using .NET format strings? If so, my point is that this will no longer match ICU standards (there will likely be a lot of little differences, plus differences in formatting symbols, etc.), and implementing all the ICU stuff in this library would be a large undertaking.

> I think there needs to be a discussion with Microsoft to be able to pull that off

Let me know if you open any issues in this regard, I'd be happy to 👍 and follow along.

jeffijoe commented 1 year ago

I can implement very basic support as mentioned by @glen-84 to at least get the ball rolling. I would just like to know which format codes to use for the various styles.

For example, if I use N for decimal, formatting a decimal input = 69, I get 69.000 for the en culture, is that correct?

glen-84 commented 1 year ago

I don't think there is a decimal style?

I've put some data together (count = 1234567.1234567):

| Style | Implementation | en-US | de-DE | ar |
| --- | --- | --- | --- | --- |
| `{count, number}` | FormatJS | 1,234,567.123 | 1.234.567,123 | ١٬٢٣٤٬٥٦٧٫١٢٣ |
| `{count, number}` | PHP | 1,234,567.123 | 1.234.567,123 | ١٬٢٣٤٬٥٦٧٫١٢٣ |
| `{count, number}` | C# (Format specifier: "N") | 1,234,567.123 | 1.234.567,123 | 1٬234٬567٫123 |
| `{count, number, currency}` | FormatJS (::currency/?) | $1,234,567.12 | 1.234.567,12 € | ١٬٢٣٤٬٥٦٧٫١٢ ر.س. |
| `{count, number, currency}` | PHP (::currency/?) | $1,234,567.12 | 1.234.567,12 € | ١٬٢٣٤٬٥٦٧٫١٢ ر.س. |
| `{count, number, currency}` | C# (Format specifier: "C") | $1,234,567.12 | 1.234.567,12 € | 1٬234٬567٫12 ر.س. |
| `{count, number, integer}` | FormatJS | 1,234,567 | 1.234.567 | ١٬٢٣٤٬٥٦٧ |
| `{count, number, integer}` | PHP | 1,234,567 | 1.234.567 | ١٬٢٣٤٬٥٦٧ |
| `{count, number, integer}` | C# (Format specifier: "N0") | 1,234,567 | 1.234.567 | 1٬234٬567 |
| `{count, number, percent}` | FormatJS | 123,456,712% | 123.456.712 % | ١٢٣٬٤٥٦٬٧١٢٪ |
| `{count, number, percent}` | PHP | 123,456,712% | 123.456.712 % | ١٢٣٬٤٥٦٬٧١٢٪ |
| `{count, number, percent}` | C# (Format specifier: "P0") | 123,456,712% | 123.456.712 % | 123٬456٬712٪ |

Notes:

The .NET localization doesn't seem to localize Arabic numbers.

NightOwl888 commented 1 year ago

> The .NET localization doesn't seem to localize Arabic numbers.

As I had previously mentioned, .NET parsers and formatters only support ASCII digits (that is 0-9).

However, .NET does provide the digits for each culture in the NumberFormatInfo.NativeDigits property.

The .NET formatter doesn't have many moving parts and is open source. I have copied it into J2N in order to add additional functionality to it (although, in our case I just wanted to add a "J" (Java) format to it). Simply copying and pasting and then modifying it to display the native digits isn't very complicated except for the fact that the .NET code has optimizations that may not be supported on older versions of .NET that you may need to conditionally compile for. We opted not to add a dependency on System.Memory, but in hindsight that was a mistake.

Note also that round trip formatting is completely broken before .NET Core 3. Copying the code from a recent release of .NET Core is also a way to make the formatting (rounding) consistent between different .NET flavors.

Here are some of the differences I noticed between the .NET formatter and ICU4J.

- ICU4J supports a minimum and maximum number of decimal places, but .NET only supports an exact number of decimal places (and only in the "N" or "F" formats). There is no way to make it simply float the way Java does unless using a custom number pattern.
- ICU4J has a way to add the currency code in addition to the currency symbol, which .NET is lacking. .NET doesn't even have an API where you can get the currency codes. As previously mentioned, there can also be more than one currency code per culture.
- ICU4J has format strings to specify where to put the currency symbol, how to display the negative format, etc. .NET uses an integer for each of these, so they cannot be customized by the end user.

Do note that in .NET we have the decimal format which works for most use cases for currency. However, in Java the formatter uses a BigDecimal type (arbitrarily large number). This implementation seems to be pretty accurate. Its parser seems to accept non ASCII digits, but it looks like they primarily used .NET's built in formatter for displaying the numbers, which are ASCII digits only.

glen-84 commented 1 year ago

> As I had previously mentioned, .NET parsers and formatters only support ASCII digits (that is 0-9).

Apologies, I missed that.

I see that someone wanted to work on implementing this in the runtime (https://github.com/dotnet/runtime/issues/47749), but it was declined for questionable reasons. I guess we can just do the substitution ourselves.

@jeffijoe Let me know if you have any other questions.

NightOwl888 commented 1 year ago

Thanks for the link. The decision makes sense given the huge effort it must have taken to optimize the parsers and formatters. Still, since they already have a DigitSubstitution property for it that is "reserved for future use", I don't really understand why they wouldn't add a slower path that could be enabled using that property.

Non-ASCII digits may include surrogate pairs so using them may require up to 2 chars per digit, which complicates the logic a bit and will definitely be slower than simply using ASCII digits. This is fine as long as the ASCII path is optimized so it doesn't have to deal with double-character substitutions.

FYI - The ICU way of doing substitutions is to allow a "numbers=" parameter on the culture string so the numbering system can be defined when the Locale object is created. ICU4N allows this syntax with the UCultureInfo class (although currently numbering system support is a work in progress).

In .NET, the same functionality is allowed only by subclassing CultureInfo and making the subclass set custom values (in this case, setting the NativeDigits and DigitSubstitution) so it can be re-used. You can also create a NumberFormatInfo object (or clone one) and set the properties manually before passing them to a formatter or parser as a one-off. However, as pointed out it is currently pointless because the formatters and parsers don't support these properties. But messageformat.net could.

jeffijoe commented 1 year ago

@glen-84 if you can get me a map of the various pre-defined styles for dates, times and timestamps and what they map to for the dotnet formatting codes, that would help a lot.

jeffijoe commented 1 year ago

Opened a draft PR @glen-84 @NightOwl888 https://github.com/jeffijoe/messageformat.net/pull/33/files

glen-84 commented 1 year ago

@jeffijoe Sorry about the delay, I'll try to get back to you within the next ~2 weeks.

glen-84 commented 1 year ago

@jeffijoe

I don't know if this is going to be feasible. 😢

There's no medium or full date format in .NET, only short and long.

| Style | Implementation | en-US | de-DE | ar |
| --- | --- | --- | --- | --- |
| `{date, date}` | FormatJS | 1/1/2000 | 1.1.2000 | ١/١/٢٠٠٠ |
| `{date, date}` | PHP | Jan 1, 2000 | 01.01.2000 | ٠١/٠١/٢٠٠٠ |
| `{date, date}` | C# (Format specifier: "d") | 1/1/2000 | 01.01.2000 | 1‏‏/1‏‏/2000 |
| `{date, date, full}` | FormatJS | Saturday, January 1, 2000 | Samstag, 1. Januar 2000 | السبت، ١ يناير ٢٠٠٠ |
| `{date, date, full}` | PHP | Saturday, January 1, 2000 | Samstag, 1. Januar 2000 | السبت، ١ يناير ٢٠٠٠ |
| `{date, date, full}` | C# (Format specifier: "D") | Saturday, January 1, 2000 | Samstag, 1. Januar 2000 | السبت، 1 يناير 2000 |
| `{date, date, long}` | FormatJS | January 1, 2000 | 1. Januar 2000 | ١ يناير ٢٠٠٠ |
| `{date, date, long}` | PHP | January 1, 2000 | 1. Januar 2000 | ١ يناير ٢٠٠٠ |
| `{date, date, long}` | C# (Format specifier: "?") | | | |
| `{date, date, medium}` | FormatJS | Jan 1, 2000 | 1. Jan. 2000 | ١ يناير ٢٠٠٠ |
| `{date, date, medium}` | PHP | Jan 1, 2000 | 01.01.2000 | ٠١/٠١/٢٠٠٠ |
| `{date, date, medium}` | C# (Format specifier: "?") | | | |
| `{date, date, short}` | FormatJS | 1/1/00 | 1.1.00 | ١/١/٠٠ |
| `{date, date, short}` | PHP | 1/1/00 | 01.01.00 | ١/١/٢٠٠٠ |
| `{date, date, short}` | C# (Format specifier: "d") | 1/1/2000 | 01.01.2000 | 1‏‏/1‏‏/2000 |

If you do decide to proceed with something, I can add another table for time formats.

jeffijoe commented 1 year ago

@glen-84

So I see 2 viable options (short of doing a full-blown implementation which I won't have time for now):

1. Don't support the style, just format with g by default and support the :: skeleton syntax
2. Support just the styles that Dotnet supports (of which I need a mapping), and support the :: skeleton syntax

glen-84 commented 1 year ago

I think a middle-ground would be to:

1. By default, map no style, medium, and short to "d", and full and long to "D".
2. Allow the user to set a custom .NET format string for each style and locale combination.
   - For example, set configuration or call a method like SetDateStylePattern(DateStyle.MEDIUM, "de-DE", "m.d.y").
   - When formatting a de-DE date with the medium style, it would then use the format m.d.y instead of the default d.
   - This would allow users to set custom formats per locale, and/or add appropriate patterns for medium and full.
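The default mapping could look roughly like this. It is only a sketch: `DateStyleSketch` and its methods are hypothetical names, and the per-locale override mechanism is left out entirely.

```csharp
using System;
using System.Globalization;

public static class DateStyleSketch
{
    // Hypothetical default mapping from ICU date styles to the two
    // standard .NET date specifiers: "d" (short date) and "D" (long date).
    public static string ToDotNetFormat(string icuStyle) => icuStyle switch
    {
        "full" => "D",
        "long" => "D",
        _      => "d", // no style, "medium" and "short" all fall back to "d"
    };

    public static string Format(DateTime value, string icuStyle, CultureInfo culture) =>
        value.ToString(ToDotNetFormat(icuStyle), culture);
}
```

With en-US, `DateStyleSketch.Format(new DateTime(2000, 1, 1), "full", CultureInfo.GetCultureInfo("en-US"))` gives `Saturday, January 1, 2000`, and the "short" style gives `1/1/2000`, which matches the C# rows in the table above.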

Regarding skeletons, they're not the same as format patterns, so they would not be simple to implement without full locale data.

See an example here. A skeleton like MMMMdjmm actually expands to a pattern, depending on the locale. For en-US, it may expand to MMMM d 'at' h:mm a, for es_ES to d 'de' MMMM, H:mm, etc.

It's also important to note that the skeleton itself should use the ICU characters for skeletons, which may not match those used in .NET.

For this reason, it may be best not to support skeleton syntax at this time.

glen-84 commented 1 year ago

For time:

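| Style | Implementation | en-US | de-DE | ar |
| --- | --- | --- | --- | --- |
| `{time, time}` | FormatJS | 1:01:01 AM | 01:01:01 | ١:٠١:٠١ |
| `{time, time}` | PHP | 1:01:01 AM | 01:01:01 | ١:٠١:٠١ |
| `{time, time}` | C# (Format specifier: "T") | 1:01:01 AM | 01:01:01 | 1:01:01 ص |
| `{time, time, full}` | FormatJS | 1:01:01 AM UTC | 1:01:01 UTC | ١:٠١:٠١ ص UTC |
| `{time, time, full}` | PHP | 1:01:01 AM Coordinated Universal Time | 01:01:01 Koordinierte Weltzeit | ١:٠١:٠١ ص التوقيت العالمي المنس |
| `{time, time, full}` | C# (Format specifier: "?") | | | |
| `{time, time, long}` | FormatJS | 1:01:01 AM UTC | 1:01:01 UTC | ١:٠١:٠١ ص UTC |
| `{time, time, long}` | PHP | 1:01:01 AM UTC | 01:01:01 UTC | ١:٠١:٠١ ص UTC |
| `{time, time, long}` | C# (Format specifier: "?") | | | |
| `{time, time, medium}` | FormatJS | 1:01:01 AM | 01:01:01 | ١:٠١:٠١ |
| `{time, time, medium}` | PHP | 1:01:01 AM | 01:01:01 | ١:٠١:٠١ |
| `{time, time, medium}` | C# (Format specifier: "T") | 1:01:01 AM | 01:01:01 | 1:01:01 ص |
| `{time, time, short}` | FormatJS | 1:01 AM | 01:01 | ١:٠١ |
| `{time, time, short}` | PHP | 1:01 AM | 01:01 | ١:٠١ |
| `{time, time, short}` | C# (Format specifier: "t") | 1:01 AM | 01:01 | 1:01 ص |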

jeffijoe commented 1 year ago

> Set configuration or call a method like SetDateStylePattern(DateStyle.MEDIUM, "de-DE", "m.d.y")

I'd rather not do this as it would require someone to either replace or reach into the formatter library to be able to configure the relevant formatter, or we would need some sort of options bag to pass around. I think mapping as you mentioned would be sufficient. Seeing how inconsistent the behavior is across all these implementations/runtimes makes me feel less bad about it. 😅

glen-84 commented 1 year ago

You already have Pluralizers, so the API could be similar:

var mf = new MessageFormatter();

mf.DateStylePatterns.Add(
    "de-DE",
    style => style switch
    {
        DateStyle.MEDIUM => "m.d.y",
        _ => null // use default format
    });

Just an idea. There may be better designs.

jeffijoe commented 1 year ago

@glen-84 even if we did that, it looks like it would still be inconsistent, seeing as how some libraries return different characters for numbers for certain languages?

NightOwl888 commented 1 year ago

> @glen-84 even if we did that, it looks like it would still be inconsistent, seeing as how some libraries return different characters for numbers for certain languages?

You could potentially fix that up by using the TryFormat overloads of numbers and dates. These accept a Span<char> as a parameter. You could then retrieve the NumberFormatInfo for the current culture (there is a GetInstance() method that allows you to get an instance based on culture). NumberFormatInfo.NativeDigits contains the strings to replace the ASCII digits in the output. If you wanted to provide substitutions, there would probably need to be another mechanism to look up the culture you want to substitute. You may also consider allowing the user to pass an array to define the digit strings.

Do note that in most cases there is 1 character, but there may be surrogate pairs so the logic should be able to handle replacements of 2 (or more) characters. As long as the Span<char> has enough room allocated, this should be no problem. You can simply rewrite the chars within the span without allocating any more memory. The .NET platform has a ValueStringBuilder class which can probably simplify the replacement operations. It has a similar signature as StringBuilder. You can replace the first digit char and if there is more than one, call the Insert() method for the remaining chars.
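As a rough illustration of the replacement step, here is a sketch that post-processes an already formatted string. To stay simple it uses a plain StringBuilder rather than Span<char> or ValueStringBuilder, and `NativeDigitSketch.Substitute` is a made-up name. Because each digit is appended as a string, entries longer than one char (surrogate pairs) are handled without special cases.

```csharp
using System;
using System.Text;

public static class NativeDigitSketch
{
    // Replaces ASCII digits '0'..'9' in an already formatted string with the
    // matching entries of nativeDigits (ten strings, index 0 through 9).
    public static string Substitute(string formatted, string[] nativeDigits)
    {
        if (formatted == null) throw new ArgumentNullException(nameof(formatted));
        if (nativeDigits == null || nativeDigits.Length != 10)
            throw new ArgumentException("Expected exactly ten digit strings.", nameof(nativeDigits));

        // Fast path: the digit set is assumed to be ASCII if it starts with "0".
        if (nativeDigits[0] == "0") return formatted;

        var sb = new StringBuilder(formatted.Length);
        foreach (char c in formatted)
        {
            if (c >= '0' && c <= '9')
                sb.Append(nativeDigits[c - '0']); // may append more than one char
            else
                sb.Append(c); // separators, signs and symbols are left untouched
        }
        return sb.ToString();
    }
}
```

A caller could pass `CultureInfo.GetCultureInfo("ar-SA").NumberFormat.NativeDigits`, or its own array, as the second argument. This only swaps digit characters; it does nothing about separators or right-to-left ordering.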

Also note that the Span<char> can be allocated on the stack provided it is short enough, which should improve the performance of the replacement after the first format operation.

The time zone info is also available in a DateTime. However, I couldn't get it to work right for time zones other than UTC or the local time zone unless I used DateTimeOffset instead of DateTime to do the formatting. The medium time and medium date can be custom format strings.

jeffijoe commented 1 year ago

@NightOwl888 it's also a matter of whether or not we should. For instance, the FormatJS and PHP implementations return different outputs for the ar locale.

Currently, we target netstandard2.0 which means Span<T> is not available; if we were to do this, then I would probably want to do some other things first:

1. Remove support for netstandard2.0
2. Probably rewrite some internals to take advantage of Span<T> when possible

So I think for now, sticking with the built-in formatting would probably be best until I or someone else has the capacity to take this on. When I originally wrote this library, I just wanted it to support pluralization and choice; I never imagined it would venture into these deep waters of localization. 😅

NightOwl888 commented 1 year ago

Well, you do have a point about Arabic. I admit I didn't consider right to left text, so the digits might not be in the right order just by replacing them like that. And the AM/PM text, months, years, etc. you would need to get from the CLDR.

Actually, the System.Memory package contains the Span<T> data type, which does support netstandard2.0. But I don't think the TryFormat() methods are available before .NET Core. Several other APIs that accept Span<T> are also missing depending on the version targeted. So, yea, to keep netstandard2.0 support, it might be wise to multi-target so you can conditionally compile and gracefully degrade for older versions. You can still support Span<T> for older versions, all this means is that you have to allocate a string on the heap for netstandard2.0 then convert it to a Span<T> to do the replacement. It will just be slower but can do the same thing.
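A sketch of what that graceful degradation might look like, assuming the project multi-targets netstandard2.0 (with the System.Memory package) alongside a newer target that has TryFormat. `SpanFormattingSketch` is a hypothetical helper; `NETSTANDARD2_0` is the symbol the SDK defines for that target.

```csharp
using System;

public static class SpanFormattingSketch
{
    // Writes the formatted value into destination and reports how many chars
    // were written. Returns false if the destination is too small.
    public static bool TryFormatNumber(
        double value, Span<char> destination, out int charsWritten,
        string format, IFormatProvider provider)
    {
#if NETSTANDARD2_0
        // No TryFormat on this target: allocate a string, then copy it over.
        string text = value.ToString(format, provider);
        if (text.Length > destination.Length)
        {
            charsWritten = 0;
            return false;
        }
        text.AsSpan().CopyTo(destination);
        charsWritten = text.Length;
        return true;
#else
        // Newer targets can format straight into the caller's buffer.
        return value.TryFormat(destination, out charsWritten, format, provider);
#endif
    }
}
```

Callers get the same signature on both targets; the older one just pays for an extra string allocation.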

> So I think for now, sticking with the built-in formatting would probably be best until I or someone else has the capacity to take this on.

No problem.

I live in Thailand. Thai has its own numbering system, but it is so rarely used I have only ever seen the numbers in temples and way out in small village markets. Most commerce uses ASCII digits, which is probably the same case in most other languages. So, just using the built-in formatting would probably be fine.

> When I originally wrote this library, I just wanted it to support pluralization and choice

Well, it still doesn't support choice, which is the only reason I wanted to use it :).

jeffijoe commented 1 year ago

~Sorry, I meant p (to me they are the same thing 🙃 )~

I re-read the issue and remember it was similar to plural 😂

glen-84 commented 1 year ago

> @glen-84 even if we did that, it looks like it would still be inconsistent, seeing as how some libraries return different characters for numbers for certain languages?

I see that as a secondary issue. According to this, it should be as simple as replacing the digit characters, but I don't know for sure.

Reading the native digits for ar-SA: https://dotnetfiddle.net/DyKhgg

glen-84 commented 1 year ago

Thanks for working on this Jeff. 👍

jeffijoe commented 1 year ago

You're welcome, and thank you for all your help @glen-84 as well as @NightOwl888 for the research done to make this happen!

NightOwl888 commented 1 year ago

@jeffijoe - Did you end up doing the substitution on native digits? I did that for the RuleBasedNumberFormat functionality of ICU4N after getting the formatted value from .NET. The native digits are supplied in the NativeDigits property of NumberFormatInfo (reachable through CultureInfo.NumberFormat). It can be optimized by checking whether the first native digit is an ASCII 0 and, if so, skipping the replacement operation.

jeffijoe commented 1 year ago

@NightOwl888 I did not, I decided to keep it simple. Didn't you also mention some potential caveats with the length of digits in some languages in terms of unicode?

NightOwl888 commented 1 year ago

Yeah, so what we ended up doing was copying the ValueStringBuilder ref struct code from Microsoft (it is internal in .NET). This uses a Span<T> internally so it can be placed on the stack, but exposes a StringBuilder-like API. We check a couple of conditions to see whether ASCII digits should be used and do a single Append in those cases. For cases that fall through, we loop through the chars and convert them one at a time. Therefore, if they are 2 chars they will be replaced appropriately.

https://github.com/NightOwl888/ICU4N/blob/1545efb0950fcbbd17e078fed2c3989faef3548a/src/ICU4N/Support/IcuNumber.Formatting.cs#L220-L251

See the rest of the file for usage examples. We have our own UNumberFormatInfo class because it uses different properties than in .NET, so you will need to rework to use NumberFormatInfo.

This would get it a little closer to the ICU behavior.

The DigitShapes enum (for the NumberFormatInfo.DigitSubstitution property) has a Context setting and I am not sure how that could be made to work using .NET only, since it relies on detecting the culture of the script.

The main entry point into the functionality is here: https://github.com/NightOwl888/ICU4N/blob/1545efb0950fcbbd17e078fed2c3989faef3548a/src/ICU4N/Support/FormatNumberRuleBasedExtension.cs and https://github.com/NightOwl888/ICU4N/blob/1545efb0950fcbbd17e078fed2c3989faef3548a/src/ICU4N/Support/FormatNumberRuleBased.cs. This was redesigned from the RuleBasedNumberFormat class to be static to match .NET formatting behavior, which allows us to use the stack, ReadOnlySpan<T>, optimize overloads by data type to prevent boxing, and eliminate the need to lock to provide thread safety.

The NumberFormatRules instance is passed into the static method to control formatting behavior (which can either be culture-based or custom - although custom isn't yet supported).

jeffijoe commented 1 year ago

Thanks for the info!

I will be happy to accept a PR for this, but it's not something I myself would be doing right now since we still target platforms that don't have Span, and so implementing this to support non-Span as well would not be fun.

NightOwl888 commented 1 year ago

Well, as previously mentioned, it just takes a dependency on the System.Memory package to add support for ReadOnlySpan<T>, which has support for netstandard2.0 and .NET Framework.

jeffijoe commented 1 year ago

If the System.Memory package does not add any overhead for platforms with native support, then that could work.

Are you willing to work on this?

NightOwl888 commented 1 year ago

> Are you willing to work on this?

I am, but I am underwater on a few projects right now.

jeffijoe / messageformat.net

Number, date, time, duration formatters #17