Data model feedback: I think we should have string and numeric literals

unicode-org / message-format-wg

Developing a standard for localizable message strings


Data model feedback: I think we should have string and numeric literals #712

Open mihnita opened 6 months ago

mihnita commented 6 months ago

At this point the data model only has string literals:

interface Literal {
  type: "literal";
  value: string;
}

The parser also has number-literal

literal        = quoted / unquoted
quoted         = "|" *(quoted-char / quoted-escape) "|"
unquoted       = name / number-literal

When we format a message, we use only the data model. That means there is no way to tell the difference between "...{|123456789|}..." and "...{123456789}...", because in the data model we only have a string, and "The presence or absence of quotes is not preserved by the data model."

But I think one would expect {|123456.789|} to result in "123456.789" (because it is a string), and would expect {123456.789} to be formatted as "123,456.789" (or "123.456,789", maybe with alternate digits), because "it is a number".
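As a rough illustration of the two expectations, the sketch below contrasts plain string output with locale-aware number formatting. It uses the standard `Intl.NumberFormat` API rather than any MessageFormat implementation, and the `en-US` locale is an arbitrary choice for the example.

```typescript
// Illustration only: this is ECMAScript Intl, not an MF2 formatter.
const value = 123456.789;

// "It is a string": the digits come out exactly as written.
const asString: string = String(value);

// "It is a number": grouping and decimal separators follow the locale.
const asNumber: string = new Intl.NumberFormat("en-US", {
  maximumFractionDigits: 3,
}).format(value);

console.log(asString); // 123456.789
console.log(asNumber); // 123,456.789
```

Other locales would pick different separators or digits, which is the behavior the unquoted form is expected to trigger.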

It means the placeholders without functions are not intuitive:

"...{123456789}..." => "...123456789..."
"...{123456789 :number}..." => "...123,456,789..."

Numeric literals are also found in options: ...{$foo :function opt1=bar opt2=baz opt3=42}..., and in decision keys.

TLDR: We have numeric literals in syntax. We need to know if a literal was numeric when we format to string. But we drop that info in the data model, which sits in the middle.
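A minimal sketch of the information loss described above, assuming a hypothetical `toLiteral` helper that stands in for the parser step (escape handling is left out). Only the `Literal` interface comes from the data model quoted earlier.

```typescript
// Data model literal, as quoted from the spec above.
interface Literal {
  type: "literal";
  value: string;
}

// Hypothetical helper: maps the source text of a literal to the data model.
// Quotes are stripped, so the quoted and unquoted spellings converge.
function toLiteral(source: string): Literal {
  const quoted =
    source.length >= 2 &&
    source.charAt(0) === "|" &&
    source.charAt(source.length - 1) === "|";
  return { type: "literal", value: quoted ? source.slice(1, -1) : source };
}

const fromQuoted = toLiteral("|123456789|");
const fromUnquoted = toLiteral("123456789");

// Both objects are identical, so a formatter cannot tell them apart.
const indistinguishable =
  fromQuoted.type === fromUnquoted.type &&
  fromQuoted.value === fromUnquoted.value;
console.log(indistinguishable); // true
```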

eemeli commented 6 months ago

https://github.com/unicode-org/message-format-wg/blob/e76196481b23e6e9245923a1239282e19484efd0/spec/formatting.md?plain=1#L175-L178

aphillips commented 6 months ago

Numeric literals are not numbers. They are a sub-production of literal that makes it convenient to use numeric values in the syntax. We have number-literal instead of mutating name a bunch.

That is, it is acceptable to add quotes to any numeric-literal.

I see the problem that you're grappling with, @mihnita, which is that you can't reflect off of a string in a placeholder to get a number. You might like number-literal to turn into a number, but {|123|} is just as valid as {123}. What I think you'll have to do to get the intuitive behavior you're after is check if the literal parses as a number in order to support automatic assignment of :number instead of :string.
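The check suggested here could look roughly like the sketch below. The regular expression is an assumed transcription of the number-literal production and should be verified against the ABNF in the spec; `defaultFunctionFor` is a hypothetical name, not part of any implementation.

```typescript
// Assumption: number-literal is an optional minus sign, an integer part
// without leading zeros, an optional fraction, and an optional exponent.
const NUMBER_LITERAL = /^-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][-+]?[0-9]+)?$/;

// Hypothetical helper: choose a default function for an unannotated literal
// by looking at its string value only.
function defaultFunctionFor(value: string): ":number" | ":string" {
  return NUMBER_LITERAL.test(value) ? ":number" : ":string";
}

console.log(defaultFunctionFor("123456.789")); // :number
console.log(defaultFunctionFor("1-liter-per-100-kilometers")); // :string
```

Because the check sees only the string value, {|123|} and {123} still get the same treatment, which matches the point that quoting a number-literal does not change its meaning.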

macchiati commented 6 months ago

Literals in the message text are always strings. Their structure is given meaning by the function that looks at them. We might have some specific formats defined by the standard registry functions that can be shared across other functions, inside and outside the registry, but a custom function can define its own structure for a literal. Moreover, there are different environments where literals are found, and the structure may be specific to them.

For example:

.local $consumption = {|1-liter-per-100-kilometers| :u:measure}
.match {$distance}
|[0,5)-meter| {{Not far enough.}}
...

The second literal above represents a range (open, closed, or half-open). It would not be acceptable as the operand of a .local for :u:measure, but it might be for a :u:measureRange.

I think literals that appear in a message without a function cannot be anything other than strings; typically they are quoted only because some character needs escaping.


mihnita commented 6 months ago

What I think you'll have to do to get the intuitive behavior you're after is check if the literal parses as a number in order to support automatic assignment of :number instead of :string.

This is independent of :number. We have places where we take numbers in options (...{$foo :bar opt=21}...).

The function :bar should not know about :number. Every single function taking "numeric options" will need a way to parse a string to a number :-( and it will have to do so without calling the (internal) :number parser.

And the MF2 implementation itself should not know about :number (which is a function like any other, it "just happens" to be standard).

And when I say "intuitive behavior" I am mostly thinking about users of MF2, someone writing messages. Intuition works without thinking. If I have to read the registry and decide "ok, this is parseable by :number" then it is not intuition anymore. It is "learn to live with it" against intuition.

And that intuition might be programming language dependent :-) 1 == "1" is true in JavaScript and Perl, but not in Java or Python.

Literals in the message text are always strings

Absolutely. But here we are talking about the data model.

aphillips commented 6 months ago

@mihnita I think options are the same thing. Functions need to specify what string serialization they accept. For an expression like {$count :number minimumFractionDigits=1}, the 1 has to be a specific pattern which the :number backing function parses into the value.

In your case, you're probably using NumberFormatter as your ultimate formatter, but you'll have some code that parses the option value to make it into a number (or kvetches that it isn't sufficiently numeric).

In the data model, the value of the option is a string. In the function registry, the value of that string might be constrained.
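A minimal sketch of the option handling described above, assuming the backing function receives option values as strings. The helper name and the accepted pattern are illustrative assumptions, not definitions from the registry.

```typescript
// Hypothetical helper that a function implementation might use for options
// such as minimumFractionDigits. The data model hands it a string; the
// function decides which string patterns it accepts.
function parseNonNegativeIntegerOption(name: string, raw: string): number {
  if (!/^(0|[1-9][0-9]*)$/.test(raw)) {
    // "Kvetch" that the value isn't sufficiently numeric.
    throw new RangeError(
      `Option ${name} expects a non-negative integer, got "${raw}"`
    );
  }
  return Number(raw);
}

// {$count :number minimumFractionDigits=1}
const minimumFractionDigits = parseNonNegativeIntegerOption(
  "minimumFractionDigits",
  "1"
);
console.log(minimumFractionDigits); // 1
```

Each function that takes numeric options would need something like this, which is the duplication the earlier comment objects to.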

eemeli commented 6 months ago

@mihnita How would you represent the operand of this expression in the data model?

{ 1.00 :x:number }

mihnita commented 6 months ago

@mihnita How would you represent the operand of this expression in the data model?
{ 1.00 :x:number }

Same as today, except that 1.00 would be a NumberLiteral instead of Literal. And {|1.00| :x:number} would be a StringLiteral.

Same as JS and most programming languages, 1.00 is a number, "1.00" or '1.00' is a string. So 1.00 == 1.0, but |1.00| != |1.0|

type Literal = StringLiteral | NumberLiteral

interface StringLiteral {
  type: "string-literal";
  value: string;
}

interface NumberLiteral {
  type: "number-literal";
  value: number;
  source: string; // Maybe, TBD
}
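The equality behavior claimed for this proposal can be sketched as follows. The interfaces are restated from the proposal; `numberLiteral`, `stringLiteral`, and `literalsEqual` are hypothetical helpers added for illustration, and the use of JavaScript `number` for the value is an assumption carried over from the proposal's `value: number`.

```typescript
interface StringLiteral {
  type: "string-literal";
  value: string;
}

interface NumberLiteral {
  type: "number-literal";
  value: number;
  source: string; // Marked "Maybe, TBD" in the proposal
}

type Literal = StringLiteral | NumberLiteral;

// Hypothetical constructors standing in for what a parser would emit.
function numberLiteral(source: string): NumberLiteral {
  return { type: "number-literal", value: Number(source), source };
}

function stringLiteral(value: string): StringLiteral {
  return { type: "string-literal", value };
}

// Numbers compare by numeric value, strings by exact text.
function literalsEqual(a: Literal, b: Literal): boolean {
  if (a.type === "number-literal" && b.type === "number-literal") {
    return a.value === b.value;
  }
  if (a.type === "string-literal" && b.type === "string-literal") {
    return a.value === b.value;
  }
  return false; // A number literal never equals a string literal here.
}

console.log(literalsEqual(numberLiteral("1.00"), numberLiteral("1.0"))); // true
console.log(literalsEqual(stringLiteral("1.00"), stringLiteral("1.0"))); // false
```

Keeping `source` alongside `value` would let a function such as :x:number still see the trailing zeros in 1.00 if it cares about them.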

macchiati commented 6 months ago

I think it can be misleading to talk about 'the' data model, without context.

"This section defines a data model representation of MessageFormat 2 messages.

Implementations are not required to use this data model for their internal representation of messages. Neither are they required to provide an interface that accepts or produces representations of this data model."

So I presume you mean, in the data model (used by a particular implementation). And there can be a lot of variation.

An implementation could produce a 'deep' data model where as much as possible is transformed into internal data types and optimized for runtime use. In this particular case, it would do that by calling :x:number to parse the literal operand, and that 1.00 might be represented as a double, a BigNumber, a Rational, a ComplexNumber, or some other datatype. Similarly, in {$var :number numberingSystem=arab} the option value might be converted to a NumberingSystem enum, so that at runtime it does not need to be parsed in order to pass it to an UnlocalizedNumberFormatter. In fact, most of the :number options could be used to build an UnlocalizedNumberFormatter, and then at runtime the only additional parameters that are needed are the locale and the value of $var.


catamorphism commented 6 months ago

IMO distinguishing between types of literals isn't too useful without introducing a type system.

On the one hand we have "all literals are strings". On the other hand, we could introduce typing rules, which could mean requiring input variables to be annotated with types, or could mean a sort of hybrid approach where type errors involving only literals are statically checked (that is, checked whenever data models are checked). My feeling is that points on the design spectrum between those two points aren't too helpful, because eventually you stumble into a type system and you might as well start out with one.

I'm not against a type system, but it might take some thought to figure out how to let custom function writers specify the types of their functions in a programming-language-neutral way. It would be a hard problem how to reconcile a type system for MessageFormat with the ability to write custom functions and the possibility that those functions might be implemented in a unityped language like JS.

macchiati commented 6 months ago

I think the spec should be neutral as to whether the implementation uses strong typing, weak typing, or completely untyped. That is, a data model in a real implementation should be able to use strong typing, but we should not prescribe it.

That is, it would be perfectly fine to have an implementation generate a data model where a string in the message source like

.local $var = {1 :x:number style=compact foo=bar}

Turns into strong-typed, front-loaded:

MFFunction f = registry.lookup("x:number");
put("var", new Expression(f.parseLiteral("1"), f, f.parseOptions("style=compact foo=bar")));

where f.parseLiteral("1") produces new Complex(1,0)
and f.parseOptions produces Map.of("style", Style.valueOf("compact"))

Or, it could turn into the completely untyped, back-loaded:

put("var", new Expression("1", "x:number", "style=compact, foo=bar"))
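The two shapes can be contrasted in a small sketch. All names here (`RawExpression`, `MessageFunction`, `ResolvedExpression`, `frontLoad`, `xNumber`, `Style`) are hypothetical and chosen for illustration; the operand is parsed to a plain number rather than a complex type to keep the example short.

```typescript
// Back-loaded: operand and options stay strings until format time.
interface RawExpression {
  operand: string;
  functionName: string;
  options: Record<string, string>;
}

// Front-loaded: the function parses operand and options up front.
interface MessageFunction<V, O> {
  parseLiteral(source: string): V;
  parseOptions(raw: Record<string, string>): O;
}

interface ResolvedExpression<V, O> {
  operand: V;
  fn: MessageFunction<V, O>;
  options: O;
}

type Style = "compact" | "standard";

// Hypothetical stand-in for a custom :x:number function.
const xNumber: MessageFunction<number, { style: Style }> = {
  parseLiteral(source) {
    const n = Number(source);
    if (isNaN(n)) throw new RangeError(`Not a number: "${source}"`);
    return n;
  },
  parseOptions(raw) {
    // Options this function does not know about (such as foo) are ignored.
    const style = raw["style"];
    if (style === undefined) return { style: "standard" };
    if (style === "compact" || style === "standard") return { style };
    throw new RangeError(`Unknown style: "${style}"`);
  },
};

function frontLoad<V, O>(
  raw: RawExpression,
  fn: MessageFunction<V, O>
): ResolvedExpression<V, O> {
  return {
    operand: fn.parseLiteral(raw.operand),
    fn,
    options: fn.parseOptions(raw.options),
  };
}

// .local $var = {1 :x:number style=compact foo=bar}
const raw: RawExpression = {
  operand: "1",
  functionName: "x:number",
  options: { style: "compact", foo: "bar" },
};
const resolved = frontLoad(raw, xNumber);
console.log(resolved.operand, resolved.options.style); // 1 compact
```

In both shapes the typing decision sits with the function, not with the message syntax, so the interchange data model can keep the operand as a string.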


aphillips commented 6 months ago

@macchiati

"The" data model in our discussions refers to the data model defined in the specification. It is intended as an interchange format and thus can be formalized. Implementations are not required to implement it (or any other data model) and we say this explicitly. They can also extend "the" data model.

I think the spec should be neutral as to whether the implementation uses strong typing, weak typing, or completely untyped. That is, a data model in a real implementation should be able to use strong typing, but we should not prescribe it.

We go out of our way not to be typed or to favor a given type system, but we recognize that implementations cannot avoid typing. The whole point of message formatting, after all, is to insert data values in a locale-appropriate way into a string. This dichotomy is why the spec has tortured locutions about "implementation defined types": we never say what these types are and we generally restrict discussion of them to registry.md. The only way to coerce a type is via a function (annotation). MF never knows nor cares about the types of operands or any values. Only the (locally-supplied) functions care. At the same time, we don't require implementations to remove typing information either.

aphillips commented 2 days ago

This is related to the discussion we had in the 2024-09-16 call, whose resolution we deferred until 46.1.