[FEEDBACK] The data model could be simplified for declarations

eemeli commented 4 months ago

This was originally posted as a part of #716, but is broken out here as its own issue.

While working on some Python code, I ended up needing to put together a pythonic representation of the message data model. While doing so, I encountered a few places where I could apply some simplifications to the data model with no loss of fidelity.

I think we should consider applying this change to how declarations are represented in the data model:

 interface PatternMessage {
   type: "message";
-  declarations: Declaration[];
+  declarations: Map<string, Expression>;
+  statements: UnsupportedStatement[];
   pattern: Pattern;
 }

 interface SelectMessage {
   type: "select";
-  declarations: Declaration[];
+  declarations: Map<string, Expression>;
+  statements: UnsupportedStatement[];
   selectors: Expression[];
   variants: Variant[];
 }

-type Declaration = InputDeclaration | LocalDeclaration | UnsupportedStatement;
-interface InputDeclaration { type: "input", name: string, value: VariableExpression }
-interface LocalDeclaration { type: "local", name: string, value: Expression }

As it's an error for multiple declarations to target the same variable name, and the resolution of a .input declaration doesn't actually differ from the resolution of a .local, we don't need to care about their syntactic differences or order in the data model.

We should also consider separating the representation of unsupported statements away from declarations, given that they're not necessarily at all related. Doing so would add a new restriction on future statements not being allowed to exhibit different valid behaviour based on their relative order w.r.t. .input and .local, but that doesn't seem too onerous.

catamorphism commented 4 months ago

I don't see how duplicate declaration errors can be checked without keeping this distinction in the data model. Consider:

Message 1

.input {$x: number}
{{_}}

Message 2

local $x = {$x: number}
{{_}}

Message 1 is error-free, while Message 2 has a duplicate declaration error since the .local shadows the implicit declaration of $x. If the parser (effectively) rewrites Message 1 to Message 2, I don't know how to write a separate pass that checks the data model for duplicate declaration errors.

(Perhaps there's a way to check those errors while parsing, but I would oppose requiring implementations to mix data model error checking with parsing. Checking data model errors during formatting also seems undesirable to me.)

eemeli commented 4 months ago

That's an interesting point. The change I'm proposing here would effectively make some current data model errors unrepresentable in the data model. An MF2 syntax message could still contain such errors, but constructing a data model from that syntax would not be possible.

Do we want the data model to have the same failure modes as the syntax? For example, the syntax requirement "A declaration MUST NOT bind a variable that appears as a variable anywhere within a previous declaration" is there to prevent dependency loops and make certain edge cases more readable for humans, but that restriction is (intentionally) stricter than it absolutely needs to be, and in the data model if declarations were an unordered collection, the corresponding requirement would need to be the more specific "no dependency loops", so that the declarations could be ordered in a way to satisfy the syntax requirement.

macchiati commented 4 months ago

declarations were an unordered collection

Right, that can't be done unless there is some additional information so that the correct order can be regenerated (eg dependency graph)

On Mon, Mar 11, 2024 at 12:36 PM Eemeli Aro @.***> wrote:

That's an interesting point. The change I'm proposing here would effectively make some current data model errors unrepresentable in the data model. An MF2 syntax message could still contain such errors, but constructing a data model from that syntax would not be possible.

Do we want the data model to have the same failure modes as the syntax? For example, the syntax requirement "A declaration MUST NOT bind a variable that appears as a variable anywhere within a previous declaration" is there to prevent dependency loops and make certain edge cases more readable for humans, but that restriction is (intentionally) stricter than it absolutely needs to be, and in the data model if declarations were an unordered collection, the corresponding requirement would need to be the more specific "no dependency loops", so that the declarations could be ordered in a way to satisfy the syntax requirement.

— Reply to this email directly, view it on GitHub https://github.com/unicode-org/message-format-wg/issues/718#issuecomment-1989281880, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACJLEMGTH6VQRTQJNVI4QOLYXYBTFAVCNFSM6AAAAABEOHEH4CVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSOBZGI4DCOBYGA . You are receiving this because you are subscribed to this thread.Message ID: @.***>

catamorphism commented 4 months ago

That's an interesting point. The change I'm proposing here would effectively make some current data model errors unrepresentable in the data model. An MF2 syntax message could still contain such errors, but constructing a data model from that syntax would not be possible.

If it's unrepresentable in the data model, that means either that (1) (at least some) duplicate declaration errors have to be checked at parse time, or that (2) Message 2 is no longer an error (since it parses to the same data model instance as Message 1), right? (I guess a third option is to introduce an intermediate implementation-specific data model that is the output of the parser, which is transformed to the canonical data model after error checking, but I don't see the benefit of that.)

(1) seems like a bad idea to me because keeping the parser as simple as possible, with a separate semantic checking pass, appeals to me. As for (2), I think if we take that path we should eliminate .input altogether, but the choice to include it was the result of a lengthy discussion and I don't know that starting it over is going to lead to a better conclusion.

eemeli commented 4 months ago

The additional information is encoded within the declaration expressions. At its simplest, a message like

.input {$foo}
.local $bar = {$foo}
{{...}}

could be represented as:

{
  type: 'select',
  declarations: {
    bar: { type: 'expression', arg: { type: 'variable', name: 'foo' } },
    foo: { type: 'expression', arg: { type: 'variable', name: 'foo' } }
  },
  pattern: ['...']
}

With the simplification proposed here, getting to that representation would require checking for duplicate declarations, so if that didn't happen, we can treat the data model as valid. Therefore, because the foo expression references $foo, that must be a .input; for similar reasons, the bar expression must be a .local. Also, because bar refers to foo, it must come after it, and the data model can be serialised with an equivalent representation to the original syntax.

eemeli commented 4 months ago

If it's unrepresentable in the data model, that means either that (1) (at least some) duplicate declaration errors have to be checked at parse time, or that (2) Message 2 is no longer an error (since it parses to the same data model instance as Message 1), right? (I guess a third option is to introduce an intermediate implementation-specific data model that is the output of the parser, which is transformed to the canonical data model after error checking, but I don't see the benefit of that.)

The intent here would be for something like the first or third option you propose; .local $x = {$x} should continue to be an error.

(1) seems like a bad idea to me because keeping the parser as simple as possible, with a separate semantic checking pass, appeals to me.

But the checks still need to be done sometime, right? So we're not talking about adding complexity, just moving it from one operation to another.

aphillips commented 4 months ago

A data model that cannot represent (some) invalid messages seems better than one that can reproduce with fidelity known bugs. It will help prevent, e.g. tools from doing bad things. "Hey, I can't insert this error into the data model..." is probably followed by "... what am I doing wrong?"

I do think that switching to a map and thus losing the ordering of the declarations, while appealing as a simplification, might produce churn for some users (among other things, it probably canonicalizes that .input goes ahead of .local and tools probably alphabetize/code point order sorts the declarations). I'm not sure this is a Bad Thing either?

catamorphism commented 4 months ago

I'm strongly against switching to a map and losing the order of the declarations. At least for options, it's fairly easy to alphabetize the option names and use that as the canonical representation for doing round-trip testing.

For declarations, though, a topological sort would need to be done in order to get a valid message back. Given how important the order of declarations is in the semantics, it doesn't make sense to me not to preserve that information in the data model.

catamorphism commented 4 months ago

But the checks still need to be done sometime, right? So we're not talking about adding complexity, just moving it from one operation to another.

I mean modularity more than simplicity, I guess. Parsers normally check syntax errors and nothing else.

I don't see the benefit of the implementation having to be structured like:

parser produces DataModel1 -> semantic checker consumes DataModel1 and produces DataModel2 -> formatter consumes DataModel2 (where DataModel2 is the one that's formalized in the spec)

I would see the benefit if it was possible to design the types so that DataModel2 is correct by construction, and I see your suggestions to make it impossible to represent duplicate declaration errors as a move towards being correct by construction. However, there are other data model errors that seem very hard to make unrepresentable: for example, "missing selector annotation", which requires a dataflow analysis to check (albeit a simple one).

So I don't see the benefit of making the data model "more correct by construction, but not fully correct by construction", given the cost of adding more moving parts to the implementation (and making re-serialization less straightforward).

mihnita commented 4 months ago

I am for a map.

And let me explain why.

A map does not imply changing the order of the parameters. It is an abstract data type. Often called "associative array" https://en.wikipedia.org/wiki/Associative_array

A map as an abstract data type can be implemented in many ways. Most common is some kind of tree, or hash. But one can use two arrays, one or keys and one of values. Or one array of pairs key-value (which is basically what we have now)

When I hear "this is a map" then certain "map-like behavior" comes instantly to mind:

you can't add / remove a key, or a value, they have to always be handled in pairs
you usually go from a key to a value, and that finding the key is a relatively fast operation, but going from the value it is not
the key is (usually) a way simpler type than the value
you don't have duplicated keys

So I expect a map (as an abstract data type) to have a few operation:

put(key, value)
get(key) -> value
remove(key); get(key) => fails
put(key, value1); put(key, value2); get(key) == value2

All of that without one telling me more than "this is a map"

You can also say "ordered map", and then I know it preserves the order of the insertion when iterate over keys.

And (at least some) programming languages I can choose a map type that does exactly that(LinkedHashMap in Java). For languages that don't have it, I know I have to implement it somehow.

In the early days, in the data model I proposed I used an ordered map:

export type OrderedMap<K, V> = Map<K, V>;

export interface ISelectorMessage extends IMessage { // Xliff spec need extesion. Proposal in the works.
    selectorArgs: ISelectorArg[];
    messages: OrderedMap<ISelectorVal[], ISimpleMessage>;
}

https://github.com/unicode-org/message-format-wg/blob/experiments/experiments/data_model/ts_mihai/src/imessageformat.ts#L34

TLDR: we can argue if the change is a good idea or not, but the order preservation is an issue.

mihnita commented 4 months ago

The one issue I would have with the proposal as is is that by splitting the declarations from statements (which can only be UnsupportedStatement) we introduce a problem for future compatibility.

What happens when we introduce the .foo statement in 2 years? It is not unsupported anymore. Should it move to declarations? And now the order matters? Or should it remain an UnsupportedStatement in the statements bucket forever, even if it is now supported?

It would make sense to become a declaration, and move in the declarations bucket. And preserve order.

Let's take this:

.input ...
.foo ...
.local ...
.this_is_still_unsupported ...
.mihai_unsupported ...

But an implementation that does not know about .foo will read this message, and because we separated the UnsupportedStatement from the rest it will serialize this back as:

.input ...
.local ...
.foo ...
.this_is_still_unsupported ...
.mihai_unsupported ...

Note how .foo changed position, a bad thing.

This is why I would oppose the proposal "as is", not because it is a map.

If we modify the UnsupportedStatement to have a key, then we can mix them in with the rest of the other declaration types, and put them in a map.

But that adds a restriction for the future, in that all statements we will introduce will have to have some kind of "key". That is not the case now:

interface UnsupportedStatement {
  type: "unsupported-statement";
  keyword: string;
  body?: string;
  expressions: Expression[];
}

The keyword is not a key, it is the name after the dot. So we can have more than one statement with the same keyword:

.foo ...
.foo ...

Does not work in a map.

catamorphism commented 4 months ago

A map does not imply changing the order of the parameters. It is an abstract data type.

Mihai is right -- my comment was too vague, and I meant I was opposed to using an unordered map.

However, I'm not sure if there's a way to specify in the data model spec that it is ordered, unless in the spec, we write something like:

interface OrderedMap<T, U> {
   keys: [T];
   value: Map<T, U>;
}

and use that instead of Map.

Maybe I don't know enough about TypeScript, though, and there's a more abstract way to express this.

eemeli commented 3 months ago

Maybe I don't know enough about TypeScript, though, and there's a more abstract way to express this.

The proposed TypeScript definition is using a JavaScript Map, which retains insertion order, so is an ordered map. Actually, JS Objects are also similarly ordered, though they weren't originally.

However, JSON Objects are not ordered, so using an object in the JSON Schema for representing declarations while requiring them to be ordered would add a requirement that many JSON parsers do support, but which is not standard. Which we probably don't want to do.

And so the map representation that I'm proposing here is, as mentioned in https://github.com/unicode-org/message-format-wg/issues/718#issuecomment-1989281880, an unordered map. If we're not happy with that idea, then we probably should not change the current data model representation.

catamorphism commented 3 months ago

And so the map representation that I'm proposing here is, as mentioned in #718 (comment), an unordered map. If we're not happy with that idea, then we probably should not change the current data model representation.

Thanks for the explanation. I'm not happy with the idea of using an unordered map, because it makes it hard to specify what the canonical form of a message is; without well-defined rules for canonicalizing a message, it's hard to re-serialize in a predictable way.

macchiati commented 3 months ago

order is not significant on input, but
canonical order when serializing is alphabetic by option id.

That is, I agree with having the canonical order be specified, and would have it be a simple rule across all functions.

On Mon, Mar 25, 2024 at 3:04 PM Tim Chevalier @.***> wrote:

And so the map representation that I'm proposing here is, as mentioned in #718 (comment) https://github.com/unicode-org/message-format-wg/issues/718#issuecomment-1989281880, an unordered map. If we're not happy with that idea, then we probably should not change the current data model representation.

Thanks for the explanation. I'm not happy with the idea of using an unordered map, because it makes it hard to specify what the canonical form of a message is; without well-defined rules for canonicalizing a message, it's hard to re-serialize in a predictable way.

— Reply to this email directly, view it on GitHub https://github.com/unicode-org/message-format-wg/issues/718#issuecomment-2018999305, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACJLEMG5IGTGBGL72JP3NI3Y2CNPHAVCNFSM6AAAAABEOHEH4CVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMJYHE4TSMZQGU . You are receiving this because you commented.Message ID: @.***>

aphillips commented 3 months ago

canonical order when serializing is alphabetic by option id.

Note that the value space for option names is identifier, so probably not "alphabetical". Code point order?

macchiati commented 3 months ago

That's what I meant, should have been more precise. Yes, code-point order.

On Mon, Mar 25, 2024 at 4:25 PM Addison Phillips @.***> wrote:

canonical order when serializing is alphabetic by option id.

Note that the value space for option names is identifier, so probably not "alphabetical". Code point order?

— Reply to this email directly, view it on GitHub https://github.com/unicode-org/message-format-wg/issues/718#issuecomment-2019097531, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACJLEMC5BDD4OF5CPGRJOO3Y2CXAHAVCNFSM6AAAAABEOHEH4CVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMJZGA4TONJTGE . You are receiving this because you commented.Message ID: @.***>

catamorphism commented 3 months ago

Note that we're now talking about a different issue, #716 . Canonicalizing options is relatively simple.

Canonicalizing declarations is not simple if ordering is not preserved in the data model, as dependencies have to be computed in order to re-serialize. That's what we're talking about in this issue.

eemeli commented 3 months ago

Out of interest in what "not simple" would mean in practice, I wrote up the following function, which checks that an unordered map of declarations does not contain any reference loops, and sorts them in a stable order:

function sortAndCheckDeclarations(declMap: Map<string, Expression>) {
  const res: (InputDeclaration | LocalDeclaration)[] = [];
  const dependencies = new Map<string, Set<string>>();
  function getDeepDependencies(name: string, res = new Set<string>()) {
    for (const dep of dependencies.get(name) ?? []) {
      if (!res.has(dep)) {
        res.add(dep);
        getDeepDependencies(dep, res);
      }
    }
    return res;
  }
  for (const [name, value] of declMap) {
    const argName = value.arg?.type === 'variable' ? value.arg.name : undefined;
    const type = argName === name ? 'input' : 'local';
    res.push({ type, name, value } as InputDeclaration | LocalDeclaration);
    const deps = new Set<string>();
    if (type === 'local' && argName) deps.add(argName);
    for (const opt of value.annotation?.options ?? []) {
      if (opt.value.type === 'variable') deps.add(opt.value.name);
    }
    dependencies.set(name, deps);
  }
  for (const name of dependencies.keys()) {
    const deps = getDeepDependencies(name);
    if (deps.has(name)) throw new Error(`Reference loop with ${name}`);
    dependencies.set(name, deps);
  }
  res.sort((a, b) => {
    if (dependencies.get(b.name)!.has(a.name)) return -1;
    if (dependencies.get(a.name)!.has(b.name)) return 1;
    if (a.type === 'input' && b.type === 'local') return -1;
    if (b.type === 'input' && a.type === 'local') return 1;
    return a.name < b.name ? -1 : 1;
  });
  return res;
}

unicode-org / message-format-wg

[FEEDBACK] The data model could be simplified for declarations #718