Clarify that standalone markup is permitted.

unicode-org / message-format-wg

Developing a standard for localizable message strings


Clarify that standalone markup is permitted. #356

Closed aphillips closed 1 year ago

aphillips commented 1 year ago

Is your feature request related to a problem? Please describe. From our Slack discussion of 2023-02-18/19/20, our documentation should be clear that we allow unpaired markup.

Describe the solution you'd like See above

Describe why your solution should shape the standard It's fundamental to parsing/validation

Additional context or examples Pasting the Slack conversation:

Agree about empty string. We currently do not allow unpaired markup (we say so explicitly). This is not consistent with HTML, as you note, but I don't know what went into that decision. +1000 on fuzzing.

Mihai 2 days ago I argued many times that markup dies not have to be paired. And that we don't need markup at all :-)

Mihai 2 days ago s / dies not / did not /

Mihai 2 days ago This is the problem with landing things where we have no agreement.

Mihai 2 days ago What went into that decision was time pressure, and the spec landed 2 days before people were leaving for the summer. It was either that, or miss the ICU release. (edited)

Mihai 2 days ago I didn't implement markup in the ICU tech preview, and I have an open issue. But this is what I tried to warn about in the last meeting. Agreed, in a somewhat inarticulate and emotional way. But I was seeing it happening again right there.

Mihai 2 days ago Maybe in spec we should mark things that are not agreed with a special tag. Otherwise "we decided", and changing them requires more effort. You are "challenging an existing decision" (and in that meeting there was also a push to make that harder)

Mihai 2 days ago And "just file an issue" is not a solution. Give it some time and it becomes "Mihai is challenging a decision" instead of "we submitted with disagreement". All that we have as a recording is people remembering.

Mihai 2 days ago (meaning: not much)

Mihai 2 days ago Sorry, I should not bring these "heavy topics" over a (long) weekend, over chat, and at unfriendly hours for people in other time zones. I will stop, and choose another avenue.

Eemeli 21 hours ago We currently do not allow unpaired markup (we say so explicitly). This is not consistent with HTML, as you note, but I don't know what went into that decision. Where do we say so explicitly? Doesn't this section say the opposite, as it states that "[markup elements] do not require well-formedness"?

Staś Małolepszy 10 hours ago Yeah, my impression was also that unpaired markup is fine. My comment above about the unpaired {-m} being invalid was wrong, sorry about that.

Addison Phillips 4 minutes ago Non-well-formedness means lots of things. For example, it can mean cross-over tags: some {+a}{+b}mixed tags{-a}{-b}

Addison Phillips 3 minutes ago We don't show standalone tags anywhere in the examples nor do we explain that they are permitted. I'm pretty sure I saw paired-only somewhere recently in our docs, hence my comment. If we mean to allow standalone tags, we should say so explicitly.
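To make the distinction between "unpaired" and "cross-over" concrete, here is a minimal Python sketch that scans a message for the {+name} / {-name} placeholders used in this thread and reports which tags are unpaired and which cross over. It is not taken from the spec: the function name classify_markup and the regular expression are assumptions for illustration, and options, escapes, and nested braces are ignored.

```python
import re

# Matches the {+name ...} / {-name ...} forms used in this thread.
# Assumption: options after the name are skipped, not parsed.
TAG = re.compile(r"\{([+-])([A-Za-z][\w.:-]*)[^{}]*\}")


def classify_markup(message: str) -> dict:
    """Report unpaired and crossing markup tags in a message (illustrative only)."""
    stack = []           # names of tags that are currently open
    unpaired_close = []  # {-x} seen with no earlier unmatched {+x}
    crossing = []        # {-x} that closes a tag which is not the innermost one
    for sign, name in TAG.findall(message):
        if sign == "+":
            stack.append(name)
        elif name in stack:
            if stack[-1] != name:
                crossing.append(name)
            # Remove the most recently opened tag with this name.
            idx = len(stack) - 1 - stack[::-1].index(name)
            del stack[idx]
        else:
            unpaired_close.append(name)
    return {
        "unpaired_open": stack,
        "unpaired_close": unpaired_close,
        "crossing": crossing,
    }


crossed = classify_markup("some {+a}{+b}mixed tags{-a}{-b}")
standalone = classify_markup("an unpaired {-m} tag")
```

Under this sketch, none of the three categories is an error by itself; whether any of them should be rejected is exactly the question this issue asks the documentation to answer.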

mihnita commented 1 year ago

I think that we should not have markup at all. It is not needed, and the current spec for markup is hard to use to support more than one kind of markup. And that is needed. We had long and heated arguments about it.

All the formatting can be handled with placeholders.

Markup open issues:

And to show that it was not agreed on, see PR https://github.com/unicode-org/message-format-wg/pull/283 on Jun 21, 2022:

I will approve it to move things forward and not have these many open PRs. It does not mean I agree with this syntax, so it should not be held against me 1 month from now ("but you approved X") There are still at least 2 open issues arguing about how to best handle "markup elements"

mihnita commented 1 year ago

We currently do not allow unpaired markup (we say so explicitly).

This goes directly against the Unicode TC decision: "Should not assume well-formedness of elements"

echeran commented 1 year ago

This issue is interesting because it touches on a mix of technical and process/meta-level concerns. Although many are intertwined, so maybe we can discuss how to discuss?

Technical: I agree that we should call out standalone tags as explicitly allowed with examples. Although that brings up the question of syntax for them, and the higher-level question of what benefits we gain for entertaining special syntax for markup.

Meta: We had an issue to discuss how to represent standalone markup tags (#239), but it got closed in favor of #238, which is still open. We also had issue #269 about sigils for markup tags, which was closed after PR #283 was merged, even though it handles open/close tags only. That leaves us in this curious state where open/close markup tags have special syntax in the current spec, and I think standalone markup tags use regular placeholder syntax.

Meta/Technical: We have multiple open issues about various details of markup syntax, but we still haven't addressed the question of what special syntax for markup tags buys us in the first place, on top of just using regular placeholder syntax. I asked about this in #262, which is still open. I don't recall hearing a thorough & compelling answer so far.

We had long and heated arguments about it.

@mihnita I don't think the length or emotional content of arguments implies the importance of the conclusion, but rather, the technical merits ought to be the main determinant, and I think we have been doing well in returning to the technical points and avoiding non-technical feelings and distractions. It might be good for us to consider things in terms of tradeoffs and asking pros & cons, and maybe evaluative criteria on top of that, if arguments get repetitive.

aphillips commented 1 year ago

Let's discuss how to approach this Monday.

Some specifications have features that are "at risk" and I think markup would qualify as an "at-risk" feature in our spec. We have proposed a syntax to support it, but don't have implementations or clarity on requirements and functionality. This is a recipe for problems when we discover those requirements later...

Some approaches suggest themselves:

1. We could avoid defining markup support now, but reserve a number of sigils and tokens for future standardization that could include markup support. This would allow us to avoid having an at-risk feature in our specification while preserving the ability to add support for it later. Processors could optionally swallow such placeholders if they did not understand them or optionally warn/error on them. We would probably want to add more sigils than just + and -.
2. We could provide a "private use placeholder" syntax which implementations could use to implement markup or some other scheme (for example TTS management). Processors could optionally swallow such placeholders if they did not understand them or optionally warn/error on them.
3. We could adopt an approach like XLIFF, in which any markup is part of the prose, but we provide placeholders to signal to the translation system what to protect and provide metadata. XLIFF takes the approach that markup in content is just string data which can be protected in various ways, notably <ph> for standalone placeholders and <bpt>/<ept> for paired placeholders (where the pair need to appear in the translation and be in the same order, even if the contents between them changes and the position in the string changes).

I think (3) is interesting because one of the challenges with markup is that for Web platforms there are often two levels of it. There are HTML tags and then there are templating languages used by page generation frameworks. There are an unreasonable number of templating languages, many homegrown. If we took an XLIFF approach that might be something we could accomplish and build implementation around (since we don't actually have to build support for pluggable markup regimes). However (1) and (2) are also viable (and these options are not necessarily mutually exclusive--we could do combinations of these or all three).
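The "swallow, warn, or error" behaviour mentioned for the first two approaches could look roughly like the Python sketch below. Everything here is hypothetical: only + and - appear in the current syntax, the other sigils in the pattern are placeholders picked for the example, and strip_reserved is not an API of any existing implementation.

```python
import re
import warnings

# Assumption: '+' and '-' come from the thread; '#', '@', '^' and '&' are
# hypothetical extra reserved sigils chosen only for this example.
RESERVED = re.compile(r"\{([+\-#@^&])([^{}]*)\}")


def strip_reserved(message: str, policy: str = "swallow") -> str:
    """Drop reserved placeholders the processor does not understand.

    policy is one of "swallow" (silently drop), "warn" (drop and warn),
    or "error" (raise ValueError on the first reserved placeholder).
    """
    def handle(match):
        placeholder = match.group(0)
        if policy == "error":
            raise ValueError(f"unsupported placeholder: {placeholder}")
        if policy == "warn":
            warnings.warn(f"ignoring placeholder: {placeholder}")
        return ""  # the placeholder contributes nothing to the output

    return RESERVED.sub(handle, message)


plain = strip_reserved("Click {+a href=(mumble)}here{-a} to do something")
```

Ordinary variable placeholders such as {$name} are left untouched by this pattern, since $ is not in the reserved set.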

eemeli commented 1 year ago

To help in our deliberations, I'd like to submit the following three real-world Fluent messages to be used as examples that I believe we should be able to support in MF2. These are all from preferences.ftl, lines 92, 125 and 843:

## Extension Control Notifications
##
## These strings are used to inform the user
## about changes made by extensions to browser settings.
##
## <img data-l10n-name="icon"/> is going to be replaced by the extension icon.
##
## Variables:
##   $name (string) - Name of the extension

extension-controlling-password-saving =
  <img data-l10n-name="icon"/> <strong>{ $name }</strong> controls this setting.

## Preferences UI Search Results

search-results-help-link =
  Need help? Visit <a data-l10n-name="url">{ -brand-short-name } Support</a>

## Firefox Account - Signed out

# This message contains two links and two icon images.
#   `<img data-l10n-name="android-icon"/>` - Android logo icon
#   `<a data-l10n-name="android-link">` - Link to Android Download
#   `<img data-l10n-name="ios-icon">` - iOS logo icon
#   `<a data-l10n-name="ios-link">` - Link to iOS Download
#
# They can be moved within the sentence as needed to adapt
# to your language, but should not be changed or translated.
sync-mobile-promo =
  Download Firefox for <img data-l10n-name="android-icon"/> <a data-l10n-name="android-link">Android</a> or
  <img data-l10n-name="ios-icon"/> <a data-l10n-name="ios-link">iOS</a> to sync with your mobile device.

In the above, the <a> and <img> with data-l10n-name attributes are DOM elements for which other attributes (such as src and href) are provided elsewhere, while <strong> is inline markup that's a part of the message.

According to my understanding of how our current markup syntax ought to work, I believe this could be a representation of the above in MF2, presuming that appropriate :brand, :img, +a, and +strong are available:

extension-controlling-password-saving =
  {{:img key=icon} {+strong}{$name}{-strong} controls this setting.}

search-results-help-link =
  {Need help? Visit {+a}{(short-name) :brand} Support{-a}}

sync-mobile-promo =
  {Download Firefox for {:img key=android-icon} {+a key=android-link}Android{-a} or
   {:img key=ios-icon} {+a key=ios-link}iOS{-a} to sync with your mobile device.}

aphillips commented 1 year ago

@eemeli noted:

According to my understanding of how our current markup syntax ought to work, I believe this could be a representation of the above in MF2, presuming that appropriate :brand, :img, +a, and +strong are available:

Your description of the current "theory of markup" sounds correct to me. That presumption at the end is the nub of the problem. How does one ensure that +a and +strong are available?

Presumably, the runtime that supported your HTML markup would consume the {+mumble} and {-mumble} tags and produce the correct string output. This could take the form of code that reads the input and produces (perhaps very) different markup as output.

For example, it wouldn't be unreasonable to have style in your source:

Visit <strong style="text-alignment: left; background-color: pink">here</strong>

That requires extra processing to make into valid MF2, because the quoting and perhaps other stuff has to be replaced with our syntax, such as replacing double-quotes with our literal markers:

{Visit {+strong style=(text-alignment: left; background-color: pink)}here{-strong}}

Also, note that your examples do not include abusive use, e.g.:

{I am the starting string <strong>}
...
{and</strong> I expect to be concatenated to the starting string}

Finally, your examples are strictly in a single markup dialect. There are lots of pages assembled by what I called a "templating language" (think of stuff like mustache) where the string contains both HTML and "something else". The HTML markup handler wants to generate </>. Mustache, for example, uses {{ and }} for its template commands. It's unreasonable to expect that people will not use their templating language in strings, such as:

Click <a href={{ myUrl }}>here</a> to do something {{ descriptionValue }}

Alas, what happens when the HTML and template language clash:

Click <a href="mumble">{{# strong }}here{{/ strong }}</a> to do something

{Click {+a href=(mumble)}{+strong}here{-strong}{-a} to do something}

How does the MF processor know which handler gets the keyword strong?

I suspect we might be better off keeping MF out of the markup management business in favor of the markup protection business (using something vaguely XLIFF-like for the example):

{Click {+bpt id=a}<a href="mumble">{-bpt}{+bpt id=strong}\{\{#strong\}\}
    {-bpt}here{+ept id=strong}\{\{/strong\}\}{-ept}{+ept id=a}</a>{-ept} to do something}

(Of course that is a lot of work for the developer and fairly unattractive...)

zbraniecki commented 1 year ago

We played today with the difference between Elango's and my mental model and arrived at a reduced discrepancy:

With Markup as an AST node:

Open {+html:a}Firefox{-html:a}, which is a {+html:a}Mozilla{-html:a} {+ssml:loud}product{-ssml:loud}.

With markup as a placeholder node:

Open {:markup type="open" tag="a" ns="html"}Firefox{:markup type="close" tag="a" ns="html"}, which is a {:markup type="open" tag="a" ns="html"}Mozilla{:markup type="close" tag="a" ns="html"} {:markup type="open" tag="loud" ns="ssml"}product{:markup type="close" tag="loud" ns="ssml"}.

They're not that different!

eemeli commented 1 year ago

@aphillips: Your description of the current "theory of markup" sounds correct to me. That presumption at the end is the nub of the problem. How does one ensure that +a and +strong are available?

I would presume that we would include in the registry an enumeration of the supported markup elements, and their valid options.

Presumably, the runtime that supported your HTML markup would consume the {+mumble} and {-mumble} tags and produce the correct string output. This could take the form of code that reads the input and produces (perhaps very) different markup as output.

While that's a possibility, I do not expect that in real-world use a runtime would necessarily be reproducing some other intermediary string representation of a message including markup, but rather building the appropriate output directly. Taking my first example message as a starting point:

{{:img key=icon} {+strong}{$name}{-strong} controls this setting.}

When formatted for display, this makes much more sense to be formatted as parts, rather than a string, resulting in something like:

[
  Image(key='icon'),
  Literal(' '),
  MarkupStart('strong'),
  Literal('ExtensionName'),
  MarkupEnd('strong'),
  Literal(' controls this setting.')
]

If the target is an HTML DOM as in the original, the Image would be matched with a separately defined <img>, and something like document.createElement('strong') used to handle the MarkupStart/MarkupEnd and its body.
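As a rough illustration of consuming such a parts list, the Python sketch below turns it into an element tree using the standard library's xml.etree.ElementTree, standing in for a browser DOM. The part classes mirror the names in the list above, but their fields, the build_tree function, the wrapping span element, and the images lookup table are all assumptions made for this example.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class Literal:
    value: str


@dataclass
class MarkupStart:
    name: str


@dataclass
class MarkupEnd:
    name: str


@dataclass
class Image:
    key: str


def build_tree(parts, images):
    """Build an element tree from formatted parts.

    `images` maps an Image key to attributes defined elsewhere (hypothetical).
    A MarkupEnd that does not match the innermost open element is ignored.
    """
    root = ET.Element("span")
    stack = [root]  # currently open elements, innermost last
    last = None     # most recent child added to the innermost element

    def add_text(text):
        # ElementTree stores text either on the parent (before any child)
        # or as the "tail" of the preceding child.
        if last is None:
            stack[-1].text = (stack[-1].text or "") + text
        else:
            last.tail = (last.tail or "") + text

    for part in parts:
        if isinstance(part, Literal):
            add_text(part.value)
        elif isinstance(part, Image):
            last = ET.SubElement(stack[-1], "img", images.get(part.key, {}))
        elif isinstance(part, MarkupStart):
            stack.append(ET.SubElement(stack[-1], part.name))
            last = None
        elif isinstance(part, MarkupEnd):
            if len(stack) > 1 and stack[-1].tag == part.name:
                last = stack.pop()
    return root


parts = [
    Image("icon"),
    Literal(" "),
    MarkupStart("strong"),
    Literal("ExtensionName"),
    MarkupEnd("strong"),
    Literal(" controls this setting."),
]
tree = build_tree(parts, {"icon": {"src": "icon.png"}})
```

The point of the sketch is only that no intermediate markup string is produced or re-parsed: the consumer decides what each part means for its own target.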

That requires extra processing to make into valid MF2, because the quoting and perhaps other stuff has to be replaced with our syntax, such as replacing double-quotes with our literal markers:

That's correct, and unfortunate. My personal preference would have been for us to support XML syntax for markup, but this was not supported by the whole group.

Also, note that your examples do not include abusive use [and] are strictly in a single markup dialect.

That's right. I am not trying to claim that these are exhaustive, but they are actual messages we will want to be able to represent in MF2. It would be very useful if others could provide real-world examples from other sources which e.g. break start/end markup across messages; Mozilla's Fluent messages don't have any such.

Alas, what happens when the HTML and template language clash:
Click <a href="mumble">{{# strong }}here{{/ strong }}</a> to do something
=>
{Click {+a href=(mumble)}{+strong}here{-strong}{-a} to do something}
How does the MF processor know which handler gets the keyword strong?

If this is handled within MF2, then I would expect the function registry to provide the answer. But that only works if the target is an intermediate string representation, which would require subsequent parsing. If the target is a more final shape, then MF2 only needs to provide a parts representation that'll be handled by some next phase that determines what to make of the markup parts.

I suspect we might be better off keeping MF out of the markup management business in favor of the markup protection business (using something vaguely XLIFF-like for the example):
{Click {+bpt id=a}<a href="mumble">{-bpt}{+bpt id=strong}\{\{#strong\}\}
 {-bpt}here{+ept id=strong}\{\{/strong\}\}{-ept}{+ept id=a}</a>{-ept} to do something}
(Of course that is a lot of work for the developer and fairly unattractive...)

Yes, that would be rather unergonomic.

echeran commented 1 year ago

It would be very useful if others could provide real-world examples from other sources which e.g. break start/end markup across messages; Mozilla's Fluent messages don't have any such.

We use a tooling layer between source documents and messages that generates our messages in a format-independent way. It leverages Okapi for its many benefits in this regard (ex: support for many formats & types of markup out of the box). The handler for the format/markup determines how to create text units and segments. It performs segmentation on the text units using ICU4J BreakIterator to subdivide them into segments, which correspond to individual messages.

So in the case of an HTML markup document, if it contains something like:

<span ...><i>You're viewing <b>Apigee X</b> documentation. <br/> View <a href="..." title="..." ...>Apigee Edge</a> documentation.</i></span>

then we ultimately might get 4 segments that are the equivalent of MF2.0 messages in our current syntax looking like:

{{:markup tag="span" type="open" ns="html" ...}{:markup tag="i" type="open" ns="html"}You're viewing {:markup tag="b" type="open" ns="html"}Apigee X{:markup tag="b" type="close" ns="html"} documentation.}

{{:markup tag="br" type="standalone" ns="html"} View {:markup tag="a" type="open" ns="html" href="[#$3]" title="[#$4]" ...}Apigee Edge{:markup tag="a" type="close" ns="html"} documentation.{:markup tag="i" type="close" ns="html"}{:markup tag="span" type="close" ns="html"}}

{https://the.url.from.the.a.tag/}

{The title text from the a tag}

Okapi will also assign unique ids per placeholder and unique ids per placeholder pair, which I'm eliding here because: 1) brevity 2) we hide from the translator such information in the CAT tool anyways, since it doesn't concern them

So it's natural for someone new to seeing this to question whether this is all a good thing, or if it is all too complex for what it's worth, since it's splitting up the inline ... element across different messages. Some quick answers in response:

Segmentation into a sentence per message is an appropriate unit of length for translators & tools to work at
Smaller messages increase the chances for Translation Memory leveraging
Allowing Okapi to process allows extraction of translatable attributes inside elements into separate messages, and linking them (ex: a[href], a[title], img[alt])
The creation of text units and segments works regardless of markup element nesting depth, and the HTML handler configuration indicates whether an element should be treated as an inline/text-level element (placeholder) or a block level (new message), etc.
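To show how sentence-level segmentation naturally produces unpaired markup, here is a small Python sketch over the example above. It is a deliberately naive stand-in: the token tuples, split_segments, and unbalanced are invented for illustration, and splitting on sentence-final punctuation is an assumption that does not reflect how Okapi or ICU4J BreakIterator actually segment text.

```python
# Inline content from the example, flattened into (kind, value) tokens.
TOKENS = [
    ("open", "span"), ("open", "i"), ("text", "You're viewing "),
    ("open", "b"), ("text", "Apigee X"), ("close", "b"),
    ("text", " documentation. "), ("standalone", "br"), ("text", " View "),
    ("open", "a"), ("text", "Apigee Edge"), ("close", "a"),
    ("text", " documentation."), ("close", "i"), ("close", "span"),
]


def split_segments(tokens):
    """Start a new segment after any text token that ends a sentence."""
    segments, current = [], []
    for kind, value in tokens:
        current.append((kind, value))
        if kind == "text" and value.rstrip().endswith((".", "!", "?")):
            segments.append(current)
            current = []
    if current:
        # Trailing tags with no text belong to the previous segment.
        if segments and all(kind != "text" for kind, _ in current):
            segments[-1].extend(current)
        else:
            segments.append(current)
    return segments


def unbalanced(segment):
    """Return (opens left unclosed, closes with no matching open) for a segment."""
    open_stack, stray_closes = [], []
    for kind, value in segment:
        if kind == "open":
            open_stack.append(value)
        elif kind == "close":
            if open_stack and open_stack[-1] == value:
                open_stack.pop()
            else:
                stray_closes.append(value)
    return open_stack, stray_closes


segments = split_segments(TOKENS)
report = [unbalanced(segment) for segment in segments]
```

In this sketch the first segment ends with span and i still open, and the second segment carries their closing tags without the matching opens, which is the situation a "no unpaired markup" rule would forbid.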

I suspect we might be better off keeping MF out of the markup management business in favor of the markup protection business (using something vaguely XLIFF-like for the example):
{Click {+bpt id=a}<a href="mumble">{-bpt}{+bpt id=strong}\{\{#strong\}\}
 {-bpt}here{+ept id=strong}\{\{/strong\}\}{-ept}{+ept id=a}</a>{-ept} to do something}
(Of course that is a lot of work for the developer and fairly unattractive...)
Yes, that would be rather unergonomic.

FWIW, one of my personal takeaways from chatting with @zbraniecki the other day is a better personal appreciation of the subtle problems—bad practices for localization?—that can occur when users author markup like HTML directly into a MF message. I can also see users wanting to insert things like <a href="..." title="..."> and <img src="..." alt="..." /> directly in their messages, and translatable attributes of those elements will probably go overlooked... at least without tooling like Okapi or Fluent. Another example is what if a user authors a message that embeds a block-level HTML element in the message as if it were an inline-level element, not realizing the impedance mismatch? As a hypothetical example, imagine that a user is authoring the equivalent of this as a MF2.0 message:

Please reach out to your local shipping provider from the following list:
<ul>
  <li><a href="..." title="...">DHL</a></li>
  <li><a href="..." title="...">Canada Post</a></li>
</ul>

Even if we set aside how to represent this in MF2.0 syntax, the additional problem here is that you really have a block element (the <ul>) that contains what ought to be separate messages in their own right. What you have is more like an HTML fragment that needs to be broken down further.

So what I'm saying is that I think messages that use markup and are authored by hand (ex: by developers) will be problematic and error-prone, unless there is tooling to validate (ex: linters) or to assist (ex: Okapi, Fluent). It feels analogous to the problem with current ICU MessageFormat allowing nested messages (which is tantamount to the rookie l10n mistake of string concatenation) in that these are subtle problems likely to reoccur. If we want to address the problem by strongly encouraging some type of tooling to assist and sidestep problems from handwritten markup, then I think the relative importance of human ergonomics or choosing <element>/{+element} vs. {:markup tag="element"} decreases. This is the reason for me asking the higher level question, "How much benefit do we gain by adding special syntax for markup at all?", although that question doesn't currently have an issue.

mihnita commented 1 year ago

{:markup type="open" tag="a" ns="html"}Firefox{:markup type="close" tag="a" ns="html"}, which is a {:markup type="open" tag="a" ns="html"}Mozilla{:markup type="close" tag="a" ns="html"} {:markup type="open" tag="loud" ns="ssml"}product{:markup type="close" tag="loud" ns="ssml"}.

That is not how I see it :-)

This is how I have represented markup-as-placeholders before:

{+a :html type="open"}Firefox{-a :html}, which is a {+a :html}Mozilla{-a :html} {+loud :ssml}product{-loud :ssml}.

With :markup as a function name it means that everything is handled by the same function, and it is hard to modify if that is part of the library.

I will call "the engine" the implementation of the MF2 proper (with parsing, and runtime rendering). Some functions (the ones in the "official registry") might be provided by the engine. And the engine can't be modified by the developer using it. That is the case for an implementation of ICU which becomes part of iOS / Windows / Android. And it is true for an MF2 implementation provided by the browser.

With :markup as a custom function it means the dev must provide implementations for ALL supported kinds of markup. And if :markup is in the standard registry and I want to add my own tag, then I have to register :mihaimarkup, which can't be markup, must be placeholder based.

With them separate I can let the engine handle :html, and I can register my own function for :mihaimarkup, or :ssml.

So the proposal is simply markup = placeholder + open/close attribute.

If you go back to the proposal last year, that is exactly what it says.

I would argue that the two proposals can carry 100% the same information, but the placeholder + function has more flexibility.
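A minimal Python sketch of the "one function per namespace" idea follows. None of this is an existing MF2 API: MarkupRegistry, the handler signature (tag, kind, options), and the mapping of the loud tag onto an SSML prosody element are assumptions made purely to show how an engine-provided :html handler and a user-registered :ssml handler could coexist.

```python
from html import escape


class MarkupRegistry:
    """Maps a namespace such as "html" or "ssml" to one handler function."""

    def __init__(self):
        self._handlers = {}

    def register(self, namespace, handler):
        self._handlers[namespace] = handler

    def render(self, namespace, tag, kind, options=None):
        handler = self._handlers.get(namespace)
        if handler is None:
            return ""  # unknown namespace: swallow (could warn or raise instead)
        return handler(tag, kind, options or {})


def html_handler(tag, kind, options):
    """One handler covers every HTML tag; kind is open, close or standalone."""
    if kind == "close":
        return f"</{tag}>"
    attrs = "".join(f' {k}="{escape(v, quote=True)}"' for k, v in options.items())
    return f"<{tag}{attrs}/>" if kind == "standalone" else f"<{tag}{attrs}>"


def ssml_handler(tag, kind, options):
    # Hypothetical mapping of the thread's "loud" tag onto SSML prosody.
    if tag == "loud":
        return '<prosody volume="loud">' if kind == "open" else "</prosody>"
    return ""


registry = MarkupRegistry()
registry.register("html", html_handler)  # could be supplied by the engine
registry.register("ssml", ssml_handler)  # registered by the developer

link_open = registry.render("html", "a", "open", {"href": "mumble"})
loud_open = registry.render("ssml", "loud", "open")
```

With this shape, adding a new kind of markup means registering one more handler rather than touching the handler for HTML, which is the flexibility being argued for here.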

A syntax like this is a little less verbose:

{{:img key=icon} {+strong}{$name}{-strong} controls this setting.}

But:

1. assumes html is default, and does not allow for simultaneous markdown to coexist

   For example html:sub and ssml:sub

2. Why is a standalone markdown a placeholder, but open/close are completely different concepts (MarkupStart/End)?

   If placeholders are good enough for standalone markup, they are also fine for open / close markup (with that extra info about open / close)

mihnita commented 1 year ago

If you think about ...{+sub :html class=foo}something{-sub:html}...{+sub :ssml}something more{-sub :ssml}..., it is very-very close to the regular xml / html with namespace. Only the order namespace - tag name is different.

Same flexibility (anyone can add a new namespace without messing with the deep down implementation).

And if we make the spaces optional (as was proposed today), the syntax is even closer to html.

mihnita commented 1 year ago

I know that for my argument 1. (no other markdown) Eemeli proposed something like {+ssml.sub} It would work, but that means one has to register ~100 functions to handle html instead of one.

And it still leaves us with the inconsistency between open / close markup and standalone as placeholder. Confusing developers ("Ok, so to support html I have to register a html.b for open / close, but :html for standalone?")

mihnita commented 1 year ago

Note: I am not arguing that the proposed markup with ssml.sub would not work.

But it is not needed (does not bring any benefit that I can see), introduces inconsistencies, and adds unnecessary complexities.

And we benefit from all the extras that placeholders offer us: If markup tags are placeholders, we can use the existing local variable mechanism:

let $button = {$a :html class=buttonStyle}

And then we can reuse it in several message variants (plural / gender / etc)

Note: in my comments I used markup (with lowercase) for the general concept of markup / html tags, not for the current ebnf syntax.

The current syntax is

Placeholder ::= '{' (Expression | Markup | MarkupEnd) '}'

(so technically Markup is a Placeholder)

aphillips commented 1 year ago

@echeran gave an example

Please reach out to your local shipping provider from the following list:
<ul>
  <li><a href="..." title="...">DHL</a></li>
  <li><a href="..." title="...">Canada Post</a></li>
</ul>

It would be unsurprising to me to see this being filled in using a templating language rather than static content:

   <c:forEach items=arrayOfShippingProviders var=arr>
        <li><a href="<%= arr.href %>" title="<%= arr.title%>"><%= arr.display %></a></li>
   </c:forEach>

Admittedly, putting the static string and the <ul> tag around this into a single message is poor I18N.

I think the allergy I'm developing to defining markup (no matter what we call it or whether it is using placeholders a la @mihnita above) is this: all of the markup support syntaxes here depend on a runtime processor and obligate the user to learn Yet Another Syntax to place markup into a message. They cannot just write their markup (HTML, SSML, or some templating language) directly. They must learn (or create!) a way for it to be expressed as some form of placeholder.

Any tooling that they use to write code using their templating language will not work to e.g. autocomplete the syntax. There is a possibility that the runtime of MF will generate invalid runtime-error-generating goo that will be difficult to debug.

Hence, my tendency to not want to get involved directly. Instead, provide a way to help translators do the right thing.

I'd even not require it. If you want to put an anchor tag in a string, well... that's probably not a great idea, but just type it into the string:

{This is a <strong>valid</strong> pattern that links 
    location to <a href={$location} title="here">here</a>}

And the generic solution would be to provide hints to translation tools about placeholders. Here I'm using XLIFF tags in the +/- placeholders:

{This is a {+bpt}<strong>{-bpt}valid{+ept}</strong>{-ept} pattern that links
    location to {+bpt}<a href={$location} title="{+sub}here{-sub}">{-bpt}here{+ept}</a>{-ept}}

Is it verbose? Yes. But I'm not required to write it. And it translates directly to translation tooling. No developer has to install a templating language registry and any templating language just works out of the box.

Is it possible to write bad messages this way? Yes. But developers don't need markup support to do that 🙈

macchiati commented 1 year ago

I suspect we might be better off keeping MF out of the markup management business in favor of the markup protection business

I was thinking about this a bit. If we get into the "protection" business, then we could be very neutral about whatever markup (or combination of markup) people want to use. Let's take the following example provided above.

Visit <strong style="text-alignment: left; background-color: pink">here</strong>

Easiest is if we just allow embedding of exactly the source, with a sigil, eg

Visit {@html <strong style="text-alignment: left; background-color: pink">}here{@html</strong>}

(The sigil doesn't need to be @, that's just for illustration.)

Now, the problem arises that we need to reduce as much as possible the need to escape characters within what we have embedded, as in the above:

Click <a href="mumble">{{# strong }}here{{/ strong }}</a> to do something

One possibility is that for such items we don't have a simple } terminator, but rather something that is very unlikely. (For illustration here, {@html@}, etc. Maybe even (horrors!) non-ASCII characters (but no emoji, please)). So

Visit {@html <strong style="text-alignment: left; background-color: pink">@}here{@html</strong>@}

We could have a registration of markup introducers like @html. We would hand over to the appropriate markup class the responsibility for determining which sequences of markup were disallowed. Thus if a translator decided to translate the above message as the following (out of order), a markup function would detect that, and the translation software could raise an alert.

Bsueche {@html</strong>@}daa{@html <strong style="text-alignment: left; background-color: pink">@}

That way we don't have to get into the business of determining which markup languages have nested structure, which markup is stand-alone vs initiating vs terminating, and so on.
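A translation-side check along these lines might be sketched in Python as follows. The {@name ... @} delimiters follow the illustration in this comment; the function names, the result strings, and the rule that any reordering is flagged are assumptions, since a real markup class would decide for itself which sequences are disallowed.

```python
import re

# Protected span: {@<introducer> <raw markup>@}, using the illustrative
# delimiters from this comment. The raw markup is never interpreted here.
PROTECTED = re.compile(r"\{@(\w+)(.*?)@\}", re.DOTALL)


def protected_spans(message):
    """Return (introducer, raw markup) pairs in the order they appear."""
    return [(m.group(1), m.group(2).strip()) for m in PROTECTED.finditer(message)]


def check_translation(source, target):
    """Compare protected spans of a source message and its translation.

    Simplification: any change in order is reported; a real handler for a
    given markup language would decide which orders are actually invalid.
    """
    src, tgt = protected_spans(source), protected_spans(target)
    if sorted(src) != sorted(tgt):
        return "missing-or-altered"
    if src != tgt:
        return "reordered"
    return "ok"


source = "Visit {@html <strong>@}here{@html</strong>@}"
swapped = "Bsueche {@html</strong>@}daa{@html <strong>@}"
result = check_translation(source, swapped)
```

Because the check only compares opaque spans, it stays neutral about which markup language is inside them, which matches the "protection" framing.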

As Mihai said, using 'let' can make messages far more readable, both because it can provide semantics that are more comprehensible to translators, and makes variant messages much more compact (and obviously parallel).


let @LINKSTART = {@html <strong style="text-alignment: left; background-color: pink">@}
let @LINKEND = {@html</strong>@}

Visit {@LINKSTART@}here{@LINKEND@}

aphillips commented 1 year ago

I was thinking about this a bit. If we get into the "protection" business, then we could be very neutral about whatever markup (or combination of markup) people want to use.

This is exactly my point.

eemeli commented 1 year ago

Looking at all the examples posted here and previously, it seems like they all are some flavour of XML or SGML, or ultimately represent XML/SGML content (i.e. markdown or templating languages). And the markup examples we're playing with either represent XML/SGML elements, or wrap them.

Why don't we use XML or SGML for markup? Like, for real. With relaxations for not requiring matching start/end pairs and allowing $var as values.

In previous conversations, the prime reason against this has been the subsequent need to quote < in text. But looking at the examples here, I'm starting to think that allowing bare XML to pass through as "just text" would be a mistake. If our markup is too onerous, that's probably what many developers would end up doing & then adding a post-processing parser.

I think this is pretty close to what @aphillips is suggesting, except that it doesn't require the {+bpt} and other tags anywhere.

macchiati commented 1 year ago

I disagree; we shouldn't have to quote < in literal text.

Now, if we want solely protection, and leave everything else up to the implementation, then the simplest approach would be to allow {<.*>}.

We would still want to be able to have that in let declarations, to provide for sharing among variants and for documentation.
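
As a rough illustration of the "protection only" idea, here is a Python sketch that splits a message into plain text and protected spans. The `split_protected` name is made up for this example. Note that it uses a non-greedy `.*?` rather than the literal `.*` written above, which is an assumption on my part about the intent: a greedy match would swallow everything between the first and last protected span.

```python
import re

# Non-greedy so that two protected spans in one message stay separate;
# a greedy ".*" would swallow the text between them.
PROTECTED = re.compile(r"\{(<.*?>)\}")

def split_protected(message: str) -> list[tuple[str, str]]:
    """Return ("text", ...) and ("protected", ...) segments in message order."""
    segments = []
    pos = 0
    for match in PROTECTED.finditer(message):
        if match.start() > pos:
            segments.append(("text", message[pos:match.start()]))
        segments.append(("protected", match.group(1)))
        pos = match.end()
    if pos < len(message):
        segments.append(("text", message[pos:]))
    return segments
```

A bare `<` in literal text, as in `1 < 2`, is left alone because it is not wrapped in braces.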


aphillips commented 1 year ago

TL;DR: what is the requirement in MF for markup support? That would inform us better about what is needed. The only part of our goals that addresses this is goal 4:

Represent structured data alongside translations, such as markup, comments, and metadata.

But we are silent about this elsewhere. What do we want to do/need to do to meet this goal?

@eemeli noted:

Looking at all the examples posted here and previously, it seems like they all are some flavour of XML or SGML, or ultimately represent XML/SGML content (i.e. markdown or templating languages). And the markup examples we're playing with either represent XML/SGML elements, or wrap them.

A lot of us on this project work in the Web space, so it should be no surprise that most of our examples start with HTML or with XML dialects. But there are still plenty of application frameworks that are not like this. Some frameworks, like JSTL or ASPX, look like angle-bracketty tags, but not all do.

I have worked on a lot of frameworks and languages that render to non-HTML contexts or which didn't use angle-bracketty markup.

In previous conversations, the prime reason against this has been the subsequent need to quote < in text. But looking at the examples here, I'm starting to think that allowing bare XML to pass through as "just text" would be a mistake. If our markup is too onerous, that's probably what many developers would end up doing & then adding a post-processing parser.

Maybe the idea is that MF not understand markup because it is not the job of MF to handle markup. It is the job of MF to place stuff into pattern strings at runtime and to permit translators/translation systems to provide appropriately localized versions of those patterns. MF doesn't need to understand markup to do this. The resource format layer might provide support for markup protection and management (to assist translators) but that would perhaps be out of scope for MF?

I notice that this is similar to the reason we don't have character escapes in MF.

At the moment I don't think there are any cases where standard placeholders need to generate markup. A function, of course, could generate output that happens to be (for example) a set of bold tags around some text. But that's its business, not the business of MF to know what the output character sequence "means".

Mark's suggestion about using let to simplify markup would still work, including maybe the ability to place values into tags (such as to support bidi):

let $strong = {|<strong style="text-alignment: left; background-color: pink" dir=|
                    {$here.dir}| lang=|{$here.lang}|>|}
let $_strong = {|</strong>|}
{Visit {$strong}here{$_strong}}

mihnita commented 1 year ago

Looking at all the examples posted here and previously, it seems like they all are some flavour of XML or SGML, or ultimately represent XML/SGML content

That is because we only talk about the examples posted here.

Issue #262

... smart enough and produce context-dependent results produce AttributedString in iOS, Spanned string in Android, and plain text in a console app. ... So html / tts / markdown / ansi_escapes are functions, not "bold" and "italic" and "link"

Issue #272

Some existing frameworks that don't use HTML for formatting, but still format things in UI:

Android : Spanned, which is used for formatting spans, but also for TTS and locale annotations, or custom spans.

iOS: AnnotatedString

Windows components usually build on top of RichTextBox

Issue #241

I would rather see this as literals + custom functions. So that from the same string one can generate html, spans (Android), AttributedString (iOS), ANSI escapes (command line)

In the EM proposal document from last year we mention

HTML, Markdown, MS Word and Powerpoint, OpenOffice, InDesign, FrameMaker, templating languages. the open-close concept goes beyond HTML, and XLIFF uses it for anything that implies structure or formatting (Markdown, Word documents, InDesign, etc.) Or something like Java AWT can create a button for “Login”, with everything wrapped in a FlowLayout.

mihnita commented 1 year ago

That way we don't have to get into the business of determining which markup languages have nested structure, which markup is stand-alone vs initiating vs terminating, and so on.

Unfortunately there are some areas where it is beneficial for localization tools to know what a tag represents, so only wrapping something like a "black-box" does not work well.

Many CAT tools have a concept of open-close tags, and they will warn about or prevent changes to the order (close before open).

The other (bigger?) problem is that some attributes in the tags need localization. Some examples:

- a: [title, accesskey, download]
- img: [title, alt]
- button: [accesskey, value]
- input: [alt, value, accesskey, title, placeholder]

I can provide a complete list.

An example from W3C (https://www.w3.org/WAI/WCAG21/Techniques/html/H33.html):

<a href="http://example.com/WORLD/africa/kenya.elephants.ap/index.html" 
   title="Read more about failed elephant evacuation">
   Evacuation Crumbles Under Jumbo load
</a>

I can also imagine exposing attributes like dir which are not "translatable", but we might allow translators to force them to ltr, rtl or auto (an enum).
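
To show what a tool would have to do with markup it cannot treat as a black box, here is a small Python sketch that pulls the localizable attributes out of a fragment of HTML. The attribute table is just the handful of examples listed above (not the complete list), and `translatable_attributes` is a name invented for this sketch. It relies only on the standard library `html.parser` module.

```python
from html.parser import HTMLParser

# Attribute lists taken from the examples in this comment; not exhaustive.
TRANSLATABLE = {
    "a": {"title", "accesskey", "download"},
    "img": {"title", "alt"},
    "button": {"accesskey", "value"},
    "input": {"alt", "value", "accesskey", "title", "placeholder"},
}

class TranslatableAttrs(HTMLParser):
    def __init__(self):
        super().__init__()
        self.found = []  # (tag, attribute, value) triples

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from HTMLParser.
        for name, value in attrs:
            if name in TRANSLATABLE.get(tag, ()) and value:
                self.found.append((tag, name, value))

def translatable_attributes(markup: str):
    parser = TranslatableAttrs()
    parser.feed(markup)
    parser.close()
    return parser.found
```

For the W3C example above this would surface the `title` text on the link while leaving `href` alone.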

Visit {@html @}here{@html@}

We could have a registration of markup introducers like @html. We would hand over to the appropriate markup class the responsibility for determining which sequences of markup were disallowed. Thus if a translator decided to translate the above message as the following (out of order), a markup function would detect that, and the translation software could raise an alert.

Once we get to register markup introducers, they become very much like the functions in placeholders. And if we admit that tooling can benefit from looking at the attributes (or must be able to), or even changing them, the attributes become very much like the parameters in placeholders.

Visit {@html <strong style="text-alignment: left; background-color: pink">@}here{@html</strong>@}

is very much like placeholders with the +/- markers for the open / close concept:

Visit {+strong :html style=(text-alignment: left; background-color:pink)}here{-strong :html}

So one can even have some attributes come from outside:

Visit {+a :html class=foo id=bar url=$asParam}here{-a :html}.

It is very much like a function with a literal:

...{(1234) :number}... or {(2023-02-27) :datetime skeleton=yMMMd}...

So I don't think there are fundamental differences between the two concepts.

In fact even the people arguing for separate markup still proposed to use placeholders for standalone elements:

...{img :html src=foo.jpg}...

So placeholders are good enough for standalone, but not for open-close?

One reason I can imagine to treat "markup" as different is because we want to differentiate between "things that end up visible on screen as text" vs "things that are invisible but affect formatting".

I don't know if this helps anyone, translators or developers. But maybe we can do that. Either at syntax level, or even data model.

But I think a lot of things from placeholders will also apply to markup elements. So in an implementation I would probably inherit :-)

I tried to show that placeholders and the proposed markup are very-very similar. To the point where even if we don't have markup at all we can implement them with custom functions.

I've seen all kinds of arguments about syntax. But I have yet to see a good argument for why we think we need them at all. How are they fundamentally different from placeholders?

eemeli commented 1 year ago

Looking at all the examples posted here and previously, it seems like they all are some flavour of XML or SGML, or ultimately represent XML/SGML content

That is because we only talk about the examples posted here.

Looking through the issues you linked to, I only find further examples of XML-ish markup. Would you have any examples of messages with non-XML-ish markup that we should support? Taking such targets deeply into account is much easier with examples. If it's hard to find them, maybe we should focus on having good support for XML content in particular?

I've seen all kinds of arguments about syntax. But I have yet to see a good argument for why we think we need them at all. How are they fundamentally different from placeholders?

I would think that this is in large part answered by yourself, here:

Many CAT tools have a concept of open-close tags, and they will warn about or prevent changes to the order (close before open).

Markup elements allow for the clear representation of localisable content that is not a flat sequence of parts. As you yourself point out, the starts and ends of markup depend on each other. They have an effect not only on the exact point in the message where they occur, but also on the content between them. By having clear syntax indicating such starts and ends, we make it easier for anyone and everyone working with a complex message to understand what shape it has.

This is also why I think our placeholder syntax may be sufficient for representing standalone elements. If it's important to indicate that a placeholder is representing a standalone element, this could easily be achieved with a common prefix:

{{:html.img key=icon} {+html.b}{$name}{-html.b} controls this setting.}

If we were to support XML directly, then it would of course make sense to support standalone elements as well:

{<html:img key="icon"/> <html:b>{$name}</html:b> controls this setting.}

aphillips commented 1 year ago

@eemeli

Would you have any examples of messages with non-XML-ish markup that we should support?

Consider:

Mustache https://mustache.github.io/mustache.5.html
Jinja https://jinja.palletsprojects.com/en/3.1.x/templates/
any number of these: https://colorlib.com/wp/top-templating-engines-for-javascript/

I can't quote, obviously, any of Amazon's peculiar formats, but some of them are similar. And there are a variety of angle-brackety formats (such as JSTL, PHP, etc.) that also exist and which might not be quite the same as HTML/XML type syntaxes.

@mihnita

Unfortunately there are some areas where it is beneficial for localization tools to know what a tag represents, so only wrapping something like a "black-box" does not work well.

Yes, totally.

However, a starting point of "the templating goo is just characters in the pattern" does no harm as long as the translation tools still segment and process the markup. And we can add hinting (so that the developer can indicate where the markers are and what their relationship is) without MF itself knowing what the markers mean at runtime.

If we were to support XML directly, then it would of course make sense to support standalone elements as well:

{<html:img key="icon"/> <html:b>{$name}</html:b> controls this setting.}

This requires the developer to change their tags and markup (and hope that the runtime will generate the correct results).

Maybe a different choice would be to allow the message to be "marked up", the way text editors can be instructed which format to use for syntax highlighting/formatting:

{{+markup html}<img key="icon"> <b>{$name}</b> controls this setting.}

And allow for multiples:

{{+markup html mustache}<img key="icon" href="\{\{someMustacheVal\}\}"> <b>{$name}</b> controls this setting.}

MF doesn't "understand" the markup in the formatter runtime, but translation systems know what to expect. Local markup handlers can be installed for proprietary goo.
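
A tiny Python sketch of how a translation tool might read such a hint, purely as an illustration. The `{+markup ...}` prefix is only the syntax floated in this comment, and `read_markup_hint` is a hypothetical helper; the formatter runtime would not need any of this.

```python
import re

# Matches the outer "{" followed by a "{+markup name name ...}" hint.
HINT = re.compile(r"^\{\{\+markup((?:\s+\w+)+)\}")

def read_markup_hint(message: str) -> tuple[list[str], str]:
    """Split a message into its declared markup formats and the remaining pattern."""
    match = HINT.match(message)
    if not match:
        return [], message
    # Put the outer opening brace back so the remainder is still a whole pattern.
    return match.group(1).split(), "{" + message[match.end():]
```

The returned format names (`html`, `mustache`, or something proprietary) could then be used to pick whichever locally installed handler knows how to segment that markup.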

zbraniecki commented 1 year ago

However, a starting point of "the templating goo is just characters in the pattern" does no harm as long as the translation tools still segment and process the markup.

How would you imagine that working for formatters that would return markup elements? Or elements passed into the MF2 formatter as arguments to be placed in the output?

stasm commented 1 year ago

Let's recalibrate on the goals here. @aphillips has asked:

What is the requirement in MF for markup support? That would inform us better about what is needed.

I think the discussion so far has revolved around the following two:

1. At runtime, allow systems built on top of MessageFormat to decorate translations with markup constructed on the fly in correct positions of the translation.
2. In CAT tooling, allow extra protection and checks. For example for some types of markup, the closing tag must come after the opening one; for other types, elements must be nested properly, rather than in an overlapping fashion, etc.
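
The kind of check meant in the second goal can be sketched in a few lines of Python. This assumes the `{+name}` / `{-name}` placeholder syntax discussed in this thread and applies the strictest rule (close after open, proper nesting); `check_markup` is a made-up name, and a real tool would pick the rules per markup type.

```python
import re

# Matches {+name ...} and {-name ...} placeholders; options are ignored here.
MARKUP = re.compile(r"\{([+-])([\w.]+)[^{}]*\}")

def check_markup(pattern: str) -> list[str]:
    """Return a list of problems; an empty list means the markup nests properly."""
    problems = []
    open_names = []
    for match in MARKUP.finditer(pattern):
        sigil, name = match.group(1), match.group(2)
        if sigil == "+":
            open_names.append(name)
        elif name not in open_names:
            problems.append(f"close without open: {name}")
        else:
            if open_names[-1] != name:
                problems.append(f"overlapping markup: {name} closed before {open_names[-1]}")
            # Remove the most recent open with this name.
            index = len(open_names) - 1 - open_names[::-1].index(name)
            del open_names[index]
    problems.extend(f"unclosed: {name}" for name in open_names)
    return problems
```

Running it on a translation where the tags were swapped reports both the stray close and the open that never gets closed.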

For (1), there seem to be two general approaches:

1. Allow markup to pass through so that it can be consumed by a higher-level abstraction calling into MessageFormat, and parsed and constructed there. This is the blackbox approach, with potential new markup introducer syntax: {@html @}...{@html @}.

   This approach is capable of being used with many different markup syntaxes, and has the benefit of using these syntaxes verbatim, in a form that's most familiar to developers.

2. Encode markup in MF's syntax to avoid extra parsing and make MF aware of the contents of the placeholder: {+strong}...{-strong}.

   In this approach, it may also be possible to benefit from features like variable references: {+strong title=$userName}...{-strong}, although admittedly this would be currently rather limited because we don't allow patterns with interpolations as option values, (title={Hello, $userName}) nor do we allow interpolation inside literals (title=|Hello, $userName|)

   This approach is also very close to our current function call syntax, in terms of its expressiveness. As others have noted, there isn't much difference between {+strong title=|Hello|}...{-strong} and {|strong| :html.open title=|Hello|}...{|strong| :html.close}.
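
To illustrate how close the two spellings are, here is a rough Python sketch that reduces both to the same tuple. The `normalize` helper is hypothetical and only understands the two exact forms shown above, with `|...|` literal option values; it is not a parser for the real grammar.

```python
import re

# Only |literal| option values are recognized in this sketch.
OPTION = r"(\w+)=\|([^|]*)\|"

def normalize(placeholder: str):
    """Reduce a markup placeholder to (kind, name, options)."""
    body = placeholder.strip()[1:-1].strip()
    options = dict(re.findall(OPTION, body))
    # Short form: {+strong ...} or {-strong ...}
    match = re.match(r"([+-])(\w+)", body)
    if match:
        kind = "open" if match.group(1) == "+" else "close"
        return (kind, match.group(2), options)
    # Function form: {|strong| :html.open ...} or {|strong| :html.close ...}
    match = re.match(r"\|(\w+)\|\s+:html\.(open|close)", body)
    if match:
        return (match.group(2), match.group(1), options)
    raise ValueError(f"not a markup placeholder: {placeholder}")
```

Both `{+strong title=|Hello|}` and `{|strong| :html.open title=|Hello|}` come out as the same value, which is the sense in which the two are equally expressive.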

For (2), it seems that it would be enough if CAT tools were able to:

recognize certain placeholders as markup,
recognize markup placeholders as open, close, and standalone markup, and
know which rules apply to individual markup placeholders.

Note that all of these can be satisfied through a sufficiently capable function registry. For example (the schema is TBD):

<function id="html.open">
  <signature type="markup" tag="open">
    <!-- When the argument is |strong|... -->
    <input value="strong"/>
    <!-- ...only accept the following options: -->
    <param name="title" ... />
  </signature>
</function>

aphillips commented 1 year ago

Thanks @stasm.

Our syntax says that completely blackbox patterns are explicitly allowed. That is, this is a valid pattern:

{I have <strong>feelings</strong> about my <em>patterns</em> <img href="{$seeNoEvilMonkeyLocation}">}

Translators and CAT tools have to deal with the fact that this pattern appears to have HTML in it. We can't forbid pattern strings like this because otherwise people couldn't write HTML tutorials or certain math expressions (and because our syntax says this is valid).

In CAT tooling, allow extra protection and checks. For example for some types of markup, the closing tag must come after the opening one; for other types, elements must be nested properly, rather than in an overlapping fashion, etc.

We could add syntax to MF to support this, because clearly developers need to include markup-like tokens (such as HTML or other templating stuff) into strings and translators might have to deal with it. Providing hints to CAT tools and MT engines about the markup will make for more robust translations, which benefits developers (fewer bugs to chase).

I think interesting questions are:

What requirements (could) exist at runtime for the formatter to need to know about the markup?
How much does the formatter need to know about the markup?
What benefit is there to developers in using our syntax to provide that information?

I can think of cases where a developer calling formatToParts might want markup to be treated as a "part" separate from the rest of the pattern string. The formatter would need to know where that markup was and know what attributes it had. Otherwise it is just some characters in the output stream.

For example, a platform I know has a widget set. There can be specific tags that the widgets are attached to at render time. The widget code wants to read the "parts" inside the widget to get field order and specific field contents (for example).
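
A minimal Python sketch of what "markup as a part" could look like, assuming the `:`, `+`, `-` sigils discussed in this thread. The part shapes and the `to_parts` name are invented for illustration and are not the formatToParts API of any implementation.

```python
import re

# An optional sigil, then a name, then any options up to the closing brace.
PLACEHOLDER = re.compile(r"\{([+\-:]?)([^{}\s]+)[^{}]*\}")

def to_parts(pattern: str) -> list[dict]:
    """Split a pattern into text, variable, and markup parts, in order."""
    kinds = {"+": "markup-open", "-": "markup-close", ":": "standalone"}
    parts = []
    pos = 0
    for match in PLACEHOLDER.finditer(pattern):
        if match.start() > pos:
            parts.append({"type": "text", "value": pattern[pos:match.start()]})
        sigil, name = match.group(1), match.group(2)
        parts.append({"type": kinds.get(sigil, "variable"), "name": name})
        pos = match.end()
    if pos < len(pattern):
        parts.append({"type": "text", "value": pattern[pos:]})
    return parts
```

Widget code could then walk the list and read whatever sits between a `markup-open` part and its matching `markup-close` part, without re-parsing the output string.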

Note that all of these can be satisfied through a sufficiently capable function registry.

Agreed. And there can be registries used by CAT tools, by the runtime, or both... and these could be separate, e.g. the runtime knows nothing but the CAT tools have extensive knowledge of the markup. You could say that this is the state today 😀

Encode markup in MF's syntax to avoid extra parsing and make MF aware of the contents of the placeholder: {+strong}...{-strong}.

How does this avoid "extra parsing"? Extra parsing by whom? It appears to add parsing to MF's workload, no?

What does the MF runtime do with information about the contents of the placeholder?

Using {+strong} as an example, obviously it has to generate the actual output. This can have some benefits. For example, in HTML, inline elements with a dir attribute are bidi isolated, so instead of generating isolating Unicode controls, we can use markup to accomplish our bidi requirements. So this pattern:

{You have {+strong}{$currencyAmount}{-strong} remaining on your gift card}

... could render in the ar-AE locale using the currency AED as:

You have <strong dir=rtl>؜١٬٢٣٤٫٥٠ د.إ.‏</strong> remaining on your gift card

Otherwise the developer has to do the work to obtain the directionality and insert it:

{You have <strong dir={:getDirection $currencyAmount}>{$currencyAmount}</strong> remaining...}
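
For comparison, this is roughly the work the formatter could take off the developer's hands, sketched in Python. It guesses direction from the first strongly directional character, which is only one possible heuristic and an assumption of this sketch (a real formatter would more likely know the direction from the locale and the number format). `first_strong_direction` and `render_strong` are hypothetical names.

```python
import unicodedata
from html import escape

def first_strong_direction(text: str) -> str:
    """Return "rtl" or "ltr" based on the first strongly directional character."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):
            return "rtl"
        if bidi == "L":
            return "ltr"
    return "auto"  # digits and punctuation only: no strong character found

def render_strong(value: str) -> str:
    """Wrap an already formatted value in <strong> with a dir attribute."""
    return f'<strong dir="{first_strong_direction(value)}">{escape(value)}</strong>'
```

With markup expressed as `{+strong}`, the pattern author never has to write the direction lookup into the message.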

I guess my concern is that there are lots of good things we can do, but we can't force developers to use our syntax and they can mix and match. The availability of a "bad practice" doesn't make our syntax wrong.

cdaringe commented 1 year ago

Interesting discussion!

What is the requirement in MF for markup support?

@aphillips, my gut tells me that this is (maybe?) somewhat of a secondary question. I'm hunting through these issues, trying to see the roots of the markup discussion. I cordially ask, could the primary question more accurately be:

How shall MFv2 robustly allow for the enhancement of MFv2 messages?

...where enhancement could be loosely defined as "customization of the resultant message to add styling, accessibility, or other functionality within the target runtime". Markup is one mechanism to achieve that end-goal. However, markup does not serve many users in the MF userbase (re: #272).

| Supported Runtimes or Devtools | MF Text Enhancement Feature |
| --- | --- |
| HTML, XHTML, XML, SVG, markup langs, templating engines (which all serialize to an aforementioned format) | `markup` |
| iOS, Android, ...most-native app toolkits, ...most-browser-side-web-frameworks, ...most-game-dev-frameworks | 🚨 none/missing |

If we support markup, we do so knowing that it has limited utility. For huge swaths of MF interested users, it has no use at all. @mihnita has been saying the same thing more or less, this time, last year. I think it's a pretty tough argument to make that MF should implement markup, given how low utility it is in so many programming contexts. I do however, deeply desire the capability that markup offers, but I seek it via a mechanism that works for everyone.

I weakly hold the position that if we want to support a goal of enhancing translated text, we should do it once, do it well, and support everyone. Eemeli pointed to MessageValue as a standard output of MF libs that could be a "works everywhere" way for every runtime to apply markup-like-enhancements. It does not offer formatting directly like markup does, but it offers an intermediate representation that lets us achieve the same goals.

Fwiw, eemeli explicitly disagrees with making this output part of the MF spec. I respect his opinion, so that gives me pause. However, if we agree that all users should be able to apply enhancements, should we really be including a redundant feature--markup--at all? My current stance is no. We shouldn't offer markup for enhancements because it's not x-platform. We should offer some formalism that allows enhancements everywhere.

eemeli commented 1 year ago

If we support markup, we do so knowing that it has limited utility. For huge swaths of MF interested users, it has no use at all. @mihnita has been saying the same thing more or less, this time, last year. I think it's a pretty tough argument to make that MF should implement markup, given how low utility it is in so many programming contexts. I do however, deeply desire the capability that markup offers, but I seek it via a mechanism that works for everyone.

The current solution we've ended up with attempts to be sufficiently minimal to not impose an undue cost on users that have no need of markup: Effectively, formatting functions may use one of three starting sigils :, +, -, and the exact meaning of the differences between these sigils is effectively left to the implementation or the user to define.

@cdaringe, do you think that this is too much to include?

zbraniecki commented 1 year ago

@cdaringe you keep repeating a claim that markup does not work for many UI toolkits, which is a statement directly contradicting our position that markup is designed to support all UI element types for all systems including Android, iOS, React Native, etc.

I really believe that instead of investing time to double down on implications of your claim, you should document why you believe markups will not work for those toolkits. If you are correct, we should fix it. If you are wrong, the rest of your argument is misguided.

cdaringe commented 1 year ago

edit: I've collapsed my distracting comments below. Fundamentally, conversations in the working group have had ambiguity over the meaning of the term "markup", and my comment is me not understanding the context in which this conversation was happening. I kept it there, but it's just noise 😄

> @cdaringe, do you think that this is too much to include?

Hey eemeli, point taken. I interpret your point internally as "Hey, the spec here is minimal, it's open/flexible, it's really low cost of ownership, and likely empowering to many users". That's my expanded interpretation--and if i'm somewhat close to your intent--it's a fair point!

I _would_ still posit that markup support is too much to include (with again, a weakly held stance). I'm thinking and typing simultaneously--here are a few reasons that came to mind against inclusion:

1. **churn**. Inclusion invites churn from user space. In practice, for example, in teams that I work with, any time a source translation file is changed, it triggers a holistic translation process. If I added `id=foo` to `Greetings, {friend}!`, in most teams I partner with, that would trigger a translation push and pull cycle. I think it's an avoidable hazard. That is extremely subjective, I am aware, but based on experience, it would happen for many teams. Is it MF's role to concern itself with this? Maybe not, but we could certainly prevent it in favor of something better!
2. **dev-ux**. Inclusion couples runtime-specific-text-enhancements inside the translation system, vs inside of the runtime system. If I'm a swift-ui developer, I'm used to enhancing content with swift-ui tools. Sure, maybe there's some markup processor that my translation team uses that pre-processes stuff for swift-ui, but now I'm getting enhancements from _two_ sources, and the pre-processor could also be entirely avoided with a better MFv2 promoted design (e.g. MessageValue processing). We could all easily keep runtime enhancements out of our translation system and only keep enhancements within our runtime system. Capabilities are also diminished. If i needed to bind a programmatic reference inside one of my enhancements, such as converting some text into a clickable button with a callback function, how am I to represent that in the markup? Keeping enhancements in the runtime is really much more flexible than a flexible markup DSL.
3. **empathy**. Our ICU MFv1 translators do not like translating source with markup in it. Their tools _help_ them sift through this, but it creates a high noise:signal ratio in some circumstances. For instance, we have some styled email templates that are markup rich that my translation partners loathe getting patches to.
4. **universality**. if it doesn't work for everybody, pick something else. We can keep the spec tighter, and instead have a sound universal solution.

Points 1 & 4 are a bit weak, but i think 2 & 3 have some real merit. software is art 🎨. Please interpret my massive response just as exploration of the topic, not as passionate argument against 😄.

> you keep repeating a claim that markup does not work for many UI toolkits

Forgive me, I just perceived this to be clear on the surface. I'd love to see any evidence on how markup _could_ work _pragmatically_ with other toolkits. In lieu, I present an example.

Let's suppose that I own and operate the following apps:

1. an android app, using compose,
2. an android app, using vanilla android provisions
3. an ios app using swift ui
4. an ios app using vanilla ios provisions,
5. a web app using vue,
6. a web app using react,
7. a web app using vanilla-js/html

Great! Now, my company WhizBangCorp™ needs to update their terms of service in their various UXes. The "foo bar baz" warning wasn't **bold** enough. Sure. Ok, every dev team now needs to go make something bold.

Markup is the ICU MFv2 text enhancement strategy. It's not coupled to web, can it work everywhere? If we need to make something bold, surely we can use markup! Let's try it:

Old message: `foo bar baz`. Input string: `foo {+b}bar{-b} baz`.

Question: What is (likely) rendered on 6 of 7 of my UIs? Answer: the markup is _literally_ rendered, not processed.

All of the aforementioned toolkits, besides the native html version, do not _natively_ process markup. I make a logical leap here, and assume/imply that the MFv2 libs in these ecosystems offer _similar_ APIs to those that exist now for MFv1.

One could say, either:

1. "hey! you used the wrong type of markup! you need to use markup that is tailored to each platform!"
   1. Markup processing is not part of the default behavior for any of these extremely common toolkits.
   2. a markup pre-processor would need to be added in front of each call.
   3. pre-processing markup would be a subjectively poor design for all of these platforms
   4. it is a totally avoidable subsystem. a standard API output type like `MessageValue` would void the need for any markup parsing.
   5. markup processing possibly incurs non-trivial performance penalties
   6. using markup requires shimming support in for non-native DSL. e.g. in android, maybe i'd need to parse `pug` syntax and convert it to `Spannable`
   7. I cannot imagine mobile devs wanting to apply any markup processing for their formatting needs.
2. "It's not up to MFv2 to define the APIs allowing each platform to apply enhancements, including how to process any markup"
   1. Then why offer markup at all? MFv2 markup support is unambiguously allowing **runtime enhancements** into the MFv2 grammar. MFv2, in my opinion, is saying "hey, we're not a UI toolkit. buuuuut when you are compelled, feel free to embed a bunch of your runtime content in your translation content. If your runtime doesn't allow you to embed your enhancements within our DSL, you're out of luck."

Woof, that was a serious wall of content. Hopefully it clarifies my perception.

mihnita commented 1 year ago

@mihnita https://github.com/unicode-org/message-format-wg/issues/272#issuecomment-1133448343 more or less

Yes and no.

To clarify: I don't think supporting "markup" is very useful if by that we mean "real markup" (as in #, ##, *..*) and so on. Or markup-like tagging (like html).

But I think it is very important to support a way to tag sections of a string in certain ways. That can result in visible style (bold, italic, links, images) or not (marking a fragment with hints for tts).

That information should be accessible to translators, because they have to make sure the proper words are tagged. And to all translation tooling (for validation, leveraging, etc)

If that is what we call "markup" then we really need something to support it. I've seen way too many hacks trying to "fake" that support with MF1.

I would argue that these days a lot of frameworks use something like that. I have mentioned many times the Android Spanned, then the macOS AttributedString.

So I imagine that an Android implementation of MF2 would produce a Spanned object when you formatToParts, a macos implementation would produce an AttributedString, and so on.
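
As a platform-neutral sketch of that idea (in Python, so no real Android or Apple API is involved), open/close parts can be flattened into plain text plus a list of named ranges. The part shapes and the `to_spans` name are assumptions made for this example; a platform layer would then map each range onto its own span or attribute type.

```python
def to_spans(parts):
    """Flatten parts into plain text plus (name, start, end) spans."""
    text = []
    length = 0
    open_at = {}  # markup name -> stack of start offsets
    spans = []
    for part in parts:
        if part["type"] == "text":
            text.append(part["value"])
            length += len(part["value"])
        elif part["type"] == "markup-open":
            open_at.setdefault(part["name"], []).append(length)
        elif part["type"] == "markup-close":
            starts = open_at.get(part["name"])
            if starts:  # ignore a close that has no matching open
                spans.append((part["name"], starts.pop(), length))
    return "".join(text), spans

parts = [
    {"type": "text", "value": "foo "},
    {"type": "markup-open", "name": "b"},
    {"type": "text", "value": "bar"},
    {"type": "markup-close", "name": "b"},
    {"type": "text", "value": " baz"},
]
print(to_spans(parts))
```

Nothing in the result looks like HTML; the "b" is just a label that each toolkit is free to interpret as bold, a TTS hint, or a link.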

This is in general pretty well supported by localisation tools, with the "html-y" part well hidden. XLIFF has the concept of open-close tags, Trados supported the concept well before XLIFF was a thing. And the translators didn't have to see html ("markup"?). (It is configurable; see for example https://usercontent.one/wp/multifarious.filkin.com/wp-content/uploads/2012/09/61.jpg)

I've seen those primitive building blocks (open/close/standalone "tags") used to translate HTML, Markdown (the real thing), FrameMaker, MS Word / Excel / PowerPoint, Javadoc, Doxygen, PHP.

And I am quite sure it can handle items 1-7 in the previous comment.

I am open to add something for "markup" separate from placeholders, but I think:

1. It is a waste, because it would mirror very closely the placeholder functionality
2. It will need a standalone tag. It is inconsistent to say "for bold / italic / link use markup, but for horizontal lines or images use placeholders". Why is <hr> placeholder, but ... markup?
3. The concepts can overlap a lot and are very much coupled. A date formatter can produce:
   - plain text ("12/07/2023")
   - plain text ("12/07/2023") annotated with tts tag indicating that the fragment is a date
   - plain text ("12/07/2023") annotated with tts tag indicating that the reading is "December seven, one twenty twenty three"
   - plain text ("12/07/2023") but with different style and clickable, which opens a date input widget

   So when I produce a formatted text + link (or tts with alternate text), is that markup, or placeholder?

TLDR:

I think we need "something" that can represent annotated sections of text (not necessarily visible, can be tts)
Can be handled by standard placeholders, but they need concept of standalone / open / close. That was even part of the proposal to the Unicode TC more than one year ago.

That is the meaning when I say "we don't need markup". It means: not needed as a mechanism separated from placeholders.
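
To make the open / close / standalone idea concrete, here is a minimal sketch. The part shapes, the `kind` field, and the `toHtml` helper are assumptions made up for illustration; none of them come from the MF2 spec or any implementation. The point is only that bold, links, and `<hr>` can all travel as ordinary placeholders that differ by kind.

```javascript
// Hypothetical part shapes; these names are NOT from the MF2 spec.
// A message is a flat list of text and placeholders, and each placeholder has a kind.
const parts = [
  { type: "text", value: "Click " },
  { type: "placeholder", kind: "open", name: "a" },
  { type: "text", value: "here" },
  { type: "placeholder", kind: "close", name: "a" },
  { type: "text", value: " or scroll past the line " },
  { type: "placeholder", kind: "standalone", name: "hr" },
];

// One possible HTML rendering of the list. Other targets (Spanned,
// AttributedString, TTS) would consume the same parts differently.
function toHtml(list) {
  return list
    .map((p) => {
      if (p.type === "text") return p.value;
      if (p.kind === "close") return `</${p.name}>`;
      // "open" and "standalone" both emit a start tag here; the difference
      // is that no matching close is expected for a standalone placeholder.
      return `<${p.name}>`;
    })
    .join("");
}

console.log(toHtml(parts));
```

A translation tool could show the same three kinds as opaque open / close / standalone tags without exposing any HTML.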

Maybe the term "markup" is confusing?

cdaringe commented 1 year ago

i'm using markup as my interpretation of markup. i perceive this is what the others are referring to as well. zibi & eemeli are asserting that this WG supports the most abstract/open as possible markup, no particular flavor.

To clarify: I don't think supporting "markup" is very useful if by that we mean "real markup" ... very important to support a way to tag sections of a string in certain ways.

yes and yes! this is a fine concise synopsis :). markup alone doesn't get us the tagging we need for other platforms' text-enhancement needs.

eemeli commented 1 year ago

Great! Now, my company WhizBangCorp™ needs to update their terms of service in their various UXes. The "foo bar baz" warning wasn't bold enough. Sure. Ok, every dev team now needs to go make something bold.

Markup is the ICU MFv2 text enhancement strategy. It's not coupled to web, can it work everywhere? If we need to make something bold, surely we can use markup! Let's try it:

Old message: foo bar baz. Input string: foo {+b}bar{-b} baz.

Let's take this hypothetical business case as a starting point. What's your proposed alternative for fulfilling the ask of "make bar bold" in all locales, across multiple platforms? Keeping in mind of course that in some, the order might change to something like föö bäz baari.

cdaringe commented 1 year ago

A stripped down, constrained version of the status-quo MFv2 markup syntax.

MFv2 (current): foo {+bar}bar{-bar} baz
MFv2 (proposed): foo {@bar}bar{/@bar} baz
# MFv2 (proposed): föö bäz {@bar}baari{/@bar}

In terms of resolveMessage, this means creating a spec-blessed MessageMarker in place of MessageMarkup. If a message can be considered a 1-dimensional collection of ordered UI parts, a MessageMarker is a named index/position in that ordering.

In this minimal example, the markup messages and marker messages were semantically indistinguishable, so it is not very interesting.
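
The "named index/position" description can be sketched roughly as follows. The `message` shape and the `flatten` helper are hypothetical, chosen only to illustrate the idea; they are not the MessageMarker design itself, which is not specified in this thread.

```javascript
// Hypothetical shape: markers are stored as named positions into an
// ordered list of parts. Position 0 means "before the first part".
const message = {
  parts: ["foo ", "bar", " baz"],
  markers: [
    { name: "bar", position: 1 },
    { name: "/bar", position: 2 },
  ],
};

// Interleave markers with parts so that a runtime binding can walk
// one flat sequence and decide what each marker means in its own UI toolkit.
function flatten({ parts, markers }) {
  const out = [];
  for (let i = 0; i <= parts.length; i++) {
    for (const m of markers) {
      if (m.position === i) out.push({ marker: m.name });
    }
    if (i < parts.length) out.push(parts[i]);
  }
  return out;
}

console.log(flatten(message));
```

Note that a marker here carries a name and a position only, with no value or options.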

A more interesting example:

MFv1: Click <a href={url}>here</a> to continue
MFv2 (current): "{Click {+a href=$url}here{-a} to continue}"

# allowed, preferred from user-space
MFv2 (proposed): "{Click {@a}here{/@a} to continue}"
# still allowed, less preferred from user-space
MFv2 (proposed): "{Click <a href={$url}>here</a> to continue}"

Q: Where's the URL in the first MFv2 (proposed) example? A: Intentionally removed. Where markup permits value/options, a marker does not--text enhancement must happen in application code, using the runtime of interest.

Q: How do we apply enhancements/formatting/TTS/accessibility-hints in application-space runtimes? A: See here. It's wonderfully written, and is the exact solution we need in all MFv2 contexts, not just in JS. Runtime bindings may now gracefully author runtime compatible MessageValue processors. Promoting MessageValue into the MFv2 specification as a well-known format promotes MFv2 capability to more programming contexts.

Supposing that MessageValue was indeed promoted into MFv2, here's what runtime consumption could look like. I've used three runtimes (html/js, react/js, and gtk/ocaml) as three valid MFv2 consumers who need string/markup, React.Node, and GtkText.text primitives respectively to apply enhancements in their associated runtimes. In the following examples, an arbitrary marker, "a", is used to create a span-region to apply link formatting/decoration. Hidden from the below snippets are resolveMessage implementations and map-reduction of the MessageValue innards to map captured "a" MessageValue nodes into their enhanced versions. source code here

// demo: HTML target runtime
// runtime-demos/js/html/index.mjs
const url = "https://example.com";
const html = htmlBinding.fmt("en", "{Click {@a}here{/@a} to continue}", null, {
  a: (children) => [`<a href="${url}">`, ...children, `</a>`], // html: string productions are fine!
});
console.log(html); // Click <a href="https://example.com">here</a>

// demo: React target runtime
const url = "https://example.com";
const MyReactComponent = reactBindings.fmt("en", "{Click {@a}here{/@a} to continue}", null, {
  a: (children) =>
    React.createElement( // react: React.ReactElements are required
      "a",
      { href: url },
      children.map((c) => (c instanceof MessageValue ? c.toString() : c))
    ),
});
ReactDom.renderToString(MyReactComponent); // Click <a href="https://example.com">here</a>

(* demo: ocaml + GTK target runtime *)
module MF = struct
  (* (text, gtk_tags_to_decorate_partial_buffers) *)
  type formatted = (string * string list)
  let fmt locale msg data ?(enhancer_map=GtkEnhancer.empty) (buffer:GText.buffer) =
      resolve_msg locale msg data
      |> snip_snip_snip
end

let fmt_link children: formatted list =
  let fn (text, tags) = (text, List.cons "link" tags) in
  List.map fn children
let enhancer_map = GtkEnhancer.of_list [("a", fmt_link)]
let fmt_gtext = fmt ~enhancer_map "en" "{Click {@a}here{/@a} to continue}" None
buffer#create_tag ~name:"link" [`FOREGROUND "blue"] |> ignore;

fmt_gtext buffer |> ignore;
w#show ();

I added OCaml + GTK as perhaps less common toolkits in order to demonstrate how promoting MessageValue into the spec could make MFv2 very widely applicable. E.g. if it works here, it could work anywhere.

stasm commented 1 year ago

@cdaringe Thanks for taking the time to explain your thinking in detail. Two thoughts:

Given the current spec, I can imagine custom open/close functions which don't format anything but instead emit message "markers". Semantically, there isn't a difference between {Click {@a}here{/@a} to continue} from your example and {Click {+a}here{-a} to continue} which is the current MF2 syntax. I don't think we need to explicitly ban options to such functions (like the href).
IIUC, the API you suggested implicitly requires that open/close markers both be present in the same message and be properly nested. This strikes me as a very XML-oriented requirement. In fact, the WG explicitly agreed to allow both of these to be false. What would you imagine the HTML/JS example looking like if, instead of {Click {@a}here{/@a} to continue}, we had two messages: {Click {@a}he} and {re{/@a} to continue}? (Not a good i18n practice for such a message, but realistically it may happen in longer ones.)
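
One way to picture the difference: a callback that receives `children` needs both markers in one message, whereas a handler invoked once per marker does not. The following is a minimal sketch of the second style, assuming hypothetical part shapes and a hypothetical `handlers` table; it is not the API of any existing MF2 implementation.

```javascript
// Hypothetical parts for the two halves of the split message.
const first = [
  { type: "text", value: "Click " },
  { type: "open", name: "a" },
  { type: "text", value: "he" },
];
const second = [
  { type: "text", value: "re" },
  { type: "close", name: "a" },
  { type: "text", value: " to continue" },
];

const url = "https://example.com";

// Handlers are keyed by marker name and kind. Each one sees a single marker,
// never a list of children, so pairing within one message is not required.
const handlers = {
  a: { open: () => `<a href="${url}">`, close: () => "</a>" },
};

function render(parts) {
  return parts
    .map((p) => (p.type === "text" ? p.value : handlers[p.name][p.type]()))
    .join("");
}

// Each message renders on its own; only the concatenation is well-formed HTML.
console.log(render(first) + render(second));
```

The trade-off is that nothing in this style checks that an open marker is ever closed; that validation, if wanted, would have to happen where the messages are combined.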

cdaringe commented 1 year ago

hey @stasm,

I don't think we need to explicitly ban options to such functions (like the href).

perhaps not. i'm going to pause my opposition as I do a bit more research and practice using the beta impls a bit more in order to make more informed claims.

the API you suggested implicitly requires that open/close markers

...ehh it wasn't intended to do so :). the fact that I closed them was really just for ease of review/grokking the concept, not a strict requirement.

aphillips commented 1 year ago

Closing resolve-candidates per discussion in 2023-07-24 call