unicode-org / message-format-wg

Developing a standard for localizable message strings

Extendable inline markup #26

Closed. dchiba closed this issue 11 months ago.

dchiba commented 4 years ago

This is another requirement nominated and briefly discussed in issue #3: the ability to insert inline metadata, such as a comment, over a specific span of the message.

This is useful to clarify what a translation note is referring to and clearly communicate it to the translators.

For example (the markup with [] is for illustration of the concept only):

Let's [[environmentally conscious]go green].

@nbouvrette mentioned "extendable inline", and an inline comment could be thought of as an instance of it. There are many other possible uses, such as designating an untranslatable span, for example:

Type [[translate='no']history] on the command line.

The Internationalization Tag Set (ITS) from W3C defines various "data categories" of information that can be set on a span for automated processing of human language. In fact, comments fall under the "Localization Note" category, and whether a span is to be translated falls under the "Translate" category.

In some cases, the metadata may apply to the whole string. I assume that is covered by the bullet that says 'Messages should have more context “description” or ”metadata”'.

@grhoten commented:

I have had mixed results with this. Some translators translate the comments too, especially for first timers, and they don't realize that the final message recipient won't see them, which wastes translation time. There are other times when there is information that is best conveyed inline. Sometimes the comments get in the way of readability. I can see the pros and cons of such functionality.

@nbouvrette responded:

+1 on your comment - there are other ways to provide comments (typically called context) to linguists, which are handled correctly today by most TMSes. If we need inline context, that may be a sign the syntax is getting too complex.

I agree there are pros and cons, and the added complexity in the syntax is a negative factor. However, there are a number of ways to help meet various needs and wants using inline markup, without causing too many drawbacks.

Translators mistakenly translating comments can only happen if they work directly on this syntax. CAT/TMS tools could provide a good UI for translators that would allow them to work with little or no knowledge of this syntax.

Inline markup allows the tools to process the strings mechanically for quality results. One example of taking advantage of this is Google ARB, which uses a simple markup to keep untranslatable spans untranslated, as follows:

Hello {@<b>}World{@</b>}

This notation masks the <b> tags from the risk of being translated unexpectedly. With markup like this, it is possible to guarantee that untranslatable words or phrases are never inadvertently translated.
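To make the idea concrete (a minimal TypeScript sketch only, using a made-up masking helper rather than ARB's or any real library's API), a tool could replace the protected spans with opaque tokens before handing the string to a human translator or MT engine, and restore them afterwards:

// Sketch only: protect spans written as {@...} by swapping them for opaque
// tokens before translation, then restore them afterwards, so the protected
// content can never be translated by accident.
const PROTECTED = /\{@([^}]*)\}/g;

function mask(message: string): { masked: string; spans: string[] } {
  const spans: string[] = [];
  const masked = message.replace(PROTECTED, (_match, inner: string) => {
    spans.push(inner);
    return `\u27E6${spans.length - 1}\u27E7`; // an opaque, untranslatable token such as ⟦0⟧
  });
  return { masked, spans };
}

function unmask(masked: string, spans: string[]): string {
  return masked.replace(/\u27E6(\d+)\u27E7/g, (_m, i: string) => `{@${spans[Number(i)]}}`);
}

const { masked, spans } = mask("Hello {@<b>}World{@</b>}");
// masked === "Hello ⟦0⟧World⟦1⟧"; unmask(masked, spans) restores the original string.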

Ruby Annotation is another kind of metadata that the message author may wish to set on a specific span in the message. Inline markup would be a suitable way to set ruby text on a message.

dchiba commented 4 years ago

Renaming, originally from "Inline comments / extendable inline" to "Extendable inline markup" to better reflect the nature of this requirement.

mihnita commented 4 years ago

One of the problematic areas with translating software strings is working around placeholders. So introducing any extras just to document things only makes things more complicated.

My take:

  • I think there is very rarely a need to annotate an exact region of the message.
  • We should have a way to annotate the messages
  • We should have a way to annotate the placeholders
  • We should have a way to tag regions as non-translatable

But we should not mark an area of the message only to add a comment to it. It only introduces "code noise" in the text. If need be one can use the message description to say something like "the 'foo bar' in this message means xyz, so make sure that ABC".

dchiba commented 4 years ago

Let me quote the definition of "Terminology" data category of ITS:

The Terminology data category is used to mark terms and optionally associate them with information, such as definitions. This helps to increase consistency across different parts of the documentation. It is also helpful for translation.

It is a common desire to translate terms consistently.

Another use case of managing localization quality using inline markup would be collecting and reflecting end user feedback. In doing so, it would be essential to identify a specific word, phrase, sentence, etc. and attach metadata for the correction. How could software collect and make the fix to reflect the feedback?

ITS has data categories for quality management. Example 72 shows how a possible misspelling may be annotated.

I think inline markup would help produce SSML and quality MT output as well.

nbouvrette commented 4 years ago

I think there is very rarely a need to annotate an exact region of the message.

I agree. If there are such cases, it might be good to illustrate them, because I have not seen any example so far in my current environment where this would be required.

But there is indeed a need for context, which seems to be one of the focuses of TMSes. The "new cool feature" that a lot of TMS vendors are selling these days is "in-context translation", which would probably provide even more context than inline markup?

Another use case of managing localization quality using inline markup would be collecting and reflecting end user feedback. In doing so, it would be essential to identify a specific word, phrase, sentence, etc. and attach metadata for the correction. How could software collect and make the fix to reflect the feedback?

I think you can manage this outside the translatable assets (in the TMS) as well. Pontoon already does this if I'm not mistaken but I have also seen this in other products.

Also regarding feedback loops, would inline markup provide anything that AutoML APIs won't be able to solve today?

mimckenna commented 4 years ago

I think there is very rarely a need to annotate an exact region of the message.

I also agree, especially if the resulting message format is to gain wide acceptance with existing content management, localization management, and MT systems. Content creators are pretty good at adding clarifying statements as a separate comment, indicating usage, etc.

We use glossaries and context to preserve individual terms that should not be translated, or include them through dynamic placeholders so the translator never has a chance to translate them. e.g. (for illustration purposes only)

"cmdString": {
    "message":"Type '{historyTerm}' on the command line.",
    "description": "User types the English term 'history' to run the history command."
}

Then, from a repository of terms not to be translated: "historyTerm": "history"
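As a hedged sketch of how such a setup might be consumed at runtime (illustration only; the formatter and term repository below are made up, not any specific library's API), the untranslatable term is substituted at format time, so translators never see it:

// Illustration only: resolve placeholders from a repository of terms that
// must not be translated, at format time rather than at translation time.
const doNotTranslateTerms: Record<string, string> = { historyTerm: "history" };

function formatMessage(message: string, terms: Record<string, string>): string {
  return message.replace(/\{(\w+)\}/g, (match, name: string) => terms[name] ?? match);
}

console.log(formatMessage("Type '{historyTerm}' on the command line.", doNotTranslateTerms));
// -> Type 'history' on the command line.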

dchiba commented 4 years ago

I think you can manage this outside the translatable assets (in the TMS) as well. Pontoon already does this if I'm not mistaken but I have also seen this in other products.

TMSs support untranslatable spans, terms and other special elements in the message today. They must be using some sort of inline markup. I think I am proposing to make these features available to message authors through extendable inline markup.

By enabling message authors to compose messages using a standard inline markup, it would be easier to process the messages as defined in the original message. Theoretically, they would become portable across different systems, and it would be easier to achieve the desired round-trip mapping to and from XLIFF as well.

'{historyTerm}' in Mike's example is an instance of inline markup.

#28 would be yet another instance of inline markup, for proper rendering of a bidirectional message. Generally, inline markup is preferable to using direction control characters.

regarding feedback loops, would inline markup provide anything that AutoML APIs won't be able to solve today?

If I understand correctly, the AutoML API is for machine learning. It doesn't seem to be concerned with the lifecycle of translatable strings. AutoML Translation applies machine learning to translation. AutoML Natural Language provides content classification, entity extraction, and sentiment analysis. I think it is desirable for this working group to look at how user feedback may be handled through a machine translation cycle. e.g. The suggestion could be expressed using the message format syntax, dispatched to a machine translation engine, and the engine's output fed back to the original application so that the affected messages get updated.

nbouvrette commented 4 years ago

I think I am proposing to make these features available to message authors through extendable inline markup.

I think there are already several markup schemes and file types that can support context today. For example, XML:

<entry key="contact-us-submit-button" context="This is the text used on the submit button in the contact-us form.">Submit</entry>

(HTML has the same type of context support in most TMSes)

You can even add context in .properties files:

# This is the text used on the submit button in the contact-us form.
contact-us-submit-button = Submit

I'm not sure I see a lot of benefits by embedding this into the syntax itself mainly because:

1) Context is one of the core features of commercial TMSes/CAT tools, and I believe there are already good ways to provide this today (outside syntaxes)
2) The main benefit I see from using syntaxes like MessageFormat is to solve linguistic challenges (e.g. plural) - and we still have plenty of big problems (e.g. inflections) in this space which have yet to be solved

I might be biased toward focusing on solving the linguistic problems, but I see the context part of the translation process more as a responsibility of the TMS.

The suggestion could be expressed using the message format syntax, dispatched to a machine translation engine, and the engine's output fed back to the original application so that the affected messages get updated.

I'm really not sure I can picture how syntax would help in the feedback process. Typically (from my experience), when you expose translated strings to customers, you might have the following ways to gather feedback:

1) Proactive Audits
2) Customer feedback

This feedback typically gets surfaced back to the localization team to fix issues, which can then be tracked as part of the internal process.

All these steps are often manual, except if you use Machine Translation. In that case, you can update your models with fixes and expect better output. This would be done by providing feedback to AutoML APIs for example.

dchiba commented 4 years ago

Context is one of the core features of commercial TMSes/CAT tools, and I believe there are already good ways to provide this today (outside syntaxes)

External syntax is not portable; when a message is composed using the conventions of a specific TMS or CAT tool, the messages are strongly tied to it and cannot easily be taken to another environment. It is desirable for a message author's intent, expressed in the message using a standard syntax, to remain valid across different environments.

A message may contain a word or phrase that requires special attention from machines as well as humans. #32, SSML support, would be another good example. It would be desirable to enable setting SSML elements such as say-as, phoneme, emphasis and break directly in a message. If the text contains a span in a different language, it may be important to give the language information to the speech synthesizer or the text renderer.

The main benefit I see from using syntaxes like MessageFormat is to solve linguistic challenges (e.g. plural) - and we still have plenty of big problems (e.g. inflections) in this space which have yet to be solved

I quite agree with this. I am proposing to include a generic extension mechanism which would allow features that require inline markup to be added whenever this group finds it appropriate to work on them. I agree they should be worked on later, after the big and urgent problems have been resolved. They should not be totally ignored at this stage, because otherwise it would become harder to come up with a new solution and adopt it later. It may be common to run the feedback process manually today, but I think that should and will change over time.

A UI text label web component could provide a menu for an end user to suggest fixing a typo and the feedback could be submitted to the translation process. This component may switch to a special mode for editing the text, and create a message that contains the original text and suggested fix for the typo (or possibly an incorrectly pronounced word or phrase) clearly marked using inline markup.

...except if you use Machine Translation. In that case, you can update your models with fixes and expect better output. This would be done by providing feedback to AutoML APIs for example.

I am thinking about running the translation process automatically, as well as using MT technology for translating the content. If there were a standard way to describe what's wrong in the text and how it should be corrected, wouldn't it be easier to apply the suggestion to different MT engines so the model can be updated as intended?

mihnita commented 4 years ago

In my experience, any kind of tagging inside the message hinders more than it helps. ITS didn't get much traction. And where it did, it was mostly limited to tagging what is localizable and what is not.

The problems with annotations for terminology:

  • Slowing down the developers. They often don't even know which English term was decided by in-country market research to be translated a certain way
  • Most CAT tools have integrated glossaries. So, as part of the project, you have a glossary saying "in language X you translate 'ABC' as 'cde'". The CAT tool automatically detects these terms in the source, tags them, and shows the proper translation. And you only need one glossary per product / company. You don't want the developer to have to write this: "You can also <term>unsubscribe</term>, <term>cancel</term> or <term>upgrade</term> your <term>plan</term> at any time" (with the extra info for each term), and to do it every single time and in every single message where these terms show up.
  • It is often hindering translation. The way languages work can be quite different. It is often the case that if you have already used a certain term, you can just refer to it (or leave it implied) in the second part of the message. So forcing translators to keep some kind of tagging (and the word inside it) because "machine learning might use it" results in translations that feel unnatural
  • In many languages the translation is not 1:1. You have inflections and whatnot, so it's not that useful for machine translation (in English you can tag "customer" in constructs like "the customer" / "a customer" / "of the customer" / "to the customer", etc.). Translations will use different (inflected) forms of "customer". And because the "prefixes" are left out of the tagging in English, the machine "learns" that "customer" can be translated in 8 different ways, but with no context for when to use which. Actually, ML works better on longer text, where it can "infer" context.
  • Messes up translation memory leveraging (a tagged message will not leverage against an untagged one unless there is some special "smartness"; and since this is a new "feature", most systems will not have that).
  • It slows down the translators. And for professional translators, time is money.

mihnita commented 4 years ago

On the other hand: totally yes for a way to tag non-localizable sections of the text. And OK if we can come up with a way to tag terminology outside the message itself.

nbouvrette commented 4 years ago

On the other hand: totally yes for a way to tag non-localizable sections of the text. And OK if we can come up with a way to tag terminology outside the message itself.

+1 - I think this ties in nicely to 2 new threads I just created:

I think until these points are clarified, it will be hard to imagine where such a feature would fit. But I do agree that, to me, this seems to be the type of feature that would live outside the syntax (just like file formats).

dchiba commented 4 years ago

Slowing down the developers. It slows down the translators.

I am not proposing to have developers/message authors/translators insert or interpret the inline markup, so this does not slow them down unless they really do the markup manually (they shouldn't).

As we would agree, terminology management is a common requirement. I am simply proposing a standard way to mark up the terms in a message.

A CAT tool could mark up a term for them, and with the term marked up, various components would be able to process it easily. For example, a web component could use it to highlight the term.

Let's say an original message contained a term: You may cancel at any time.

An inline markup may be added by either a machine or a human (the notation is made up for illustration purposes only): You may {term,term_id_here{cancel}} at any time.

Then a web component may select the term using a class selector (notation is HTML): You may <div class="term_id_here">cancel</div> at any time.

I think this would help enable a tight integration of a web application and CAT tools.
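A hedged TypeScript sketch of that step (the {term,ID{...}} notation is the made-up illustration above, not an actual MessageFormat feature, and the function name is invented):

// Sketch only: rewrite the illustrative {term,ID{text}} markup into HTML so
// a web component can target the term with a class selector.
function termMarkupToHtml(message: string): string {
  return message.replace(
    /\{term,(\w+)\{([^}]*)\}\}/g,
    (_match, id: string, text: string) => `<div class="${id}">${text}</div>`,
  );
}

console.log(termMarkupToHtml("You may {term,term_id_here{cancel}} at any time."));
// -> You may <div class="term_id_here">cancel</div> at any time.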

It is often hindering translation. ... forcing translators to keep some kind of tagging

Are translators expected to work directly on this format? CAT tools provide UI for translating a message. In my understanding they can work without learning the internal message syntax used under the hood. Please correct me if this is a misunderstanding.

In many languages the translation is not 1:1

This is totally true and I think this is an area where it would take a collaboration among t9n/l10n specialists, linguists and ML experts to figure out an optimal solution. What I am proposing is to reserve a markup notation for future extension so the message could optionally indicate the presence of a term and its location.

Messes up the translation memory leveraging

It is incorrect to feed a consumer content that it doesn't understand. I would envision some filters to convert the markup syntax to a simpler alternative that is more commonly supported. These could be used while consuming systems don't yet have native support for the new syntax.
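For example, a hedged sketch of such a downgrade filter (again using the made-up {term,ID{...}} notation from above), stripping the markup for consumers that don't understand it while leaving the visible text intact:

// Illustration only: downgrade the made-up {term,ID{text}} markup to plain
// text for consumers with no native support for the extended syntax.
function stripTermMarkup(message: string): string {
  return message.replace(/\{term,\w+\{([^}]*)\}\}/g, "$1");
}

console.log(stripTermMarkup("You may {term,term_id_here{cancel}} at any time."));
// -> You may cancel at any time.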

SSML support (#32) is another requirement for which I think inline markup is appropriate.

#33, Support for spoken forms, is related to #32, but it may also be desirable to mark this up inline, as @grhoten indicated.

What do you think about embedded spans in a different language? I think they should be marked up inline.

mihnita commented 4 years ago

unless they really do the markup manually (they shouldn't) ... Are translators expected to work directly on this format

There is a strong push for this to be the case: https://github.com/unicode-org/message-format-wg/issues/48

Any tags slow down translation even with tools protecting said tags.

Just "navigating" around them and making sure that the right text is "inside" the tag will slow you down.

The other part of the "slow down" is mental effort. It is a lot easier to translate a full sentence than a sentence with a placeholder, because now you have to think about what that placeholder could be at runtime, whether it will match grammatically with the stuff around it, etc.

It is not a big deal for a "decent" number of tags. But translators DO protest if content contains a lot of tags, and they will ask for a higher rate for "technical complexity".

With this (new?) format we don't just ask them not to damage existing placeholders; we ask them to change placeholders, navigate to other strings and add extra info, and so on.

See the examples for inflection, where we ask translators to tag the placeholders with info like grammatical cases, and ask them to find the string(s) that can be used in the placeholder and add the grammatical cases if they are missing: https://github.com/projectfluent/fluent/wiki/Fluent-and-ICU-MessageFormat#multi-variant-messages

Mihai

dchiba commented 4 years ago

No matter how the CAT tool supports terminology management, it must locate the term to process it.

Defining an inline markup scheme for terminology management enables components to exchange information they can both understand. For example, search and replace of a term is a common TMS requirement. An application may allow the user to suggest changing a term: you double-click to select a word in the UI and bring up a menu that might say "Suggest using a better word", which triggers a special mode in which you can submit a suggestion. Then the application may set inline markup according to your suggestion and send it to the backend service, where the suggestion is accepted and further processing follows (lookup of other instances, requesting machine translation, starting a human approval flow, etc.). This application may show a preview of the suggested change(s) in a WYSIWYG manner by applying a special highlighting style to the terms in the strings returned from the service in the standard inline markup.

This application could have human translators review the suggested changes, in which case the inline markup should not be visible to them. If a raw file needs to be processed by them, then there can be a step to strip the unwanted markup away.

I see translators as good at translation and not as good at dealing with markup. I think we should be able to hide unwanted markup as needed, while taking advantage of the markup where no human translators are involved.

aphillips commented 11 months ago

As mentioned in today's telecon (2023-09-18), closing old requirements issues.