Closed romulocintra closed 1 year ago
Not only this, but we should make it hard or even impossible to write alternates as sentence fragments.
I would like to see a format that is not white-space sensitive
There are pros and cons for this.
In addition to @mihnita's points, there is no well-documented general process for unwrapping (joining) lines (there is on-going discussion between Unicode and W3C I18N and CSS about this topic). This is one of those things that looks simple but has hidden complexity. I have written code that does this, but we need to approach whitespace handling with care.
If there are no language-specific predefined cases, this also means that each variable could use a different keyword to represent the same case?
Yes, this kind of freedom is the part that I don't like too much about the way Fluent handles inflections.
One can use case_gen / gen / geni / genitive, or m / masc / masculine / male, and it is all good. No consistency.
And that might also break leveraging...
On the other side the alternative is to capture in the spec all the grammatical complexities in all the human languages. Which I'm also not sure we want to do :-)
We might do what other standards do in some cases (Unicode, BCP 47): say that there are standard "grammatical attributes", subject to "registration". And we collect and register them in time.
Examples: Language Subtag Registry - IANA, IANA Protocol Registries, Unicode Ideographic Variation Database
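To make the registration idea a bit more concrete, here is a minimal sketch of what checking selector keywords against a registry could look like. The `REGISTERED` table and the `validate_selector` helper are hypothetical names invented for illustration, and the entries are not taken from any published registry.

```python
# Hypothetical registry of grammatical attribute values (illustrative only).
REGISTERED = {
    "case": {"nominative", "genitive", "dative", "accusative"},
    "gender": {"masculine", "feminine", "neuter"},
}


def validate_selector(attribute: str, value: str) -> str:
    """Return value if it is registered for attribute, else raise ValueError."""
    allowed = REGISTERED.get(attribute)
    if allowed is None:
        raise ValueError(f"unregistered attribute: {attribute!r}")
    if value not in allowed:
        raise ValueError(f"unregistered {attribute} value: {value!r}")
    return value
```

With something like this, "genitive" passes while "gen" or "geni" is rejected at authoring time, which would keep keywords consistent across messages and help leveraging.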
Another point to the wish-list ("nice to have", not "must have"): "out of the box sentence casing"
What I mean by that: "Your {who} tried to contact you" (who: friend / colleague / sister / brother / etc.) When the target language needs a different order we might end up with something like this: "{who} of yours to contact you tried"
And when we replace the placeholder we end up with a lowercase-starting message. That is not good. And requires the developer to explicitly sentence-case. They should know about that need, or wait for a bug to be filed. And doing it is not as trivial as it looks.
I would argue that this should happen by default (and allow one to disable it if that is a problem). That is a no-op if there was no placeholder in the beginning. And most of the time the messages are full sentences / paragraphs (if they are not, that is often an i18n mistake)
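A rough sketch of "sentence casing by default, with an opt-out", assuming Python's `str.format` as a stand-in for the real placeholder syntax. `format_message` and its `sentence_case` flag are hypothetical names. The upper-casing here is locale-unaware (it would get Turkish dotted/dotless i wrong, for example), which is part of why this is not as trivial as it looks.

```python
def format_message(pattern: str, sentence_case: bool = True, **args: str) -> str:
    """Substitute placeholders, then upper-case a leading lowercase letter."""
    text = pattern.format(**args)
    # No-op when the message already starts with an uppercase letter,
    # a digit, punctuation, or a caseless script.
    if sentence_case and text and text[0].islower():
        text = text[0].upper() + text[1:]
    return text


print(format_message("Your {who} tried to contact you", who="friend"))
print(format_message("{who} of yours to contact you tried", who="friend"))
```

The first call is unchanged by the casing step; the second one gets its leading "friend" capitalized without the developer having to know about it.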
@nbouvrette - is there a place where we could continue this thread without taking over the requirements one?
Do you think this would be easier in a document? (like Google docs?) It would keep the different "threads" together.
On Tue, Jan 21, 2020 at 9:52 AM Mihai Nita wrote:
When the input is 2 messages (singular / plural) and the output is 4 messages (for example because Russian has 4 plural forms), then we run into problems.
There are only 2 grammatical number values (singular and plural) for Russian, but the big issue is that grammatical case is always involved. Russian typically has 3 possible values when you have a numerical value and a noun that you are trying to get into grammatical agreement. I'd be happy to share the complicated table on this topic if someone is curious about it.
George
In our environment we collapse multiple white space characters (newlines and space) into a single space. From experience, translators may not have the discipline nor the desire to care for the number of spaces used in a message. The East Asian languages may want to use spaces too when mixing scripts, but they tend to be more aware of how spacing works. So for us, it's mostly space insensitive.
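As a minimal sketch of that kind of collapsing (not our actual code; `collapse_whitespace` is a made-up name): runs of spaces, tabs and line breaks become one space. The character class is explicit on purpose, on the assumption that a no-break space (U+00A0) should survive, which a plain `\s` would not guarantee in Python.

```python
import re

# Only ASCII space, tab, CR and LF are collapsed; U+00A0 is left alone.
_COLLAPSIBLE = re.compile(r"[ \t\r\n]+")


def collapse_whitespace(message: str) -> str:
    """Replace each run of collapsible white space with a single space."""
    return _COLLAPSIBLE.sub(" ", message)


print(collapse_whitespace("Hello,\n    world"))
```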
I can only see this insensitivity being an issue if you're working with a monospaced font where column alignment is a part of the message instead of a part of the UI.
George
On Tue, Jan 21, 2020 at 10:25 AM Mihai Nita wrote:
I would like to see a format that is not white-space sensitive
There are pros and cons for this.
There are languages that don't use spaces between words (Chinese, Japanese, Thai, etc.) These are very clunky to use in a not white-space sensitive format. Everybody is "free to wrap things", but these languages are forced to keep everything in one line.
The other disadvantage of a format that is not white-space sensitive is that at times one would really want to preserve spaces / new-lines. Then one would need some mechanism to specify that.
So I would not necessarily "carve in stone" this one (yet)
Probably calling them "plural forms" is not the accurate description.
But the problem is that 2 English forms might have to be translated into 4 Russian "forms". See an example here: http://www.unicode.org/cldr/charts/latest/supplemental/language_plural_rules.html#ru
Each country / language might name this "behavior" slightly different.
For example Romanian has 3 forms. But linguists (and the official grammar books) describe it as "2 plurals: singular and plural", but the plural has "2 different forms"
Often native speakers even deny that the language has 3 forms, they only think "singular and plural". But when asked to say "12 files" vs "24 files" they notice that "some plurals" require different translations.
Anyway... the problem in most (all?) translation tools is this kind of n:m mapping. No matter what we call it.
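To illustrate the n:m mapping, here is a small sketch of the CLDR cardinal categories for Russian and Romanian, restricted to non-negative integers (the fourth Russian category, "other", only applies to fractions, so it does not appear below). The function names and the sample "files" translations are my own illustrations, not part of any tool.

```python
def plural_category_ru(n: int) -> str:
    """CLDR cardinal category for a non-negative integer in Russian."""
    if n % 10 == 1 and n % 100 != 11:
        return "one"
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return "few"
    return "many"


def plural_category_ro(n: int) -> str:
    """CLDR cardinal category for a non-negative integer in Romanian."""
    if n == 1:
        return "one"
    if n == 0 or 1 <= n % 100 <= 19:
        return "few"
    return "other"


# English source: 2 entries. Targets: 3 entries each (for integers).
EN = {"one": "{n} file", "other": "{n} files"}
RU = {"one": "{n} файл", "few": "{n} файла", "many": "{n} файлов"}
RO = {"one": "{n} fișier", "few": "{n} fișiere", "other": "{n} de fișiere"}

for n in (1, 2, 5, 12, 21, 24):
    print(RU[plural_category_ru(n)].format(n=n), "|",
          RO[plural_category_ro(n)].format(n=n))
```

Note how 12 and 24 land in different Romanian categories ("few" vs. "other"), which is the "12 files" vs. "24 files" difference mentioned above, and how a tool that hands out 2 source entries has no slot for the extra target entries.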
I would like to see a format that is not white-space sensitive
There are pros and cons for this.
My original reference to this was related to the format string syntax, not the literal message/placeholders.
For example, in message_format, the spec says that there is a literal newline between female { and {num_guests, plural, ... when the intent is very likely for that not to be true.
message = """
{gender_of_host, select,
female {
{num_guests, plural, offset: 1
=0 {{host} does not give a party.}
=1 {{host} invites {guest} to her party.}
=2 {{host} invites {guest} and one other person to her party.}
other {{host} invites {guest} and # other people to her party.}}}
male {
{num_guests, plural, offset: 1
=0 {{host} does not give a party.}
=1 {{host} invites {guest} to his party.}
=2 {{host} invites {guest} and one other person to his party.}
other {{host} invites {guest} and # other people to his party.}}}
other {
{num_guests, plural, offset: 1
=0 {{host} does not give a party.}
=1 {{host} invites {guest} to their party.}
=2 {{host} invites {guest} and one other person to their party.}
other {{host} invites {guest} and # other people to their party.}}}}
"""
Since any string-based format is likely to be quite long and formatted over multiple lines, the syntax itself should ideally not be white-space sensitive.
My original reference to this was related to the format string syntax, not the literal message/placeholders.
Ah, I see what you mean. The current MessageFormat syntax is a bit of a weird one. It ignores whitespaces in the syntax outside the messages, but it preserves the whitespaces in the messages.
By "message" I mean the stuff that ends up on screen, the string inside the {...}, for example
{{host} invites {guest} and # other people to her party.}
Comparing it to programming languages (except for Python :-) I think of the "decisions" part as code, and the messages inside the {...} as strings.
switch(gender_of_host) {
case female: return "{host} invites {guest} to her party.";
case male: return "{host} invites {guest} to his party.";
default: return "{host} invites {guest} to their party.";
}
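The same split can be sketched as runnable code, here in Python with a dictionary standing in for the switch. The layout of the selection logic is free-form, while every character inside the pattern strings, spaces included, reaches the output. `PATTERNS` and `invite` are names made up for this illustration.

```python
# "Decisions" part: how this table is laid out does not affect the output.
PATTERNS = {
    "female": "{host} invites {guest} to her party.",
    "male":   "{host} invites {guest} to his party.",
    "other":  "{host} invites {guest} to their party.",
}


def invite(gender_of_host: str, host: str, guest: str) -> str:
    """Pick a pattern by gender (falling back to 'other') and fill it in."""
    pattern = PATTERNS.get(gender_of_host, PATTERNS["other"])
    # "Message" part: the string is used verbatim, white space and all.
    return pattern.format(host=host, guest=guest)


print(invite("female", "Alice", "Bob"))
```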
One more thought about white-spaces:
I would like to nominate the following.
- Well-defined timezone handling
The way a timezone adjustment is made in date/time formatting should be clearly specified. The default timezone conversion behavior should be reasonable and unambiguous. The message author should be able to optionally specify a desired timezone conversion. This is meant to make it easier for applications to support timezones correctly.
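One possible reading of "reasonable and unambiguous", sketched in Python: refuse values with no time zone, keep the value's own zone by default, and convert only when a target zone is given explicitly. `format_time`, its `target` parameter, and the output pattern are assumptions for illustration, not a proposal for the actual behavior.

```python
from datetime import datetime, timedelta, timezone, tzinfo
from typing import Optional


def format_time(value: datetime, target: Optional[tzinfo] = None) -> str:
    """Format an aware datetime, optionally converting to a target zone."""
    if value.utcoffset() is None:
        # A naive datetime has no defined instant, so any default would guess.
        raise ValueError("naive datetime: the time zone would be ambiguous")
    if target is not None:
        value = value.astimezone(target)
    return value.strftime("%Y-%m-%d %H:%M %z")


sent = datetime(2020, 1, 21, 17, 52, tzinfo=timezone.utc)
print(format_time(sent))                                   # keeps UTC
print(format_time(sent, timezone(timedelta(hours=-8))))    # explicit conversion
```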
- Default formatters
I think this is extremely obvious, but I wanted to emphasize that it is highly desirable for a message formatter that can handle date/time/number/etc. per CLDR to be readily available for applications, to meet the common need to localize locale-sensitive data values.
For example, the need for support of common predefined date/time/number formats, as well as skeleton patterns, cannot be overemphasized. In addition, a message author should be able to express the intent to use relative time formatting or compact number formatting.
Advanced requirements should be covered by "pluggable formatters", while this one covers the basic ones.
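The split between default and pluggable formatters might look roughly like the sketch below. `FormatterRegistry` is a hypothetical class, and the two built-ins are locale-unaware placeholders standing in for real CLDR-backed formatters.

```python
from datetime import date
from typing import Any, Callable, Dict

Formatter = Callable[[Any], str]


class FormatterRegistry:
    """Default formatters out of the box, plus a hook for custom ones."""

    def __init__(self) -> None:
        # Placeholders only: a real implementation would use CLDR data.
        self._formatters: Dict[str, Formatter] = {
            "number": lambda value: f"{value:,}",
            "date": lambda value: value.isoformat(),
        }

    def register(self, name: str, formatter: Formatter) -> None:
        """Add a pluggable formatter; defaults cannot be silently replaced."""
        if name in self._formatters:
            raise ValueError(f"formatter already registered: {name!r}")
        self._formatters[name] = formatter

    def format(self, name: str, value: Any) -> str:
        return self._formatters[name](value)


registry = FormatterRegistry()
registry.register("percent", lambda value: f"{value:.0%}")
print(registry.format("number", 1234567))
print(registry.format("date", date(2020, 1, 21)))
print(registry.format("percent", 0.25))
```

Rejecting duplicate names is one way to avoid conflict between pluggable formatters and the defaults; other policies are of course possible.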
- Inline comments / extendable inline
This is the ability to attach a comment to a specific span of the message. It is useful for making clear to translators exactly which part of the message a translation note refers to. For example (the markup with [] is for illustration of the concept only):
Let's [[environmentally conscious]go green].
@nbouvrette mentioned "extendable inline", and an inline comment could be thought of as an instance of it. There can be many other ways to use it, such as designating an untranslatable span:
Type [[translate='no']history] on the command line.
Internationalization Tag Set (ITS) from W3C defines various "data categories" of the information that can be set on a span for automated processing of human language. As a matter of fact, comments are in the "Localization Note" category and whether a span is to be translated or not is in the "Translate" category.
https://www.w3.org/TR/its20/#basic-concepts-datacategories
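As a sketch of how span-level information could be separated from the text, the following parses the illustrative [[note]text] markup from the examples above into plain text plus a list of annotated ranges. The markup is only the concept illustration used in this comment, and `Span` and `parse_annotated` are hypothetical names; nesting and escaping are not handled.

```python
import re
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    start: int  # offset into the plain text
    end: int    # exclusive end offset
    note: str   # comment or attribute attached to the span


_SPAN = re.compile(r"\[\[([^\]]*)\]([^\]]*)\]")


def parse_annotated(message: str) -> Tuple[str, List[Span]]:
    """Strip [[note]text] markup, returning plain text and its spans."""
    parts: List[str] = []
    spans: List[Span] = []
    pos = 0      # position in the annotated input
    length = 0   # length of the plain text built so far
    for match in _SPAN.finditer(message):
        before = message[pos:match.start()]
        note, text = match.group(1), match.group(2)
        parts += [before, text]
        length += len(before)
        spans.append(Span(length, length + len(text), note))
        length += len(text)
        pos = match.end()
    parts.append(message[pos:])
    return "".join(parts), spans


print(parse_annotated("Let's [[environmentally conscious]go green]."))
print(parse_annotated("Type [[translate='no']history] on the command line."))
```

Keeping the notes as offsets into the plain text means the translator-facing string stays clean, while a tool can still show or enforce the span information.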
In some cases, the metadata may apply to the whole string. I assume that is covered by the bullet that says 'Messages should have more context “description” or ”metadata”'.
I would like to nominate the following.
- Well-defined timezone handling
The way a timezone adjustment is made in date/time formatting should be clearly specified. The default timezone conversion behavior should be reasonable and unambiguous. The message author should be able to optionally specify a desired timezone conversion. This is meant to make it easier for applications to support timezones correctly.
It's my preference that the time zone handling should be a part of the calendar object being formatted and not in the message format.
- Default formatters
I think this is extremely obvious, but I wanted to emphasize that it is highly desirable for a message formatter that can handle date/time/number/etc. per CLDR to be readily available for applications, to meet the common need to localize locale-sensitive data values.
For example, the need for support of common predefined date/time/number formats, as well as skeleton patterns, cannot be overemphasized. In addition, a message author should be able to express the intent to use relative time formatting or compact number formatting.
Advanced requirements should be covered by "pluggable formatters", while this one covers the basic ones.
From my experience with numbers, there are actually 3 CLDR number formats that need to be used and sometimes customized at formatting time. They come from the ICU classes DecimalFormat, RuleBasedNumberFormat and CompactDecimalFormat. For example, I may want to print out with DecimalFormat and speak an ordinal with RuleBasedNumberFormat.
I agree with you in principle. There is some overlap within CLDR and ICU that might need to be addressed too. Avoiding conflict and confusion between the pluggable formatters should be addressed by this functionality.
Though from experience, pluggable formatters for more complex concepts frequently need context. For example, I may need the correct definite article, preposition or pronoun coming out of a custom formatter. So any advanced pluggable formatters should be able to convey language specific context of where it's being used in a sentence.
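A tiny illustration of why context matters, using German definite articles in the singular: the same noun needs a different article depending on the grammatical case of the slot it fills, so the formatter has to be told the case. The table and `with_article` are illustrative only; real noun inflection and plurals are not handled.

```python
# German singular definite articles, keyed by (gender, case).
DEFINITE_ARTICLES = {
    ("masculine", "nominative"): "der",
    ("masculine", "accusative"): "den",
    ("masculine", "dative"): "dem",
    ("feminine", "nominative"): "die",
    ("feminine", "accusative"): "die",
    ("feminine", "dative"): "der",
    ("neuter", "nominative"): "das",
    ("neuter", "accusative"): "das",
    ("neuter", "dative"): "dem",
}


def with_article(noun: str, gender: str, case: str) -> str:
    """Prefix noun with the article required by its gender and sentence case."""
    return f"{DEFINITE_ARTICLES[(gender, case)]} {noun}"


print(with_article("Benutzer", "masculine", "nominative"))
print(with_article("Benutzer", "masculine", "dative"))
```

The gender comes from the noun, but the case can only come from the message, which is the context a pluggable formatter would need to receive.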
- Inline comments / extendable inline
I have had mixed results with this. Some translators, especially first-timers, translate the comments too, not realizing that the final message recipient won't see them, which wastes translation time. There are other times when there is information that is best conveyed inline. Sometimes the comments get in the way of readability. I can see the pros and cons of such functionality.
@nbouvrette - is there a place where we could continue this thread without taking over the requirements one?
Please open new issue. This new thread can work as knowledge share about MF and related topics.
I opened up issue #15 accordingly to discuss HTML.
(In response to the question about whether Google Docs would be easier, I think the answer is the same as to the question of whether we should have a chat group -- we discussed a few months ago to keep discussions in Github so that they're public and searchable and no extra logins, based on people's past experiences.)
@mihnita
if the XLIFF 1.2 (final spec in Feb 2008) is not properly supported, how long will it take for CAT tools to support it?
It's been 12 years already... I think it's safe to say it will never be fully supported? :) And it's not just CAT tools, it's also TMSes. There are dozens of both these products on the market and some of the top players are not known to move very quickly.
And this is also an argument for developing the format while considering at all times how that will interact with existing CAT tools. That includes not only how things are presented to the translators, but how leveraging works (or not).
Getting broad support for a new format can be quite challenging (XLIFF is a good example). I still believe it would be a lot simpler if we can find a way to remain format agnostic.
Another advantage of staying format agnostic is that most TMSes support multi-level filters when parsing strings, which means you could have an HTML document with MessageFormat strings inside, and both could be parsed and presented correctly to linguists. This could also work the other way around.
Most CAT / TMs assume a 1:1 model "you give me a source message, I give you back a translated message". When the input is 2 messages (singular / plural) and the output is 4 messages (for example because Russian has 4 plural forms), then we run into problems.
Exactly, this is the biggest challenge - most linguistic tools expect symmetric keys in both the input and output and one input can have multiple outputs in multiple languages that have different rules. This is also why MessageFormat works well, regardless of the file format.
Do you think this would be easier in a document? (like Google docs?) It would keep the different "threads" together.
I tried Google docs to have conversations in the past and so far Git seems better - I would still love to propose having our own Slack at some point if we start having more active conversations but Git is also good at keeping everything documented. I just tagged you in this new thread when you have a chance!
The current MessageFormat syntax is a bit of a weird one. It ignores whitespaces in the syntax outside the messages, but it preserves the whitespaces in the messages.
Is there a reason for this? I wrote a parser that preserves both whitespaces. I used this both for syntax highlighting and also auto-completion/validation & error detection. It's a lot easier to be able to refer to a character position without changing the input for example.
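For reference, a minimal sketch of the idea (not my actual parser): a tokenizer that keeps white space as tokens, so every token carries its original character offsets and the input can be rebuilt exactly. `Token` and `tokenize` are illustrative names, and the token kinds are deliberately coarse.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    kind: str   # "open", "close", "space" or "text"
    start: int  # offset of the first character in the source
    end: int    # exclusive end offset
    text: str


# Every character matches exactly one alternative, so nothing is dropped.
_TOKEN = re.compile(
    r"(?P<open>\{)|(?P<close>\})|(?P<space>\s+)|(?P<text>[^{}\s]+)"
)


def tokenize(source: str) -> List[Token]:
    """Split source into tokens without discarding any white space."""
    return [
        Token(match.lastgroup, match.start(), match.end(), match.group())
        for match in _TOKEN.finditer(source)
    ]


source = "female {\n  {host} invites {guest}}"
tokens = tokenize(source)
print("".join(token.text for token in tokens) == source)
```

Because offsets refer to the unmodified input, error messages, highlighting and auto-completion can point at exact positions; whether white space is significant can then be decided in a later pass.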
Inline comments / extendable inline
I have had mixed results with this. Some translators, especially first-timers, translate the comments too, not realizing that the final message recipient won't see them, which wastes translation time. There are other times when there is information that is best conveyed inline. Sometimes the comments get in the way of readability. I can see the pros and cons of such functionality.
@grhoten
+1 on your comment - there are other ways to provide comments (typically called context) to linguists, which are handled correctly today by most TMSes. If we need inline context, there might be something too complex with the syntax.
Closing resolve-candidates per discussion in 2023-07-24 call
List of requirements to consider for MF