projectfluent / fluent

Fluent — planning, spec and documentation
https://projectfluent.org
Apache License 2.0

New syntax for meta-data #7

Closed stasm closed 7 years ago

stasm commented 7 years ago

Goal

Provide a simple means for defining private meta-data for messages.

Description

Currently, meta-data can be added to messages by using traits. Traits without namespaces are considered private.

brand-name = Firefox
  [gender] masculine

#5 and #6 will simplify traits and we'll need a new way to encode meta-data.

The proposal is to use binary tags attached to the value:

#masculine
brand-name = Firefox

The benefit of the binary approach is that there's usually no need to name the property in question (gender).

Discussion

https://groups.google.com/forum/#!topic/mozilla.tools.l10n/dhWfBXHzuZI

Pike commented 7 years ago

One of the big wins of FTL was that everything about the message was in the value part of the syntax.

There is some level of beauty that you start the message with the ID. Right now, only comments break that. For tooling that is a boost.

Also, does the

#masculine vs
# masculine

create challenges in error recovery? The ' ' would also be a typo that would be really hard to debug, as

# masculine
#masculine
bob = Bob

would be totally legal, right?
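To make the error-recovery concern concrete, here is a minimal Python sketch. It is not Fluent's parser: `classify_line` is a hypothetical helper, and the rule it encodes (a `#` followed directly by a non-space character is a tag, anything else after `#` is a comment) is only an assumption about how the proposal above might be tokenized.

```python
def classify_line(line: str) -> str:
    """Classify a line under the assumed rule: '#x' is a tag, '# x' is a comment."""
    if not line.startswith("#"):
        return "other"
    rest = line[1:]
    # Hypothetical rule: a tag has no whitespace between the sigil and the name.
    if rest and not rest[0].isspace():
        return "tag"
    return "comment"


lines = ["# masculine", "#masculine", "bob = Bob"]
kinds = [classify_line(line) for line in lines]
print(kinds)  # ['comment', 'tag', 'other']
```

Both `#` lines are accepted, so a stray space silently turns a tag into a comment and a parser under this rule has nothing to report.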

zbraniecki commented 7 years ago

I think my vote would go for semantic comments for meta-data.

stasm commented 7 years ago

I think we used the # sigil as an example and then I copied it here without realizing we already use # for comments!

I'd love to discuss semantic comments more. JSDoc-style @param clauses would certainly help tooling. And it might be possible to encode language-specific meta-information in the comments as well (@meta foo?).

Pike commented 7 years ago

@phlax, @ta2-1, does prefixing messages with metadata impact how pootle implements l20n support?

ta2-1 commented 7 years ago

Hi @Pike, thanks for the heads up. I'll check.

ta2-1 commented 7 years ago

@Pike, I’d say that we use l20n libraries so as long as that is consistent we’re pretty much unimpacted by syntax changes. I believe that @mathjazz is in the same situation.

stasm commented 7 years ago

@Pike suggested that we separate the meta-information from semantic comments. The reason for this is that he sees semantic comments as relating to the toolchain and the process (@param, @rev), while the meta-information is strictly language-specific and private.

He suggested the following syntax:

# The short name of the app. 
brand-name = {
       *[nominative] Firefox
        [genitive] Firefox's
    }

    [masculine, inanimate]

The reason to use the brackets is that it closely resembles the way this information will be used. This in turn improves the copy&paste-ability of the syntax:

has-crashed = { META(brand-name) ->
       *[masculine] { brand-name } has crashed.
    }

Pike also likes the idea that everything defined below the identifier belongs to the message and is editable by the localizer.

stasm commented 7 years ago

I like @Pike's proposal and I think it's sound. I'd like to hear @zbraniecki's thoughts, of course. I have some reservations, too: I was hoping we could piggy-back on #16 to implement this. Meta-information should be rare enough that maybe it shouldn't get its own syntax. OTOH, it also is what makes Fluent and FTL very powerful.

On the note of being rare enough, @Pike and I also discussed not allowing meta-information on messages which have attributes. Such messages are meant to localize UI widgets and should not carry grammatical information. In fact, maybe we should rename meta-information to grammatical data or something similar.

zbraniecki commented 7 years ago

He suggested the following syntax:

My initial reaction is that this syntax seems confusing.

Meta-information should be rare enough that maybe it shouldn't get its own syntax.

This is my thinking too. As we said in the beginning - in all our work with L20n/FTL so far, we failed to find another example of the use case beyond gender. Since we raised this a month ago we still haven't found a single other use case.

For that reason I find the idea of adding a specific syntax excessive. It adds a new source of potential bugs and errors in malformed content, in order to serve a single use case.

It will work, and as we said it's more important how users will retrieve that bit because it'll be way more common, but I'm not sure if we should be adding a whole new data type on Message to serve this individual goal.

On the other hand, I agree that semantic comments as we're thinking about them are functionally different from meta information like gender. My thought experiment is that I can't see a reason for a localizer to call for META(rev).

So, I'll probably be reluctantly ok with this proposal, but I have another:

# @rev: 2
# @meta: masculine
brandName = {
 *[nominative] Firefox
  [possessive] Firefoksa
}

caller = { META(brandName) ->
 *[masculine] Foo
  [feminine] Faa
}

I recognize that it doesn't play with @Pike 's "all localizable info below identifier", but I guess I just don't share this concern.

Pike commented 7 years ago

The idea behind keeping the localizer data beneath the ID is one of incremental tool support:

It allows l10n tools to have the most rudimentary support, like we currently do for pontoon. You get a text area, and anything in that area is to be edited by localizers. With localizer-editable semantic comments, that's a lot more complex. You can also more easily allow switching to a text editor if a localizer needs a feature which your tool doesn't support yet.

The other part about using the [] mark-up (to avoid the word syntax) is that [] denotes the option definition and reference for variants. Meta is the same thing in the reverse direction, and there's beauty in keeping [] as an easy to copy-n-paste markup on both source and target of the reference.

I can see us explaining [] as references between messages, and you never need to translate one markup into another.

stasm commented 7 years ago

You can also more easily allow switching to a text editor if a localizer needs a feature which your tool doesn't support yet.

Wouldn't it be easier for a tool to gracefully downgrade to a text editor if the whole message, including the comments can be parsed and serialized?

The other part about using the [] mark-up (to avoid the word syntax) is that [] denotes the option definition and reference for variants. Meta is the same thing in the reverse direction,

I'm still not completely sold on this reverse direction thing. Grammatical information defined as meta-data has little to do with variants, doesn't it?

I can see us explaining [] as references between messages, and you never need to translate one markup into another.

There's some beauty in using [ ] as well as some confusion. When you define a variant of a select-expression with brackets you're saying: match this thing inside. [other] Other means match 'other' and return 'Other'. So, at least for me, the brackets mean match. Here OTOH, the brackets define a piece of grammatical information and I'm still struggling with this inconsistency.

I don't have any better ideas right now and I see the values of everything previously suggested here. I'm tempted to postpone this issue until a later milestone.

zbraniecki commented 7 years ago

When you define a variant of a select-expression with brackets you're saying: match this thing inside. [other] Other means match 'other' and return 'Other'. So, at least for me, the brackets mean match. Here OTOH, the brackets define a piece of grammatical information and I'm still struggling with this inconsistency.

This sentence describes my sentiment very well.

stasm commented 7 years ago

I don't want to rush a design decision here. Let's move this out of the scope of 0.2. This means that temporarily the syntax will not give any dedicated way of defining language-specific grammatical data.

(As a workaround, it's still possible to create entirely new local messages containing that data and refer to them, e.g. gender-of-brand-name = masculine. This is not recommended though.)

Pike commented 7 years ago

Do we get a good baseline to, say, ship L20n on Android without coming to a conclusion here?

To the actual conversation, let me try to depict my thinking:

brand.ftl:

brandName = {
    *[nominative] Firefox
     [possessive] Firefoksa
}
[gender] masculine

updates.ftl:

should_restart = { META(brandName) ->
    *[feminine] I would like her { brandName[possessive] } to be restarted
     [masculine] I would like his { brandName[possessive] } to be restarted
}

(omg, butchering some other language's grammar here)

My point is that when I resolve the variants of brandName, I use [] on both sides.

I think it's a good idea for the reverse direction to also use [] on both sides. If not that, then at least to use the same mark-up on both sides. The pre-ID comment proposals use different markup on one side compared to the other, and that makes life hard.

zbraniecki commented 7 years ago

Do we get a good baseline to, say, ship L20n on Android without coming to a conclusion here?

I believe we should reach a solution here before we release L20n on Android.

stasm commented 7 years ago

I believe we should reach a solution here before we release L20n on Android.

+1 to that. I just don't want to lower the quality of 0.2 by rushing this decision right now.

stasm commented 7 years ago

[gender] masculine

@Pike, did you mean [masculine]?

The pre-ID comment proposals use different markup on one side compared to the other, and that makes life hard.

I see what you mean: in a selector-less list of variants, we also use brackets to define variants and we match them from the outside. In case of variants, however, the symmetry is between the definition and the reference. Both use [key]:

brand-name = {
       *[nominative] Firefox
        [locative] Firefoksie
    }
about = O { brand-name[locative] }

You'll never find yourself trying to match locative in another select-expression.

This is not true for grammatical information. Once it's defined, it's meant to be matched in other select-expressions. If you somehow define brand-name to be feminine, you can then match the gender elsewhere:

has-been-updated = { brand-name } { META(brand-name) ->
       *[masculine] został zaktualizowany.
        [feminine] została zaktualizowana.
    }

Furthermore, you must not assume that you can reference feminine in any other way than by using META. In particular, brand-name[feminine] will break.


...unless it doesn't. What if we used variants for all grammatical information? Variants are private and can be accessed from other messages. Grammatical information will be only added to messages which already may have other grammatical variants. We wouldn't be adding any new syntax. The word "variant" may not be the best one here, but in general, the construct seems to lend itself well to the use-case.

In English:

brand-name = Firefox
about = About { brand-name }
updated = { brand-name } has been updated.

In French:

brand-name = {
       *[nom] Firefox
        [genre] masculin
    }
about = À propos de { brand-name }
updated = { brand-name } { brand-name[genre] ->
       *[masculin] a été mis à jour.
        [féminin] a été mise à jour.
    }

In Polish:

brand-name = {
       *[mianownik] Firefox
        [miejscownik] Firefoksie
        [rodzaj] męski
    }
about = O { brand-name[miejscownik] }
updated = { brand-name } { brand-name[rodzaj] ->
       *[męski] został zaktualizowany.
        [żeński] została zaktualizowana.
    }

Semantically, gender isn't a facet of the string value of brand-name but maybe that's okay for now. We can still choose to add an explicit syntax for this later.

Pike commented 7 years ago

Groundhog Day. That's the train of thought that led us to traits.

Pike commented 7 years ago

On a less snarky note, putting meta data into the variants would

stasm commented 7 years ago

Groundhog Day. That's the train of thought that led us to traits.

Yes, I know. I'm looking for solutions everywhere I can find them :)

stasm commented 7 years ago

That's the train of thought that led us to traits.

Also, I feel like this is related but not accurate. We've always had three types of data: variants, grammatical descriptors and attributes. With traits, we lumped all of them together. Previously (L20n 1.0) descriptors and attributes were expressed with the same syntax. Even earlier (your designs from a long time ago) attributes and variants were together, while descriptors were separate.

I feel like we're going in circles.

stasm commented 7 years ago

allow to return the value to the program (good? bad?)

Probably bad, or at least unintended. That would only happen if the meta data variant has the * prefix, right?

not allow partial matching for something like [masculine, inanimate]

That would be possible with nested select-expressions or with list-selectors (#4).

What I really dislike about my proposal is that it forces localizers to find names for the meta-data: gender, animacy, etc. I'd much prefer a solution with binary descriptors, like "masculine". I'll come back to this issue next week and try to get some perspective this week.

stasm commented 7 years ago

After a short break the idea of putting the grammatical information into variants seems bad, I admit. Perhaps it was a necessary step back for me to consider other options :)

Over the weekend I did some small-scale user-testing. I presented two FTL files, one in English and another one in Polish to a few friends and asked them to complete the Polish translation. The only thing they knew about FTL beforehand was that translations had unique identifiers. The Polish file also already featured some grammar-sensitive syntax.

After completing the task (which went very well) I asked a few follow-up questions. Below is a bullet-point summary of the conclusions:

Based on that, here is my newest proposal:

[[ English ]]

# A short name of the app.
brand-name = Firefox
about-app = About { brand-name }
has-updated = { brand-name } has been updated.

[[ French ]]

# A short name of the app.
brand-name = Firefox
    +masculin

about-app = À propos de { brand-name }
has-updated = { brand-name ->
       *[+masculin] { brand-name } a été mis à jour.
        [+feminin] { brand-name } a été mise à jour.
    }

[[ Polish ]]

# A short name of the app.
brand-name = {
       *[mianownik] Aurora
        [miejscownik] Aurorze
    } 
    +żeński

about-app = O { brand-name[miejscownik] }
has-updated = { brand-name ->
       *[+męski] { brand-name } został zaktualizowany.
        [+żeński] { brand-name } została zaktualizowana.
    }

flodolo commented 7 years ago

I wonder if "classes" would be confusing as name for these definitions (e.g. gender).

Based on that, here is my newest proposal:

How do you associate two or more classes to a string?

+masculin
+something_else

vs

+masculin,something_else

There is one thing that I find confusing though:

stasm commented 7 years ago

(I'm going to use the # sigil in the snippets below, since #28 is close to landing.)

How do you associate two or more classes to a string?

foo = The Foo
    #feminine
    #something_else

I'd like to think of them as tags, and actually just call them that: tags. I'm sure traits, classes or properties would make sense here too. Given the syntax, I'd like to piggy-back on the fact that people know what hashtags are.

I would expect to be able to use [masculin], since I'm defining the masculine version of this string.

I understand the rationale. I think there are two ways to go forward and they're not mutually exclusive:

has-updated = { TAG(brand-name) ->
       *[masculin] { brand-name } a été mis à jour.
        [feminin] { brand-name } a été mise à jour.
    }
has-updated = { brand-name ->
       *[#masculin] { brand-name } a été mis à jour.
        [#feminin] { brand-name } a été mise à jour.
    }

We could start with the first one and add the second one as syntax sugar later on.

Pike commented 7 years ago

What would

has-updated = { brand-name ->
 *[masculin] { brand-name } a été mis à jour.
   [feminin] { brand-name } a été mise à jour. }

do? I'm concerned that adding two variants with subtle difference would add more confusion than help?

stasm commented 7 years ago

It would try to match masculin then feminin against the value of brand-name, fail, and fall back to the variant marked with *.
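A rough Python sketch of that behaviour, under the assumption that a select-expression simply compares the selector's string value against each variant key. `select` is a hypothetical helper, not part of any Fluent implementation.

```python
def select(selector_value, variants, default_key):
    """Return the variant whose key equals the selector value, else the default."""
    # dict.get falls back to the default variant when no key matches.
    return variants.get(selector_value, variants[default_key])


brand_name = "Firefox"
variants = {
    "masculin": f"{brand_name} a été mis à jour.",
    "feminin": f"{brand_name} a été mise à jour.",
}

# Neither key equals the value "Firefox", so the default variant is used.
result = select(brand_name, variants, default_key="masculin")
print(result)  # Firefox a été mis à jour.
```

The output happens to be correct here only because the default is the masculine variant. A feminine brand name would silently get the same string.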

stasm commented 7 years ago

After a lot more thinking: I like @Pike's proposal in https://github.com/projectfluent/fluent/issues/7#issuecomment-279444551 the most. I realized that I don't see a use-case for matching against the values of messages. Doing so would make the translation not portable. If a language has special rules for nouns starting with a vowel, it's much better to match a hashtag vowel than the literal value Aurora. The latter breaks for any other brand name.

zbraniecki commented 7 years ago

Just as a mental check, does it mean that we're 100% sure that we will never want to match against the value?

It seems to me like we won't, but I want us all to think it through explicitly because if we implement what :stas is proposing we will never have an intuitive way to do that :)

stasm commented 7 years ago

Thanks, @zbraniecki, for asking. If we ever want to change our mind, we can implement a new approach inside of how variant keys match against the selector. If it's a Message, we can first look into its tags and then fall back onto its value. Or we can provide functions that allow the user to be more specific: VALUE(brand-name) or similar.

That said, I doubt that we'll want or need to do that. Famous last words?
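The tags-first, value-second lookup described above could look roughly like this in Python. This is only a sketch of the idea: the `Message` class and `select` function are hypothetical names, and the resolver's real data model is not specified in this thread.

```python
from dataclasses import dataclass


@dataclass
class Message:
    value: str
    tags: frozenset = frozenset()


def select(message, variants, default_key):
    """Match variant keys against the message's tags first, then its value."""
    for key, text in variants.items():
        if key in message.tags:
            return text
    # No tag matched: fall back to comparing against the string value.
    if message.value in variants:
        return variants[message.value]
    return variants[default_key]


gender = {"męski": "został zaktualizowany.", "żeński": "została zaktualizowana."}
aurora = Message("Aurora", frozenset({"żeński"}))
firefox = Message("Firefox")  # no tags defined

print(select(aurora, gender, "męski"))   # została zaktualizowana.
print(select(firefox, gender, "męski"))  # został zaktualizowany.
```

Explicit helpers such as VALUE(brand-name) would then only be needed when a tag and a value happen to collide.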

zbraniecki commented 7 years ago

The last proposal is my concern. If we end up having a use case, and if that use case ends up being more common than this one, we'll end up with an API that makes the wrong thing easy.

If we try to have a smart API (check for tags, check for values), then it sounds like it'll work well.

I assume we won't allow attributes and tags on the same Message, right?

stasm commented 7 years ago

The last proposal is my concern. If we end up having a use case, and if that use case ends up being more common than this one, we'll end up with an API that makes the wrong thing easy.

I see what you mean. I think we could make the no-syntax variant be a smart one, and then expose VALUE and TAGS helpers. But only if we see a need for that.

If we try to have a smart API (check for tags, check for values), then it sounds like it'll work well.

+1

I assume we won't allow attributes and tags on the same Message, right?

Yes, correct. The rationale is that messages with tags are supposed to be interpolated into other messages. If they need to be displayed in the UI which requires an attribute, a new message can be created for that purpose and it can reference the message with tags.
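As an illustration of that rule, a validator could reject any message that carries both. The Python sketch below is an assumption about how such a check might look. `check_message` and the message IDs are made up for the example.

```python
def check_message(msg_id, tags=(), attributes=None):
    """Reject a message that has both tags and attributes (rule assumed from this thread)."""
    attributes = attributes or {}
    if tags and attributes:
        raise ValueError(f"{msg_id}: tags and attributes cannot be combined")
    return True


# A tagged message meant to be interpolated into other messages: accepted.
check_message("brand-name", tags=("masculin",))
# A separate UI message whose attribute references the tagged one: accepted.
check_message("about-button", attributes={"title": "{ brand-name }"})

try:
    check_message("brand-name", tags=("masculin",), attributes={"title": "Firefox"})
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```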