unicode-org / inflection

code, data and documentation related to handling inflection problems

Support making words definite, indefinite or construct #6

Open grhoten opened 4 months ago

grhoten commented 4 months ago

Support should be added to make a word definite, indefinite or construct. The construct form is a discussion point for Semitic languages like Hebrew or Arabic.

Here are some examples:

English

Spanish

French

Swedish

| case | singular & indefinite | singular & definite | plural & indefinite | plural & definite |
|---|---|---|---|---|
| nominative | katt | katten | katter | katterna |
| genitive | katts | kattens | katters | katternas |

BrunoCartoni commented 4 months ago

Are there any specific messages that could benefit from such a mechanism?

grhoten commented 4 months ago

I recommend reviewing this UTW video called Automatic Grammar Agreement in Message Formatting. In languages that frequently gender their nouns, the definite and indefinite article varies a lot, and it depends on the grammatical properties of the noun or adjective adjacent to the article. So if I ever want to say "The ${device} is on", knowing how to put the definite article in front of the device is very important, especially when the vocabulary of "device" is significantly large or provided by the user. For a language like Swedish, you don't add an article, you inflect the word.
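To make the contrast concrete, here is a minimal sketch, assuming a tiny hand-written lexicon. The dictionaries, the `definite` function and the language codes are illustrative assumptions, not the API of any inflection library. Spanish chooses an article from the noun's gender, while Swedish looks up an inflected form (taken from the table in this issue).

```python
# Toy lexicon; the entries and names here are illustrative assumptions.
SPANISH_GENDER = {"televisor": "masculine", "lámpara": "feminine"}
SWEDISH_DEFINITE = {"katt": "katten"}  # singular nominative, from the table in this issue


def definite(word: str, language: str) -> str:
    """Return the singular definite form of `word` for the given language code."""
    if language == "es":
        # Spanish: the article depends on the noun's gender.
        # (This ignores exceptions such as "el agua".)
        article = "el" if SPANISH_GENDER[word] == "masculine" else "la"
        return f"{article} {word}"
    if language == "sv":
        # Swedish: definiteness is marked on the word itself.
        return SWEDISH_DEFINITE[word]
    raise ValueError(f"unsupported language: {language}")


print(definite("lámpara", "es"))  # la lámpara
print(definite("katt", "sv"))     # katten
```

With a large or user-provided vocabulary, the lookup tables would have to come from a real lexicon or from rules, which is the part this issue is asking for.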

BrunoCartoni commented 4 months ago

Does it mean that the person who authors the message in the first place would need to write "${the device} is on"?

macchiati commented 4 months ago

I suspect that in English it's an elision for "turned on", a preposition in a phrasal verb.

grhoten commented 4 months ago

> does it mean that the person who authors the message in the first place would need to write: " ${the device} is on" ?

Sorry for the confusion. Here are two ways that I'm currently aware of to support this specific case. I'm using Spanish in my example.

  1. ^[El %@](inflect: true)
  2. ${device.definite}

The "%@" in the first example is the variable name, using Markdown and JSON syntax. The "device" in the second example is the variable name, using UEL (Unified Expression Language) syntax. I don't have a proposed solution for Unicode's MFWG syntax, and I think mapping the concept into syntax should be a separate topic.
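As a rough illustration of the second option, the sketch below resolves `${name}` and `${name.property}` placeholders against Python objects. The `Noun` class, the `render` function and the precomputed `definite` attribute are hypothetical names for this example; a real inflection engine would compute the form instead of storing it.

```python
import re
from dataclasses import dataclass


@dataclass
class Noun:
    """Hypothetical value object carrying precomputed forms of a noun."""
    bare: str
    definite: str

    def __str__(self) -> str:
        return self.bare


# Matches ${name} or ${name.property}.
PLACEHOLDER = re.compile(r"\$\{(\w+)(?:\.(\w+))?\}")


def render(template: str, variables: dict) -> str:
    """Replace ${name} and ${name.property} with values from `variables`."""
    def substitute(match: re.Match) -> str:
        value = variables[match.group(1)]
        prop = match.group(2)
        return str(getattr(value, prop) if prop else value)
    return PLACEHOLDER.sub(substitute, template)


device = Noun(bare="televisor", definite="el televisor")
print(render("${device.definite} está encendido", {"device": device}))
# el televisor está encendido
```

Sentence-initial capitalization and agreement of the rest of the sentence are left out of this sketch.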

macchiati commented 4 months ago

In MF2, #2 would be something like:

{$device definiteness=definite}

BrunoCartoni commented 4 months ago

Thanks for the clarification!

This is probably a bit off-topic, but how can we ensure that message authors (i.e. probably developers) use the correct syntax?

macchiati commented 4 months ago

MF2 and other systems will detect that the syntax is incorrect, i.e., #1 and #2 are disallowed in MF2.

BrunoCartoni commented 4 months ago

Just to be sure: if the message author writes " the ${device} is on" (instead of " ${device, state=definiteness} is on"), is there a way to detect that the message is ill-formed?

grhoten commented 4 months ago

> is there a way to detect that the message is ill-formed?

Can you clarify what you mean?

  1. If you mean that state=definiteness was not used for a language that would benefit from such syntax, I don't think that's within the scope here. That seems like a lint/static analyzer topic. I would prefer to talk about how to make it possible instead of worrying about how authors are not benefiting from such functionality. I don't consider it ill-formed in such a situation. It may be worth a warning in the message formatting framework.
  2. If you mean that the device variable is already definite, say it was named "The light", that's easy to detect, and it can be left as is instead of being turned into "The The light".
  3. If you mean that the device variable is already definite through other styles, say it was named "My light", and you wanted to change it to "Your light", or you didn't want to turn it into "The My light", that's a harder topic. In that case, it's not about it being ill-formed. It's about grammatical correctness. I'm fine with being aware of such situations. At a certain point, I'd rather defer handling more complex messages to a future date. I just want to handle the simple example in this issue.
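The detection in points 2 and 3 could look roughly like the sketch below for English. The determiner list and the `make_definite` function are assumptions for illustration only. It only avoids stacking an article on a name that already starts with a determiner, and it does not attempt the harder rewrite from "My light" to "Your light".

```python
# Illustrative, incomplete list of English determiners that already make a name definite.
ENGLISH_DETERMINERS = ("the", "my", "your", "his", "her", "its", "our", "their")


def make_definite(name: str) -> str:
    """Prefix `name` with "the" unless it already starts with a determiner."""
    words = name.split()
    first_word = words[0].lower() if words else ""
    if first_word in ENGLISH_DETERMINERS:
        # Already definite: leave it as is rather than produce "the The light".
        return name
    return f"the {name}"


print(make_definite("light"))      # the light
print(make_definite("The light"))  # The light
print(make_definite("My light"))   # My light
```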

BrunoCartoni commented 4 months ago

Sorry for not being clear!

My question comes from some discussions with translators at Google. They often complain that the original message is not formatted the right way and that they cannot change it (maybe this is specific to Google, not sure).

So if they receive a message like English: "Welcome", they cannot produce something like {female {Benvenida}, male {Benvenido}, etc.}.

But maybe this is just a limitation on their side, and we should assume that translators can always modify the syntax?

grhoten commented 4 months ago

That’s a valid issue, but I consider it to be an issue specific to the message format framework, outside the concept of inflection. I don’t believe that the MF2 framework has a connection to the pronoun information, nor a concept for it.

The three other frameworks that I’ve been involved with have varying degrees of automatic access to the pronoun information. One requires the message author to adopt the framework extension, which can cause a translator communication issue. The other two can usually inflect anything without developer intervention, but various levels of mistakes by developers can still happen.

I’d prefer the inflection engine to be separate from the message format syntax, and I’d prefer to keep message format adoption issues separate from this topic, which is just adding the ability to apply a specific type of definiteness to a word or concept.

macchiati commented 3 months ago

We want the inflection information to work for multiple clients, including but of course not limited to MF2.0.

Going back to Bruno's question:

> Just to be sure: if the message author writes " the ${device} is on" (instead of " ${device, state=definiteness} is on"), is there a way to detect that the message is ill-formed?

MF2.0 is still in development, especially the inflection bits, so caveat lector.

Say the English is:

The {$device} is on.

In general, it is the localization software that gives translators access to the message. I think the idea is that certain option values (like definiteness and case) will be translatable, so that when translating into German, the translator could delete the redundant article text and add the state option, replacing the message pattern with something like the following:

{$device state=definite} ist eingeschaltet.

That would thus handle:

Das Gerät ist eingeschaltet.
Die Maschine ist eingeschaltet.
Der Trockner ist eingeschaltet.

Now, if $device could be plural, the normal mechanism would be the following. Remember, the translator will not see the syntax; it should be presented in a much friendlier way.

English

.match {$deviceCount :number}
one {{The {$device} is on.}}
* {{The {$device} are on.}}

German

.match {$deviceCount :number}
one {{{$device state=definite} ist eingeschaltet.}}
* {{{$device state=definite} sind eingeschaltet.}}

Now, if the gender of the device matters (which it does in many languages), then the localization software would expand the message as follows, so there would be four variant sub-messages that would need to be translated. In order for this expansion to occur, we'd have to supply the information that arbitrary objects can be masculine or feminine in French.

French

.match {$deviceCount :number}{$device :gender}
one feminine {{{$device state=definite} est allumée.}}
one * {{{$device state=definite} est allumé.}}
* feminine {{{$device state=definite} sont allumées.}}
* * {{{$device state=definite} sont allumés.}}

(Forgive my French.)
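The selection over two selectors in the French example above can be sketched as follows. This is a simplification, not MF2's actual matching algorithm: it returns the first variant whose keys all match, so variants must be listed most specific first. The `select` function is a hypothetical name, the selector values are assumed to be already resolved to strings, and `{device}` is a Python format placeholder standing in for `{$device state=definite}`.

```python
from typing import Sequence

# Variants copied from the French example, most specific first.
FRENCH_VARIANTS = [
    (("one", "feminine"), "{device} est allumée."),
    (("one", "*"), "{device} est allumé."),
    (("*", "feminine"), "{device} sont allumées."),
    (("*", "*"), "{device} sont allumés."),
]


def select(variants, selector_values: Sequence[str]) -> str:
    """Return the pattern of the first variant whose keys all match.

    A key matches when it equals the selector value or is the wildcard "*".
    """
    for keys, pattern in variants:
        if all(key in ("*", value) for key, value in zip(keys, selector_values)):
            return pattern
    raise LookupError("no variant matched; a catch-all ('*', '*') variant is expected")


pattern = select(FRENCH_VARIANTS, ("other", "masculine"))
print(pattern.format(device="Les appareils"))  # Les appareils sont allumés.
```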

Now, that is if MF2 mostly follows MF1 selection. If it allows for inflection engines that can recast literal data, then this could be simplified down to something like:

{$device state=definite} {|est allumée.| :agree gender=$device plural=$device}

{|est allumée.| :agree gender=$device plural=$device} just means

  1. take the literal text inside of |…|, in this case "est allumée."
  2. change that literal text so that it agrees in gender and plural category with $device.
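The two steps above can be sketched with a toy lookup. Since :agree is not defined anywhere, everything here is an assumption: the `agree` function, the `PARADIGMS` table and its single hand-written entry merely stand in for whatever an inflection engine would actually do.

```python
# One hypothetical paradigm: every form of the same predicate,
# keyed by (plural category, gender).
PARADIGMS = [
    {
        ("one", "masculine"): "est allumé.",
        ("one", "feminine"): "est allumée.",
        ("other", "masculine"): "sont allumés.",
        ("other", "feminine"): "sont allumées.",
    },
]


def agree(literal: str, gender: str, plural: str) -> str:
    """Recast `literal` so that it agrees with the given gender and plural category."""
    for paradigm in PARADIGMS:
        # Step 1: recognise the literal text as one form of a known predicate.
        if literal in paradigm.values():
            # Step 2: return the form that agrees with the noun.
            return paradigm[(plural, gender)]
    return literal  # unknown text is left unchanged


print(agree("est allumée.", gender="masculine", plural="other"))  # sont allumés.
```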

Note that this is all quite speculative. The syntax and basic functions of MF are in place, but not the extensions for grammar. So the :agree function is not at all defined; that is just an illustration of how it might work, as is :gender above.

An interesting point is that the simpler one-line message also requires more knowledge on the part of the translator and/or the translation software.


Now, the interesting (read "hard") bit is where categories in English (assuming that is the source) don't match the categories in French. For example, suppose that you have to say

Les appareils sont allumés.
Les machines sont en marche.

That gets tricky.