unicode-org / message-format-wg

Developing a standard for localizable message strings

Syntax Simplicity #48

Closed nbouvrette closed 1 year ago

nbouvrette commented 4 years ago

Is your feature request related to a problem? Please describe. Linguistic challenges are complex, and providing a simple way to solve them is itself a challenge. MessageFormat seems to have kept a certain level of simplicity, which makes adoption easier, especially by non-technical users like linguists.

Describe the solution you'd like I would like the new syntax to remain simple (at least as simple as MessageFormat today, or even simpler, if that is possible).

Describe why your solution should shape the standard A simple syntax will help both authors and linguists manipulate raw messages without having to spend too much time learning it.

There is also a limit to the complexity that linguists will be willing to learn, especially if we are aiming for global adoption. Linguists are language experts, not engineers.

If we presume that raw syntax cannot be translated directly by linguists without the need for tools, this means that we will have to rely on other ways to get the translation done.

If the raw syntax is too complex and we have to support some sort of "linguist-friendly format", I am not too sure how this will work for some inflection problems (e.g. language specialists adding language-specific syntax).

Additional context or examples Based on personal experience I have seen linguists directly edit several existing syntaxes such as:

mihnita commented 4 years ago

Just a note: linguists don't handle the syntax directly, they use dedicated tools. So it does not matter from that side.

We should be able to map our data model (independent of syntax) to the most common data model supported by localization tools.

nbouvrette commented 4 years ago

Just a note: linguists don't handle the syntax directly, they use dedicated tools. So it does not matter from that side.

I disagree with this, and I think it would be a good topic to see how others are translating MessageFormat today.

From my experience, we decided to train linguists directly with the raw syntax and it worked really well - the main reasons for this:

Keep in mind that MessageFormat has been around for a while and is still in a state that I would consider inadequate in terms of tooling (XLIFF is similar).

The localization industry moves slowly, so if we expect to have tools to use the new syntax, we know that adoption will most likely also be slow.

I would prefer simple, linguist friendly syntax :)

mihnita commented 4 years ago

XLIFF has a lot more adoption than MessageFormat, which is not widely used even though it has been around for a long time.

The adoption is not about syntax.


The bigger difficulty is the data model. If that is too complex and has no decent mapping to the way existing localization chains represent the data, then it will not be supported. Writing an import / export filter is easy if the data models are close enough. But if they are too far apart, it would require architectural changes in the way everything else works (translation memories, leveraging, validation, UI, etc.)
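To make the filter point concrete, here is a minimal sketch (TypeScript, hypothetical names; no real TMS API is implied) of what an import filter amounts to when the data models are close: each message becomes one flat source unit, which is the shape most translation tooling expects.

interface TranslationUnit {
  id: string;      // resource key
  source: string;  // source text, with placeholders left intact or masked
  note?: string;   // optional developer comment
}

// Trivial when every message already maps 1:1 to a flat unit.
function importFilter(resources: Record<string, string>): TranslationUnit[] {
  return Object.entries(resources).map(([id, source]) => ({ id, source }));
}

Once a message stops being one flat unit (plural variants, nested selects), this trivial mapping, and everything built on top of it, breaks.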

This is the same as designing COBOL to sound very English-like in the hope that business people could do the programming. Programming is hard because it is hard, not because the program does not read like English.

Linguistically correct handling of translation is hard because human languages are messy.


The second part of the (non-)adoption is about politics. XLIFF was created to allow for localization vendor / tooling independence. So the tool vendors that had a de-facto monopoly had no interest in implementing it properly. I know of one major tool that had (has?) its own extension to represent basic stuff like comments, even though XLIFF has a standard way to do it. Major clients (without naming names) didn't help either, by creating their own tools that supported some of the advanced features they wanted. And that created a different kind of lock-in.


This is why I am (sometimes a bit too?) aggressive in rejecting complicated features that will prevent adoption.

Saying "translators should be able to handle it directly" means we give up on adoption from the get go. We design it so it will not be adopted.

nbouvrette commented 4 years ago

The adoption is not about syntax.

I agree it's not all about the syntax - I think it's a combination of the following:

Writing an import/export filter is easy, if the data models are close enough.

This is the part I'm not sure I can picture easily. We haven't talked much about continuous localization yet, but on the file format side, if we keep the current state that MF supports (which in my experience is both simple in terms of syntax and file-format agnostic, since the syntax is stored in keys), this should align well with the simplicity side of the solution.

If you think that to adopt the new syntax we will need all users to implement some sort of import/export filter, then I would be curious to understand where this fits in a continuous localization pipeline, and also why it is needed, because it does increase complexity.

The second part of the (non)-adoption is about politics.

I think we have to be careful with the word "politics" - it can easily be used the wrong way :-)

So the tool vendors that had de-facto monopoly had no interest to implement it properly.

That's my point - if vendors are not interested, then what? Businesses use vendors... so for a new solution they are a dependency. I think it is important to either:

1) Make it interesting enough for vendors to implement at scale (very challenging)
2) Or make it simple enough to not require much from the vendors (this is what I'm proposing)

Major clients (without naming names) didn't help either, by creating their own tools that supported some of the advanced features they wanted. And that created a different kind of lock-in.

Yes, this is a reality of the localization landscape, which I think is mostly due to the vast complexity of languages and the lack of a solution that covers all needs.

Saying "translators should be able to handle it directly" means we give up on adoption from the get go. We design it so it will not be adopted.

I'm really not sure I understand what you mean here. Are you saying we cannot keep the syntax as simple as it is today with MF? Because if we can, my experience is that this will help adoption, not the opposite.

If we can't, then maybe you are seeing something I am missing?

jamuhl commented 4 years ago

XLIFF was created to allow for localization vendor / tooling independence.

Did it achieve that?

Isn't XLIFF more a catalog format than a syntax? I might not really understand why XLIFF is that important beyond trying to be an interchange format...

At locize we support the most basic parts of XLIFF for import/export, but most of the time the files we get from other vendors are not at all compatible with our tooling... Our approach is rather different from the old ones: keeping things simple and close to the runtime format to enable real continuous localization (publishing usable runtime translation files to a CDN).
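For illustration only (hypothetical keys and messages, not locize's actual schema), "close to the runtime format" means publishing something the app can load as-is:

// A flat resource object: what is published to the CDN is essentially
// what the runtime consumes, so no heavy conversion step is needed.
const en = {
  "cart.title": "Your cart",
  "cart.items": "{count, plural, one {# item} other {# items}}",
};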

Fleker commented 4 years ago

Tools are insufficient - poor integration with most major commercial TMSes

I think the poor tooling is a separate issue. While I do agree having linguists learn the syntax can be beneficial, I wouldn't want it to be a primary goal of the format. There should be better tooling in general.

mihnita commented 4 years ago

Continuous localization has nothing to do with adoption.

XLIFF was created to allow for localization vendor / tooling independence.

Did it achieve that?

Partially. If you choose the right subset of the XLIFF features then yes, it is supported.

Isn't XLIFF more a catalog format than a syntax? I might not really understand why XLIFF is that important beyond trying to be an interchange format...

I see it as more than a catalog. There is structure there. There are groups, text units, segments, placeholders. There is info about placeholder types (open / close / standalone), flags for the placeholders, and ways to map from format-specific placeholders to the xliff (language independent) ones.

With proper XLIFF support one would be able to translate ...{user}... and leverage ...%s... and ...$user... and ...{0}... without damaging anything, and would be able to leverage between formatted content, etc.
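As a sketch of that leveraging idea (TypeScript, hypothetical function; not how any actual tool implements it): if format-specific placeholders are masked to a neutral token, a translation memory can match the same sentence across formats.

// Mask {user} / {0}, %s / %d, and $user style placeholders.
function maskPlaceholders(text: string): string {
  return text
    .replace(/\{\w+\}/g, "<ph/>")        // {user}, {0}
    .replace(/%[sd]/g, "<ph/>")          // %s, %d
    .replace(/\$[A-Za-z_]\w*/g, "<ph/>"); // $user
}

// maskPlaceholders("Hello {user}!") === maskPlaceholders("Hello %s!") → true,
// so the TM can treat them as the same segment.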

In a way I think XLIFF 1.2 was too ambitious.

mihnita commented 4 years ago

Here is a pretty good document on XLIFF adoption: https://www.localizationworld.com/lwdub2014/feisgiltt/slides/Chase_Keynote.pdf

It touches on adoption, feature creep, and more.

mihnita commented 4 years ago

Writing an import/export filter is easy, if the data models are close enough.

This is the part I'm not sure I can picture easily.

This is about the way current localization tools represent data.

They are very much geared toward a 1:1 mapping. I'll list here some of the assumptions the current workflows make (partial list):

Anything outside this view of the world breaks functionality. For example plurals.

The source (English) is:

Russian needs to send back 4 messages. Things break.
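To make the mismatch concrete (an illustrative example, not the one originally posted): an English source with two plural forms, such as

{count, plural, one {# file} other {# files}}

needs four CLDR plural categories in Russian:

{count, plural, one {# файл} few {# файла} many {# файлов} other {# файла}}

A workflow that assumes one source string maps to exactly one target string has nowhere to put the extra variants.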

OK, "flatten" all messages into 1 and allow translators to edit things directly (your solution, and Fluent, for more complex messages) The side effect: the rest of the tooling stops working properly.

These are data model incompatibilities. They are a lot harder to change, and they are not solved with a simple filter or with "let translators change the string as they see fit".

mihnita commented 4 years ago

Or make it simple enough to not require much from the vendors (this is what I'm proposing)

This breaks a lot of the useful features the vendors built into their tooling.

It is like saying "make things simple enough so that developers don't need an IDE, they can use a simple text editor" and "a linter / compiler will tell them if there are problems"

So now they lose not only syntax highlighting, but also refactoring, suggestions ("intellisense" or whatever you want to call it), and more. (Yes, translators also do "refactoring" of sorts, with global terminology updates.)

This just pushes back the CAT tools to the "dumb text editors" stage.

That also means lower linguistic quality (forcing translators to focus on technical aspects). And slows them down. A lot. In a world where translators are paid by (source) word count this matters to them. A lot.

Imagine someone tells you "starting tomorrow you can't touch Eclipse / IntelliJ / Visual Studio, you develop software in Notepad"

zbraniecki commented 4 years ago

Imagine someone tells you "starting tomorrow you can't touch Eclipse / IntelliJ / Visual Studio, you develop software in Notepad"

I agree that we should leverage the power of CAT tools, but if we're designing for the Open Web, and something that we aim to propose as a standard, then the reverse of your claim is also true:

Imagine if someone told you that in order to write in a new programming language for the Web you need to use Eclipse, because there's no way for you to work with it in anything else.

The principle of least power is described by W3C - https://www.w3.org/2001/tag/doc/leastPower-2006-01-23.html - and I believe we should aim for our system to work with notepad.

Then, if you plug some tooling, it should work better.

And if you use a CAT tool it should work amazingly. And we should make it easy to develop CAT tool integrations, command-line tooling, etc. But I'd like to avoid a future in which our outcome is basically unreadable/unwritable without sophisticated CAT tools. I don't think anyone is advocating for it, but in the push for "CAT tools will fix X" we may end up there nonetheless if we're not careful.

mihnita commented 4 years ago

I think I did not touch all points:

Good integration with existing tools - considering all programming languages and TMSes (I think flexibility is key)

This is where there is friction. Programming languages are flexible, and value flexibility. TMSes are a lot less flexible.

If you think that to adopt the new syntax we will need to have all users implement some sort of import/export filter, then I would be curious to understand where this fits in a continuous localization pipeline? And also why this is needed because it does increase complexity.

This I don't understand. There will be a filter, no matter what. Any tool we use (unless it is a 100% dumb text editor) needs to "understand" a syntax and convert from a file format to an internal, in-memory representation. Unless one implements a tool for our format ONLY, and the internal data structures map exactly to our data model, they will need a filter.

It is like asking "why does an image editor need a filter for GIF files". Unless the internal model assumes palette-based 256 colors and on/off transparency (not degrees of transparency), you need a filter. I see no connection between continuous localization and the need for a filter, and I don't see how having one makes things more complicated.

I'm really not sure I understand what you mean here, are you saying we cannot keep the syntax as simple as it is today with MF? Because if we can, my experience is that this will help adoption, not the opposite.

The current MF syntax is not well supported. It got some adoption, but not really enough for something that has been around for about 15 years. Or rather, it got developer adoption, but not TMS support.

And yes, I say that we can't keep the syntax as simple as it is today. Because we want to add TONS of extra new features. Inflections are hard, and will make things harder. Again, nothing to do with the syntax. They are hard because human languages are hard.

mihnita commented 4 years ago

The principle of least power is described by W3C - https://www.w3.org/2001/tag/doc/leastPower-2006-01-23.html - and I believe we should aim for our system to work with notepad.

I 100% agree, if this means "engineers can use Notepad"

The main thing is where we draw the line. How much quality and productivity do we want to sacrifice, how much training do we need to provide, and how error-tolerant do we want to be, before we say "translators should be able to..."?

Reality is that most designers are unable to properly write HTML + CSS without tools. Some do.... But I don't think anyone expects translators to translate HTML without tools. And I am pretty sure no company represented here does that. So why design for something that nobody will ever do?

zbraniecki commented 4 years ago

But I don't think anyone expects translators to translate HTML without tools.

Professional translators, probably not.

And I am pretty sure no company represented here does that.

I suspect you're right.

So why design for something that nobody will ever do?

Here's the important point I'm trying to make - we're not designing it for ourselves. Or, if we are, we should be very clear about it and claim that we're designing a localization system for major organizations and prominent stakeholders with budgets large enough to sustain l10n/intl engineering team and tools in house.

The Web, in its ideal form, is intended to lower the entry barrier and allow anyone with a notepad to write HTML, then add some CSS and maybe some JS, host it under their IP and anyone else in the World, with any browser, should be able to open it.

I know, we can argue how far we're diverging from the ideal these days, with a small number of big players owning a substantial amount of traffic and playing the golden-cage game with the users, but open standards are intended, as far as I'm aware, to target the Notepad user.

So, just like HTML, JS and CSS cannot be made more convenient for Google, Apple, Facebook and Twitter at the cost of raising the entry barrier, I believe we should aim for the Notepad user to be able to add L10n to the HTML/JS/CSS model.

It's important for me to stress that I'm not advocating that we set our success criteria for successful, continuous localization at scale at serving the "user with a notepad" scenario. I am invested in helping us design a system that can aid tool production, CAT tool UX, and other aspects that will disproportionately help major localization companies and their customers. But I do believe that one of the challenges of designing an open web standard is that you can't just focus on this user group. We can build an ivory tower on top of our system; all big deployments will, and the foundation should hold. But we should ensure that sophisticated systems, toolchains, and UXes are not required to localize your JS app or your website.

romulocintra commented 4 years ago

Agree with @zbraniecki in

Here's the important point I'm trying to make - we're not designing it for ourselves. Or, if we are, we should be very clear about it and claim that we're designing a localization system for major organizations and prominent stakeholders with budgets large enough to sustain l10n/intl engineering team and tools in house.

I think we must design for the simplest use case and for the user with the fewest resources, while keeping in mind all the tooling and scaling cases that are needed.

IMHO this mindset must be one of the drivers. It has been since the beginning, when we only wanted to bring MF to Browserland. Now we are trying to go a little deeper and wider; this down-scaling or up-scaling in the design is natural, but we must try to balance it.

mihnita commented 4 years ago

So, just like HTML, JS and CSS cannot be made more convenient for Google, Apple, Facebook and Twitter at the cost of raising the entry barrier, I believe we should aim for the Notepad user to be able to add L10n to the HTML/JS/CSS model.

I don't think anyone advocated "more convenient for 'the big X' players" or "something that CAN'T be edited in Notepad"

If we make it possible for programmers to create these kinds of messages in Notepad, is that good enough? Because as long as you train a translator enough, they will be able to do the same. There is no barrier to entry.

The think that "bothers" me are requirements like "especially by non-technical users like linguists" or "editable by translators without any tools". Without defining that "translators" are.

It is like arguing ".svg is open, because it is plain text, and is editable by designers". Really? Is it? How many designers create SVGs without tools? Or animated GIFs?

"Notepad" is a red herring. The problem is the complexity of the data that we need to represent, not text of not.


"Professional translators, probably not."

Maybe you touched here on something that probably we all know, and probably should make it more clear. We most likely "color" the meaning based on our own experience.

But there is no such thing as "translator"

There is a huge difference between a techie who decides to translate an open source project (because it is a nice tool and wants to give back) and a translator who pays the bills from the translation work.

We should be able to support both cases. The first is relatively simple: if a developer was able to write the English message, a "techie" would be able to do the same for her / his language.

But if we design something that is not "localization tool friendly" then we have no adoption. And "it is plain text" or "simple syntax" is not what makes it localization tool friendly.

nbouvrette commented 4 years ago

I think we all agree that simplicity is preferred over complexity.

I would like to call out a few things for the sake of alignment:

Now I completely understand that by adding new features there will be trade-offs on simplicity.

I do believe that if we hit a level of complexity that makes it non-Notepad friendly, we will clearly have to have solutions around:

I don't have these answers, but all I know is that the simpler the solution, the fewer questions we will have to answer.

stasm commented 4 years ago

And yes, I say that we can't keep the syntax as simple as it is today. Because we want to add TONS of extra new features. Inflections are hard, and will make things harder. Again, nothing to do with the syntax. They are hard because human languages are hard.

I think this is related to the design principles discussion which I'd like to start in #50. In particular, the question about how "computational" vs. how "manual" we want the standard to be.


On one end of the spectrum we have a data model which encodes a wide range of linguistic features, allowing grammatically correct interpolations with proper spelling. Every noun, adjective and verb is defined somewhere else with all possible inflections, plurals, capitalizations and articles.

"Close {$object article:definite capitalization:lower}"
→ Close the tab

"Close {$object article:indefinite capitalization:title}"
→ Close a Tab

This model comes with inherent complexity. It potentially allows a lot of new and interesting features, but its compatibility with existing data models is unknown.


On the other end ("manual") of the spectrum, the data model is mostly a simple store of messages written out as full sentences. When you need a new variant, you create a new message. Some flexibility is introduced by means of many-to-many relationships, like MF's plural, but mostly, it's just full sentences.

"Close the tab"
"Close a Tab"

This model could have a fairly simple data model and syntax. Interestingly, it "supports" a lot of linguistic complexity by means of plain text, just not in a way that allows new messages to be produced computationally.

(Some messages will still require dynamic features, like variable interpolations, so things will never be as simple as just plain strings.)


The computational model is great for constructing sentences from smaller pieces of highly dynamic data, when it's impossible to compile a list of all possible combinations. A good example is voice assistants like Siri. It's also good for enforcing consistency between translations.

The manual model works well for UI where most messages are static. It's simple to translate and it's simple to translate correctly. OTOH, it leads to many more messages and consistency needs to be enforced through external tooling like translation memory. But ultimately, it's also more likely to be compatible with the lowest common denominator of data models currently used in the LSP industry.

mihnita commented 4 years ago

you can easily disable placeholder protection and/or any other blockers in most commercial TMSes (trust me, I did it!)

Of course you can :-) But that means you throw away a lot of the useful features that TMSes have.

I did try the current MessageFormat syntax, with professional linguists (on a large scale), with "Notepad" (and commercial translation TMSes) - and it works quite well

Then you've been lucky to have relatively simple messages. I've seen developers asking for help debugging messages with nested plural / gender / select.

Do we need TMS support?

Absolutely!

How will this work in a continuous localization setup, where engineers often see localization as a utility

I am not really sure how it matters. In most places I've been, the systems used are as easy as "submit your files to version control, and in X days you have back X languages". And where that was not the case, I've set up this kind of system myself. It is completely "invisible" to developers how localization is done.

mihnita commented 4 years ago

Anyway...

I am absolutely not arguing that we should design something complicated. We all agree that simple is better.

And I agree that we want something that developers can edit directly with a simple text editor. If that happens to be good enough for a translator with training (how much?), that's great.

But we must be able to export it to standard TMS tools.

nbouvrette commented 4 years ago

I did try the current MessageFormat syntax, with professional linguists (on a large scale), with "Notepad" (and commercial translation TMSes) - and it works quite well

Then you've been lucky to have relatively simple messages. I've seen developers asking for help debugging messages with nested plural / gender / select.

We have seen the same situations but typically we will ask developers to simplify their messages.

There is a thin line between good use of MessageFormat and usage that can make it impossible to localize.

This is typically resolved by having the ability to have a dialogue between linguists and engineers and also by having training material available for engineers.

Do we need TMS support?

Absolutely!

Then we will have to figure out why MessageFormat is still not sufficiently supported today and how to remediate this situation.

How will this work in a continuous localization setup, where engineers often see localization as a utility

I am not really sure how it matters. In most places I've been, the systems used are as easy as "submit your files to version control, and in X days you have back X languages". And where that was not the case, I've set up this kind of system myself. It is completely "invisible" to developers how localization is done.

I think it does matter because if you want to make a solution available at scale, we cannot expect all companies to build custom solutions to support it.

Now we could expect the TMS to handle any sort of "conversion" if we need to - but my question about adoption remains.

And as I mentioned during our last call, the more we discuss, the more I wonder which new scenarios should be supported by the new syntax. Most linguistic problems I have seen so far seem very tricky to support from a syntax perspective.

For example, the indefinite article (a/an) in English could probably be an easy one to add to the syntax. The rule is relatively simple.

Now if you try the same in French (le/la/les), you will need to know the gender and the plural form of the target word or group of words. Does this mean that the syntax would propose a data model for this, or would it also provide a "dictionary"?

And then, if it's only the data model, do we know how many people will want to use such features, what common problems this will solve, and how much it would cost a company to solve this at scale?

Imagine hundreds of thousands of geographic entities that need to have this data for one language. How many companies can afford this?
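A sketch of the contrast (TypeScript, hypothetical helpers): the English side is nearly rule-based, while the French side needs per-noun data that someone has to author for every entity.

// English indefinite article via a crude orthographic rule. Note the real
// rule is phonetic ("an hour", "a user"), so even this needs exception data.
function englishIndefinite(noun: string): string {
  return /^[aeiou]/i.test(noun) ? `an ${noun}` : `a ${noun}`;
}

// French definite article: no rule recovers gender from the word form
// ("musée" is masculine, "gare" is feminine), so every noun needs data.
const frNouns: Record<string, { gender: "m" | "f" }> = {
  "musée": { gender: "m" }, // le musée
  "gare": { gender: "f" },  // la gare
};
function frenchDefinite(noun: string, plural = false): string {
  if (plural) return `les ${noun}s`; // ignoring irregular plurals, elision (l'), etc.
  return frNouns[noun]?.gender === "f" ? `la ${noun}` : `le ${noun}`;
}

Multiply that dictionary by hundreds of thousands of geographic entities, and by every language with grammatical gender, and the cost question above becomes clear.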

I think we need a backlog :)

mihnita commented 4 years ago

We have seen the same situations but typically we will ask developers to simplify their messages.

At times there is no way to simplify the message; the structure of the language is complex. This is what I am trying to say, but I don't seem to manage. In fact, developers have a tendency to create simple messages, and then they need to make them more complex to work in other languages. Usually because English is relatively simple (in some respects).

I think it does matter because if you want to make a solution available at scale, we cannot expect all companies to build custom solutions to support it.

This is why I keep saying that we need a standard mapping to XLIFF. And when you asked "Do we need TMS support?" I answered "Absolutely!"

I think that the rest of the message (a/an; la/le/les/l'; etc) belongs in a different issue? Probably the one about inflections?

I agree that these are hard problems, but they are not about syntax.

nbouvrette commented 4 years ago

At times there is no way to simplify the message; the structure of the language is complex. This is what I am trying to say, but I don't seem to manage.

If you have an example maybe it would help picture a bit better?

The way I picture this, the syntax should be used at the sentence level, since most TMSes do segmentation at that level as well.

The most extreme (legitimate) case I can imagine would be a sentence with a variable that requires gender (typically a user), and 2 other variables with plurals. But how many times does this scenario occur? And, to be honest, I don't think any current TMS support I have seen could help with this. The solutions I could see around these types of extreme scenarios would be:

But then again, if adoption is our priority, I know which one I would prefer, especially if this scenario accounts for 0.001% (guesstimate here) of cases.

This is why I keep saying that we need a standard mapping to XLIFF. And when you asked "Do we need TMS support?" I answered "Absolutely!"

Maybe I'm not familiar enough with XLIFF to see how this would work - are you proposing that the base storage format would be directly XLIFF? Otherwise, this is where the continuous localization topic (conversion scripts?) comes in.

And if this is what you are proposing, then we need to make sure that the XLIFF features you have in mind are also supported broadly by most TMSes; otherwise, we are back to square one.

I think that the rest of the message (a/an; la/le/les/l'; etc) belongs in a different issue? Probably the one about inflections?

I agree that these are hard problems, but they are not about syntax.

To me, I was picturing that inflection could be solved using the syntax, which is why I brought this topic back here. Here is an example of what I had in mind:

{length, singular {{#, indefiniteArticle} minute walk.} plural {{#, indefiniteArticle} minutes walk.}}
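Presumably (my reading of the intent) this would render as "A 5 minute walk." but "An 8 minute walk." and "An 11 minute walk.": a/an follows the pronunciation of the rendered number ("eight", "eleven"), not its spelling, so an indefiniteArticle function would need to know how the formatted value is read aloud.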

Maybe you have something else in mind?

asmusf commented 4 years ago

On 3/1/2020 11:19 AM, Nicolas Bouvrette wrote:

The most extreme (legitimate) case I can imagine would be a sentence with a variable that requires gender (typically a user)...

Why do you say that natural gender is more of a problem than grammatical gender?

nbouvrette commented 4 years ago

Why do you say that natural gender is more of a problem than grammatical gender?

My presumption is it's a more common problem but I might be wrong. For example, it's very common for applications to have users, but maybe less to have the user specify their gender (other than very specific applications).

I'd like to hear back from the group if they have examples where they require grammatical gender - I have a few in the space I work in but we are not using ICU to solve these problems. Depending on the size of the dataset, solving these problems can be quite expensive which is also why I presume they are less commonly solved as well.

asmusf commented 4 years ago

On 3/1/2020 7:03 PM, Nicolas Bouvrette wrote:

Why do you say that natural gender is more of a problem than grammatical gender?

My presumption is it's a more common problem but I might be wrong. For example, it's very common for applications to have users, but maybe less to have the user specify their gender (other than very specific applications).

If you have a message where a parameter is a noun with grammatical gender, but the message also contains an adjective or article and you want the latter to track plurals, then they also need to track gender in many languages.

You may be able to avoid this in some cases by making the parameter cover the entire noun phrase, or writing messages that attempt to circumvent this problem. But I thought that the current effort was partially intended to avoid such defensive designs.
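For example, in French: "un nouveau fichier" vs. "une nouvelle photo", and in the plural "de nouveaux fichiers" vs. "de nouvelles photos". The article and the adjective both change with the noun's gender and number, so a placeholder that substitutes only the noun cannot produce a correct sentence.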

On a more general level, I wonder whether it wouldn't be useful to have a reasonably exhaustive set of "standard examples" that we expect the syntax to cover. It's really not possible for anyone to understand all the requirements in the abstract, because not all of us have the same working experience with types of messages and types of languages.

With a set of canonical examples (together with pseudo-translations into English that reveal the relevant constraints), it's much easier for anyone to convince themselves that a proposed syntax is as simple as possible, but no simpler.

jamuhl commented 4 years ago

I think we are now at an important point in this project where we really should decide on the scope of it.

When @romulocintra reached out to me, for me the point was bringing the best-fitting i18n format to the browser as something like Intl.messageformat (be it the messageformat or the already extended fluent format). Personally (even maintaining i18next with its own format) I don't care which format it will be in the end... anything is better than having nothing, like right now. Because agreeing on one format will automatically lead to higher adoption of that format.

I understand - this looks like a good time to add more features to the syntax - but as I currently experience this discussion, by adding too much we will kill the format for small businesses. Not every business has the starting money to buy into a TMS or build up an in-house solution.

In my opinion, we should keep the scope of this project as small as possible:

I mean we got @zbraniecki from fluent, @longlho from react-intl and @eemeli from messageformat (and me from i18next; and I'm rather sure we could get @kazupon on board from vue-i18n). Just a guess, but with those JS libs we cover over 90% of the web/js projects out there. I'm no linguist and have just an idea of how complex some languages can be - but I can at least say those are not too often a problem for the users of my lib.

nbouvrette commented 4 years ago

If you have a message where a parameter is a noun with grammatical gender, but the message also contains an adjective or article and you want the latter to track plurals, then they also need to track gender in many languages.

Fully agree that solving this is very complex, but this is why I keep asking "What is the size of the data".

As you mentioned, for small datasets there are ways around this:

  • Include articles with the data
  • Have full sentences that will cover all the different datasets
  • Change the sentence to make it simple (the old ":" trick before a list of items)

Of course, all these strategies do not scale well - but which companies out there have the big-data issues, and do we need to provide a full solution for them or simply the foundations to help them get there?

On a more general level, I wonder whether it wouldn't be useful to have a reasonably exhaustive set of "standard examples" that we expect the syntax to cover.

+100

I think documenting current issues with potential real use cases and solutions (can be pseudo-syntax) would help determine the priorities. I am tempted to start a new GitHub issue on this, but I wonder if this is the right tool for such an effort.

adding too much we will kill the format for the small business.

Fully agree on this as well - you can have the best solution, but if it's too complex, it will surely be used by a minority. I think everyone here wants to provide a solution at scale for common i18n problems. Now, do we know what those are?

DavidFatDavidF commented 4 years ago

Hi all, I would like to second Mihai's sentiment that the data model needs to be mappable to XLIFF.

You can argue that XLIFF doesn't have universal support but - based on my commercial localization (large scale) experience - it is at the core of all solutions that scale (solutions built by companies such as Microsoft, Oracle, IBM, etc.). The industry is indeed extremely fragmented and immature (ever growing, with an entry threshold close to zero), and most actors in the industry are incapable of using proper processes because they are in reactive mode or worse. Nevertheless, it doesn't mean that a standardisation effort should mimic the reactive approach of the chaotic majority that doesn't scale.

Adopting proper XLIFF-compliant tooling is not too difficult, actually. As a buyer, the simplest thing to do is to produce a standard package and say in the RFP that the format is XLIFF 2.1 (XLIFF 2.0 backwards compliant), and the bidders will comply because the market is extremely competitive. Most of the services and tooling market leaders want the buyers to believe that standards are not supported, and encourage them to submit all sorts of crazy non-internationalized source formats for direct localization, because it allows them to build insane labor-intensive solutions that will lock in the buyer with them indefinitely. But generally speaking, if a buyer says "jump", they will ask "how high and how many times?". So it should be the buyers' procurement (informed by technical champions) responsibility to say "I want you to translate these XLIFF 2 packages" instead of "Train for me people that will be able to directly edit this or that sort of syntax" or "Extract text for localization from PDFs because we don't know where the source content is.."

There is strong OSS support for XLIFF (low-level libraries in Java, .NET), so the functionality doesn't need to be built from scratch, and it's especially easy to adopt the core (advanced functionality can be added later due to the modularity of the data model). All major localization providers are able to handle XLIFF 2 if required; they just don't advertise this capability because their business-level decision makers believe in lock-in rather than in standards-based interoperability. All SDL products (as of 2017) do support an XLIFF 2 roundtrip; other tools that support an XLIFF 2 roundtrip include Memsource, XTM, Lionbridge's Logoport disguised under many marketing white labels, OKAPI Ocelot (OSS), etc. Most of the leading tools don't support XLIFF extraction and merging, though, and I believe it should be the buyer's concern to extract to and merge back from XLIFF, because it is they who know their source format best.

Here is an informative spec produced by GALA that helps people build proper extractors/mergers: https://galaglobal.github.io/TAPICC/T1/WG3/rs01/XLIFF-EM-BP-V1.0-rs01.xhtml It also has code examples and counter-examples, so be sure to look at them. Section 2.4 https://galaglobal.github.io/TAPICC/T1/WG3/rs01/XLIFF-EM-BP-V1.0-rs01.xhtml#Hints will give you an idea of what sort of operations are allowed/supportable on inline codes during a localization roundtrip.

The basic idea of XLIFF is that of masking inline code/annotations/whatever artifacts devs fancied to put inside of their natural language content. The masking is done in a technology agnostic way. You can extract any sort of syntax into XLIFF and even more, the same masking data model is not tied to XML only. XLIFF OMOS TC at OASIS generalizes the XLIFF model https://www.oasis-open.org/committees/tc_home.php?wg_abbrev=xliff-omos and is working on JLIFF https://github.com/oasis-tcs/xliff-omos-jliff the JSON serialization of the same data model.
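As a rough sketch of that masking idea (abridged and hypothetical, not a complete valid document), an ICU-style message extracted into XLIFF 2 might look something like:

<unit id="greeting">
  <segment>
    <source>Hello <ph id="1" disp="{user}"/>!</source>
  </segment>
</unit>

The translator sees one protected placeholder; on merge, the tool maps it back to {user}, or to %s or $user in another source format, which is what makes the model technology-agnostic.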

Proper internationalization should of course strive to minimize the amount of code within content, but of course it's not always possible.

XLIFF also solves the structural and project management issues but I'd say this is out of scope for a message format discussion.

I think the key is to preserve the data model assumptions that make internationalization and localization possible. Whatever the agreed message format ends up being, it should be tested on XLIFF (or JLIFF) roundtrip capability.

If you create something that a linguist is supposed to edit directly, this might seem SME-friendly, but it doesn't scale. Ideally you want your format to be easily supportable by tools. But the format will not be supportable by tools if it violates the basic set of data model assumptions that Mihai outlined early in this thread.

Cheers dF



aphillips commented 1 year ago

This appears to have been addressed by the adoption of EBNF and later ABNF syntaxes. It is also a bit non-specific: it is a design principle that I think this group aspires to hold up.

aphillips commented 1 year ago

As mentioned in today's telecon (2023-09-18), closing old requirements issues.

Note: this specific issue was a topic of interest at the face-to-face and in the feedback we received in Seville and there is work on simplifying the syntax as a result.