openstreetmap / openstreetmap-website

The Rails application that powers OpenStreetMap
https://www.openstreetmap.org/
GNU General Public License v2.0

Separate translations for "zero" and "zero like numbers" #3997

Closed maro-21 closed 1 year ago

maro-21 commented 1 year ago

Translatewiki knows better how something should be translated...

translation1

Translation of the text in the red box: "Plural forms should be defined as ... This translation includes ..."

But when I remove "|zero=Brak zgłoszeń" it allows me to upload. translation2

Is it possible to disable this validator? Alternatively replace it with a warning (yellow color)?

tomhughes commented 1 year ago

Translatewiki is outside our control - if you have a problem with the way it operates then you will need to address it with them.

Nikerabbit commented 1 year ago

This is an OSM problem in my opinion, and I believe I have already reported it earlier. The zero form should not be used in English. If you look at, for example, Latvian at https://unicode-org.github.io/cldr-staging/charts/37/supplemental/language_plural_rules.html you can see that the zero form is not just the number 0.

For most languages this is not a problem, but for languages like Latvian, the translation is either wrong or it is impossible to provide translation for "no X".
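To make the Latvian case concrete, here is a minimal sketch of the CLDR plural categories for Latvian, restricted to integers (the full CLDR rule also has clauses for decimal fractions, which are left out here). The method name `latvian_plural_category` is made up for this illustration and is not part of any library.

```ruby
# Simplified CLDR plural category for Latvian, integers only.
# Hypothetical helper for illustration; decimal fractions are ignored.
def latvian_plural_category(n)
  n = n.abs
  if n % 10 == 0 || (11..19).cover?(n % 100)
    :zero   # 0, 10, 11..19, 20, 30, ... all share the "zero" form
  elsif n % 10 == 1 && n % 100 != 11
    :one    # 1, 21, 31, ...
  else
    :other  # 2..9, 22..29, ...
  end
end

[0, 1, 2, 10, 11, 21, 30].each do |n|
  puts "#{n} => #{latvian_plural_category(n)}"
end
```

With these rules a "zero" translation such as "No reports" would also be picked for 10, 11 or 30, which is why the zero form cannot double as a "no X" message.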

This can be resolved in two ways:

tomhughes commented 1 year ago

The problem is that https://cldr.unicode.org/index/cldr-spec/plural-rules is self-contradictory - it says both that the zero category can sometimes cover all numbers that end in zero, and also encourages treating it specially so you can do things like "No friends" in place of "0 friends".

gravitystorm commented 1 year ago

  • Have a separate message when N is 0. This works for all languages, though it is a bit more work.

I'm happy to make this change. We currently have the zero form in two places, so it's not much work to sort out.

I wasn't previously aware of the problem with using the zero form, but it makes sense to replace it with a separate translation, given the Latvian issue.

tomhughes commented 1 year ago

Well it's not something we can do on our own - it's the i18n library that decides how to map counts to labels.

gravitystorm commented 1 year ago

Well it's not something we can do on our own - it's the i18n library that decides how to map counts to labels.

I think we can do it ourselves like:

if count == 0
  t '.no_whatevers'
else
  t '.whatevers', :count => count
end

tomhughes commented 1 year ago

Well obviously we can do that, but I was dismissing the idea of doing that everywhere we look up a counted translation as impractical and/or having way too awful an effect on the code.

I'd monkey patch the pluraliser before I did that!

tomhughes commented 1 year ago

Already discussed at https://github.com/ruby-i18n/i18n/issues/629 and https://github.com/ruby-i18n/i18n/blob/master/lib/i18n/backend/pluralization.rb#L17.

Unfortunately although they support 0 and 1 as keys for "exactly 0" and "exactly 1" the :zero key still seems to take precedence if it exists :-(
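The precedence described here can be modelled with a small self-contained sketch. This is not the i18n gem's code: `pick_form` and the sample entry are hypothetical, and the sketch only mimics the order reported above (the :zero key first, then the explicit 0 and 1 keys, then the locale's plural category).

```ruby
# Hypothetical model of the lookup order described above; not the gem's code.
def pick_form(entry, count, category)
  # :zero wins for count == 0 whenever it exists ...
  return entry[:zero] if count == 0 && entry.key?(:zero)
  # ... so the explicit 0 / 1 keys are only consulted afterwards.
  return entry[count] if [0, 1].include?(count) && entry.key?(count)
  # Otherwise use the locale's category, falling back to :other.
  entry.fetch(category) { entry[:other] }
end

# A Latvian-like entry that wants both an "exactly 0" text and a :zero category form.
entry = {
  0      => "No reports",
  :zero  => "%{count} reports (zero form)",
  :other => "%{count} reports"
}

puts pick_form(entry, 0, :zero)   # the :zero form shadows the explicit 0 key
puts pick_form(entry, 10, :zero)  # 10 is in the "zero" category for this locale
```

If the first two lines of `pick_form` were swapped, the explicit 0 key would win for a count of exactly 0 and the :zero form would still serve 10, 20 and so on.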

tomhughes commented 1 year ago

I guess so long as locales that need :zero don't also want explicit zeros it works but it would be better if the lookup in i18n was reversed.

verdy-p commented 1 year ago

And there are other problems: there are languages that need more than 2 forms, and the syntax using CLDR-like labels maps them differently for different values depending on the locale. A locale may need up to 6 forms, adding "two=", "few=" and "many=" in addition to "zero=", which may be distinguished; all these 5 labelled forms should default to the last unlabelled form, but some languages have different fallbacks for these forms before using the last unlabelled form. Not all European languages consider zero to be a plural: French, for example, maps "zero=" to a singular, so that it falls back first to the "one=" label. Many East Asian and Southeast Asian languages do not have a distinguished grammatical plural form; instead, translations may contextually use other features, such as repeating a noun (but not adjectives) or adding some other word to the sentence, such as an adverb.

So if you want to specialize the case "0" for expressing a negation or absence (including with a completely different sentence or presentation, e.g. with different colors, or some emphasis suggesting an error or warning), it makes sense to provide a separate translation. But then let "zero", "one", "two", "few", "many" and the unlabelled default follow the normal CLDR rules for each language.

With the MediaWiki syntax of "PLURAL:" (not "PLURAL" without the colon), we don't have this problem: we can specialize the 0 form by labelling it not with "zero=" but with "0=", a decimal numeric value, which always takes precedence over labelled forms and over the default form when the value matches.

The MediaWiki "PLURAL:" also passes the value to match explicitly, so it allows different plural forms to be used in the same message for different values. This is not the case with the i18n library, which just passes it implicitly, and this causes problems because it can only support pluralization on parts of the message and then requires creating "patchwork" messages. If the application does not care and just appends multiple substrings, the result may not be correct due to incorrect assumptions about the placement of plural forms within a complete sentence (for example, the plural may need to be marked simultaneously in different places, or the logical order of the sentence may need to be changed). Passing the value explicitly allows placing one or more "PLURAL:" selectors in the MediaWiki syntax. It is also possible to drop it in translations (though we can use an empty {{PLURAL:$1|}} to avoid the TWN warning that suggests it was forgotten and marks the message as "fuzzy").

Such empty {{PLURAL:$1|}} tags present in TWN translated messages may be exported and discarded automatically when the message is imported into the project's code repository for building the final app to be released and deployed (the import could also reduce {{PLURAL:$1|word}} tags to just "word" when there is a single form listed). Such cleanup can be part of the import/export tools used between the OSM project and TWN. These marks are still useful in the message sources, to document what is supported and where plural forms can be used, because the app may be able to pass the effective numeric value even if it is not used in the final translated message. The import/export tools implementing such automatic filters/cleanups could also just use the MediaWiki syntax in TWN, making the conversion themselves. But adding a small i18n function in the OSM project to do that conversion would not be a huge development, even if OSM does not use the MediaWiki parser. It just requires basic regexps on messages, taking into account the values passed as parameters in a small associative array for the variable names "n" used in {{PLURAL:$n|...}}. What you do with the regexp matches is then trivial code.
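As a rough illustration of the cleanup step described above, the following sketch drops empty {{PLURAL:$1|}} tags and reduces single-form tags to their only form. The function name `strip_trivial_plurals` is hypothetical, and the regexp assumes that PLURAL tags are not nested and that forms contain no "|" or braces; it is not the MediaWiki parser.

```ruby
# Hypothetical import-time cleanup for messages exported from TWN.
# Assumes PLURAL tags are not nested and forms contain no "|", "{" or "}".
def strip_trivial_plurals(message)
  # Matches only tags with exactly one (possibly empty) form and keeps that form.
  message.gsub(/\{\{PLURAL:\$\w+\|([^|{}]*)\}\}/) { Regexp.last_match(1) }
end

puts strip_trivial_plurals('{{PLURAL:$1|}}Reports: $1')
puts strip_trivial_plurals('$1 {{PLURAL:$1|report}}')
puts strip_trivial_plurals('$1 {{PLURAL:$1|report|reports}}')
```

Tags with two or more forms are left untouched, since they still need real plural handling.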

And once this is done you'll get more freedom and will no longer need to repeatedly handle support requests to fix each new message you add or each message that you want to modify in English (and that suddenly gets broken again with these updates). For now you lose flexibility for your own projects, and localization to all languages other than English is still constantly broken, or arrives long after your planned and executed releases.

tomhughes commented 1 year ago

Yes we know there are other forms and I18n supports them. Can we please not get distracted here.

Yes there are issues about "lateral fallback" and that is discussed in the ticket I linked.

We're not going to invent our own I18n framework though - that would just be silly and is far beyond what we have the resources to maintain.

verdy-p commented 1 year ago

Once again, reported for Ukrainian: https://translatewiki.net/wiki/Thread:Translating_talk:OpenStreetMap/About_Osm:Issues.show.reports/uk

I don't understand the reply "We're not going to invent our own I18n framework though", which just means that he does not want to do anything (even if it would not require a lot of work to provide a simple function to make the conversion and use, for example, the MediaWiki syntax with the same rules), and he apparently prefers having this bug reported again and again, possibly for hundreds of supported languages and for all OSM messages using PLURAL forms, and after each update.

This causes much more loss of time for many more people, and does a disservice to the OSM project and its worldwide community if the resulting translations are grammatically incorrect or broken and not updated when needed at each evolution of the OSM website. I think it's time to seriously invest some time.

This is not a "distraction" as said above. And this does not mean you have to change or reinvent your whole i18n framework, just define a single parsing function on top of it, simply because its current implementation is defective, made with incorrect assumptions, and not working at all to correctly support plural forms in any language other than the source English! All these messages remain "fuzzy" and are not updated, causing problems for a long time on the OSM site for any user not viewing it in English.

There are not a lot of messages in OSM using PLURAL, so integrating the missing i18n function into the code can easily be done with a small patch. Translators in TWN will then be able to work correctly and won't need to report that same bug again and again (without any action taken on the OSM side). TWN admins cannot decide what OSM intends to implement, so they cannot decide how to fix the validator.

gravitystorm commented 1 year ago

I don't understand the reply made by "We're not going to invent our own I18n framework though"

If you don't understand something, then please ask for clarification.

[....] which just means that he does not want to do anything (even if it would not require lot a work

This is absolutely not what he said. I don't know why you took a statement that you say you didn't understand, and then go on for three paragraphs, making up conclusions and attributing bad intentions as you go along. It's not helpful.

Please be assured that we will fix this problem, that's why it was reopened.

verdy-p commented 1 year ago

Hey @gravitystorm, I can also tag your reply as "off topic". I perfectly understood what was said, and what has been repeated again above for a very long time (and this is not the first time it has been reported here).

My comment was on topic because it said that very small changes were needed in the OSM-side code (I did NOT ask for a revolution or a complete rewrite, just fixing a few places with a minimal function call to process the messages). And that would certainly save a lot of time for OSM developers, many translators, and OSM admins who need to handle all these incoming reports without being sure that this is what is desired. The way it has been partially fixed, multiple times, has also produced errors, and pointing users to change things in TWN that were not the initial cause of the problem meant that many messages could not be translated, or have been translated and validated the wrong way; OSM then imported them as is, even if they are clearly not ideal and contain grammatical errors in many languages.

Clearly, OSM developers have to do the small job of writing a basic function to process PLURAL in messages correctly, even if you continue using the existing i18n framework. Maybe you don't know how to do that because you only know British English, but then ask other OSM developers who speak other languages (notably Arabic or Hebrew, but also Celtic and Slavic languages like Irish and Russian) and who have understood the CLDR specifications (which have also been used by MediaWiki for many years, and are needed for correct worldwide support).

Note also that CLDR does not have to specify the specializations possible for specific numbers. CLDR has no contradiction: it just specifies the grammatical classification needed for languages, and a mechanism for fallbacks from one plural class to another. Any application can still apply specializations for specific values to override the classification of these plural forms.

That's what MediaWiki offers and allows (without contradicting CLDR). Its syntax with explicit variables allows different variables in the same message to have different pluralization, without having to break messages into patchworks (e.g. "$1 has added $2 nodes, $3 ways and $4 relations on $date": just consider what happens in German if you split that sentence so that there's a single implied variable, and think about where variable values need to be placed in sentences, or how sentences need to be reformulated in some languages for some values). MediaWiki then simply supports both "zero=" (for linguistic plural classes, but only in languages that need it) and "0=" (as an override for all languages, allowing a different reformulation, including the possibility of returning an empty replacement text, negating a verb, or using an antonym).
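The idea of several explicitly named variables, each with its own plural selector and an exact "0=" override, can be sketched in a few lines of Ruby. Everything here is an assumption made for illustration: `expand_plurals` and `ENGLISH_RULE` are hypothetical names, only a two-category English rule is provided, tags may not be nested, and this is neither the MediaWiki parser nor anything in the OSM code.

```ruby
# Hypothetical expander for a MediaWiki-like {{PLURAL:$n|...}} syntax.
# Simplifications: no nesting, integer values only, English rule by default.
ENGLISH_RULE = ->(n) { n == 1 ? "one" : "other" }

def expand_plurals(message, values, rule = ENGLISH_RULE)
  expanded = message.gsub(/\{\{PLURAL:\$(\d+)\|([^{}]*)\}\}/) do
    index, body = Regexp.last_match.captures
    value = values.fetch(index.to_i)
    forms = body.split("|", -1)
    labelled, unlabelled = forms.partition { |f| f.match?(/\A(\d+|[a-z]+)=/) }
    table = labelled.to_h { |f| f.split("=", 2) }
    # An exact numeric label ("0=") beats the category label ("one="),
    # which beats the final unlabelled default.
    table[value.to_s] || table[rule.call(value)] || unlabelled.last.to_s
  end
  # Finally substitute the remaining $n placeholders with their values.
  expanded.gsub(/\$(\d+)/) { values.fetch(Regexp.last_match(1).to_i).to_s }
end

msg = "$1 added {{PLURAL:$2|0=no nodes|one=$2 node|$2 nodes}} and " \
      "{{PLURAL:$3|0=no ways|one=$3 way|$3 ways}}"
puts expand_plurals(msg, { 1 => "Alice", 2 => 0, 3 => 1 })
```

A locale with a real "zero" category would pass its own rule as the third argument; the "0=" override would still apply only when the value is exactly 0.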

Note that CLDR also specifies alternate forms for numbers with decimals (which behave differently from integers) or in scientific notation (including very big or very small numbers with limited precision and rounding): these allow treating 10.01 like 01 in grammatical plurals, but treating 10.018 like 18. But MediaWiki still does not support these forms for numbers with fractions.

verdy-p commented 1 year ago

Note also that in the example above, even in English this is not correct. With CLDR-named classifiers like "zero=", the label does not mean that the value is exactly "0"; it means that it is a value having the same grammatical plural form as zero, so it should not be used at all for translating "No reports".

The generic CLDR-like plural classifiers are completely ignored in any language that doesn't have such a plural classifier. English is such an example, as it only defines "one" (for the English singular) and "other" (for the default English plural), but does not define "zero", "two", "few" and "many" (so using "zero" in English is also incorrect).

But in languages that define a generic CLDR classifier, multiple actual values of the variable "%{count}" could map to that classifier, so the value MUST be present in the returned text. In addition, if there's a classifier for the target language but it is missing in the message, the language defines which other classifier will act as a fallback (the "other" classifier for the last fallback should be present in the message, otherwise the value should be empty).

For translating "No reports", this can only be mapped as a numeric override "0=No reports" (not using any generic CLDR classifier), which takes precedence over any possible CLDR classifier when the actual value of the parameter is exactly "0". But not "0.0", which is matched in English by its plural in the "other" classifier! (And also not "10" for a language that would map "10" to the generic "zero" class defined for that language.)

The same applies to Polish, which also incorrectly uses "one=1 zgłoszenie" instead of "one=%{count} zgłoszenie" (using a generic classifier) or "1=1 zgłoszenie" (using a numeric override).

In the MediaWiki "{{PLURAL:$variable|...}}" syntax (currently not used by OSM's i18n framework), the CLDR "other" classifier (defined in CLDR for ALL languages, including those that don't have variable forms for the grammatical plural) must not be specified explicitly as "|other=...", it must be the last option "|...}}".

The source message above using the OSM i18n framework in the English source also sets "|%{count} reports}}" without naming the generic "other=" classifier, so this should also be the case in Polish (the error reported by TWN is then correct). However, for compatibility it could be acceptable to name the "other=" classifier explicitly, without having to set a last value that is not assigned to any generic classifier or numeric override (but using "|other=..." and "|...}}" simultaneously would be ambiguous and should be invalid).