mathjazz / pontoon

In-place localization tool
https://pontoon.mozilla.org/
BSD 3-Clause "New" or "Revised" License

[decision] Where should we use "Number of literate speakers" info? #1064

Open mathjazz opened 7 years ago

mathjazz commented 7 years ago

This issue was created automatically by a script.

Bug 1343908

Bug Reporter: @mathjazz CC: etrapani@gmail.com, @akerbeltz, @flodolo, @guerojeff, paulrausch@gmail.com, @zbraniecki

Some facts:

  1. "Number of literate speakers" data is taken from CLDR: http://www.unicode.org/cldr/charts/latest/supplemental/territory_language_information.html

  2. There have been reports that the numbers are not always accurate: http://unicode.org/cldr/trac/ticket/10099

  3. Even if the absolute numbers are not necessarily accurate, they still provide an estimate of how big a particular locale is compared to others.

  4. We aren't aware of any other data source we could use for this purpose.

  5. Data is shown on the Teams dashboard, on each project dashboard and in the heading of each team dashboard: https://pontoon.mozilla.org/teams/ https://pontoon.mozilla.org/projects/pontoon-intro/ https://pontoon.mozilla.org/gd/

  6. Data is useful at least for project managers, for example to identify big locales that aren't complete before the deadline.

In dev-l10n a suggestion has been made to adjust the way we use the "Number of literate speakers" information: https://groups.google.com/forum/#!topic/mozilla.dev.l10n/lNN0N_xt-Xo

Let's figure out where and when to present this information. See User story for more details.

Here are some of the options we have:

  1. Get rid of the data completely, including from the database.

  2. Keep the data, but only show it in the Locale Admin list.

  3. Keep the data, but only show it in the Locale Admin list and in the heading of each team dashboard along with other locale data.

  4. Keep everything unchanged, but only show the "Number of literate speakers" in locale listings to Admins.

  5. Keep everything unchanged.

mathjazz commented 7 years ago

Comment Author: @akerbeltz

Re 3 (in first post): This won't even give you that data. Let's take Austria. It makes the (ludicrous) claim that 95.0% speak Bavarian and that 98% of them are literate. The population of Austria is 8.4 million. That would suggest that there are 7.8 million people who are literate in Bavarian. That's so wrong it gives me a nosebleed (even though I'd like this to be the case). If there are 1,000 people literate in Bavarian I'd be surprised. So if any dev worked on the basis of this figure, they'd be assuming 7.8 million users/speakers who simply don't exist.
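The 7.8 million figure follows directly from the three numbers quoted above. A minimal sketch of that arithmetic, with variable names chosen here purely for illustration:

```python
# Figures as quoted in the comment above; the variable names are illustrative.
austria_population = 8_400_000
share_speaking_bavarian = 0.95  # CLDR claim cited in the comment
share_literate = 0.98           # CLDR claim cited in the comment

implied_literate_speakers = (
    austria_population * share_speaking_bavarian * share_literate
)
print(f"{implied_literate_speakers / 1_000_000:.1f} million")
```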

I would get rid of the data completely (option 1) or simply replace it with a tier system, i.e. if we split our locales into:

Tier 3: < 1 million speakers

Tier 2: 1-5 million speakers

Tier 1: > 5 million speakers

That would give devs a very rough guide about whether a big locale is being impacted by translations being behind but without getting into details that we could argue about till the cows come home.
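The tier proposal could be sketched roughly as follows. The function name `speaker_tier` is hypothetical, and treating exactly 1 million and exactly 5 million as Tier 2 is an assumption, since the proposal does not spell out the boundaries:

```python
def speaker_tier(speakers: int) -> int:
    """Return the tier (1, 2 or 3) for a speaker count.

    Assumption: the 1 million and 5 million boundaries both fall in Tier 2.
    """
    if speakers > 5_000_000:
        return 1  # Tier 1: more than 5 million speakers
    if speakers >= 1_000_000:
        return 2  # Tier 2: 1-5 million speakers
    return 3      # Tier 3: fewer than 1 million speakers


print(speaker_tier(7_800_000))
print(speaker_tier(1_000))
```

With only three coarse tiers, a speaker estimate would have to be wrong by a large margin before a locale changes tier, which is the point of the proposal.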

There are not so many locales on Mozilla. I would suggest pulling the data from Wikipedia rather than CLDR, which in this instance is not reliable at all. Wikipedia isn't entirely reliable for speaker numbers either, but if we did a very broad tier system such as I suggested, it wouldn't matter because it's a rough guideline, and for rough speaker figures Wikipedia is better than CLDR or Ethnologue.

mathjazz commented 7 years ago

Comment Author: @zbraniecki

I would suggest not to overblow a CLDR error. Is there an easy way for us to compare how much CLDR differs from data we could pull from Wikipedia?

I'd be curious to see if it's a meaningful chunk of data or a couple exceptions that we can upstream as bugs to CLDR.

mathjazz commented 7 years ago

Comment Author: @akerbeltz

It is not a single CLDR error. It's virtually all off.

I already filed a bug with CLDR, but they work to their own timescales. In my experience, it will be at least half a year before a fix filters through, once it has been agreed what the fix actually entails, and I suspect that will be a long process with many cooks.

mathjazz commented 7 years ago

Comment Author: @Pike

Michael, your assertions are way off. Take a look at https://de.wikipedia.org/wiki/Deutsche_Sprache#/media/File:Deutsche_Dialekte.PNG.

That said, I wish that CLDR had more documentation on why they change which number to which value. They'd effectively document the value of the data. But asserting that their data isn't data at all is just a fallacy.

mathjazz commented 7 years ago

Comment Author: @guerojeff

All this being said, I feel confident with the idea of replacing number of speakers with "countries where spoken." This would be useful information for us to be able to advocate for localization when leadership identifies marketing focus territories. More so than number of speakers in that region.

mathjazz commented 7 years ago

Comment Author: @zbraniecki

> It is not a single CLDR error. It's virtually all off.

I'm sorry, but this is an opinion, not data. I asked for data.

mathjazz commented 7 years ago

Comment Author: @akerbeltz

I know what the German dialect map looks like. The map represents maximum geographical spread, NOT speaker density. According to that map, all of Munich speaks Bavarian or Hamburg Platt. Which they don't (cf this report https://www.welt.de/wissenschaft/article113938439/Muetter-Medien-Mobilitaet-Warum-Dialekte-sterben.html which states that in Hamburg the number of people who speak Platt has dropped between 1984 and 2007 from 29% to 10%).

Zibi, you know code, I know linguistics, I have a degree in the stuff. Which, if you want to label it "opinion", makes it an "expert opinion". I don't have time to research the entire dataset for your convenience. Read the bug on CLDR, there are some specifics there.

mathjazz commented 7 years ago

Comment Author: @zbraniecki

I agree with :guerojeff.

mathjazz commented 7 years ago

Comment Author: @mathjazz

Created attachment 8868781 a-fullpage.png

I agree there's value in having the "countries where spoken" information available for each locale.

We should spin that off as a separate bug though, because it opens a few additional questions. In particular:

Going back to the original problem: attached is an example of implementing the tier system that Michael proposed. I used the same ranges as Google Play uses for the number of downloads, which splits languages into ~10 groups. I find that number to be a good compromise between A) solving the problem of virtually all speaker numbers being wrong and B) giving us a granular enough grouping of teams.
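The attachment itself is not reproduced here, but the ranges mentioned in this thread ("1-5 million", "50-100 thousand") suggest bounds on a 1, 5, 10, 50, 100, ... ladder. A minimal sketch under that assumption follows; the function name `speaker_range` and the choice of an inclusive lower bound with an exclusive upper bound are hypothetical, not taken from any patch:

```python
def speaker_range(speakers: int) -> tuple[int, int]:
    """Return the (low, high) bucket for a speaker count.

    Assumptions: bucket bounds follow a 1-5-10 ladder, the lower bound is
    inclusive and the upper bound is exclusive.
    """
    if speakers < 1:
        raise ValueError("speakers must be a positive integer")
    # Largest power of ten that does not exceed the value, computed from the
    # digit count to avoid floating-point logarithms.
    power = 10 ** (len(str(speakers)) - 1)
    if speakers < 5 * power:
        return power, 5 * power
    return 5 * power, 10 * power


print(speaker_range(3_000_000))
print(speaker_range(75_000))
```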

Attached file: a-fullpage.png (image/png, 577515 bytes) Description: a-fullpage.png

mathjazz commented 7 years ago

Comment Author: @Pike

I find that data hard to digest, and tedious to read. Like, 50-100 thousand and 50-100 million are almost the same.

Maybe there's a way to color-code this? On a logarithmic scale or so?

Also, curious, what happens if you sort by the population column in the draft patch?
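One way to read the logarithmic colour-coding idea is to map each speaker count to an intensity between 0 and 1 that could then drive a colour ramp. This is only a sketch: the function name `log_intensity` and the use of the largest locale as the reference point are assumptions, not part of any proposal in this thread.

```python
import math


def log_intensity(speakers: int, max_speakers: int) -> float:
    """Map a speaker count to the range 0.0-1.0 on a logarithmic scale.

    Assumption: `max_speakers` is the largest count in the listing (> 1).
    """
    if speakers < 1 or max_speakers < 2:
        raise ValueError("speakers must be >= 1 and max_speakers must be >= 2")
    return min(1.0, math.log10(speakers) / math.log10(max_speakers))


# 50-100 thousand and 50-100 million land at clearly different intensities.
print(log_intensity(75_000, 1_000_000_000))
print(log_intensity(75_000_000, 1_000_000_000))
```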

mathjazz commented 7 years ago

Comment Author: @mathjazz

(In reply to Axel Hecht [:Pike] from comment #10)

> I find that data hard to digest, and tedious to read. Like, 50-100 thousand and 50-100 million are almost the same.

I agree, I tried not to make the column too wide.

There are at least two other variants: http://stackoverflow.com/a/11537826 http://stackoverflow.com/a/34025940

I prefer the second, because it's a closed interval.

> Also, curious, what happens if you sort by the population column in the draft patch?

There's no patch yet, but we can make numbers sort properly regardless of the presentation (similarly as we do in the latest activity column for example).
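Sorting independently of presentation amounts to keeping the underlying number next to the displayed label and sorting on the number. A minimal sketch with hypothetical placeholder rows (the locale names and counts are made up for illustration and are not Pontoon data):

```python
# Hypothetical rows: placeholder locales with made-up speaker counts.
rows = [
    {"locale": "locale-a", "speakers": 3_000_000, "label": "1-5 million"},
    {"locale": "locale-b", "speakers": 75_000, "label": "50-100 thousand"},
    {"locale": "locale-c", "speakers": 60_000_000, "label": "50-100 million"},
]

# Sorting on the displayed text compares strings and mixes up magnitudes.
by_label = sorted(rows, key=lambda row: row["label"])

# Sorting on the stored number gives the expected order.
by_value = sorted(rows, key=lambda row: row["speakers"])

print([row["locale"] for row in by_label])
print([row["locale"] for row in by_value])
```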

mathjazz commented 7 years ago

Comment Author: @mathjazz

Created attachment 8868785 Numbers only

Slightly updated proposal, using numbers instead of words.

1-5 million vs. 1 - 5.000.000

Attached file: b-fullpage.png (image/png, 603504 bytes) Description: Numbers only

mathjazz commented 7 years ago

Comment Author: @akerbeltz

That's a lot of zeroes to take in ;)

How about we just use m (million) and k (thousand) so you'd get 1-5m 5-10m 0.5-1k

It would keep the column narrow, and certainly in the English-speaking world m and k are very very common abbreviations; even in spoken English "five kay" (instead of five thousand) is very common these days.
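The abbreviated labels could be produced along these lines. The function name `format_range` is hypothetical, and picking the unit from the upper bound is an assumption made so that both ends of a range share one suffix:

```python
def format_range(low: int, high: int) -> str:
    """Format a speaker range using k (thousand) or m (million).

    Assumption: the unit is chosen from the upper bound and applied to both
    ends, so the suffix appears only once.
    """
    if high >= 1_000_000:
        unit, suffix = 1_000_000, "m"
    else:
        unit, suffix = 1_000, "k"
    # The "g" format drops trailing zeros, so 5.0 is rendered as "5".
    return f"{low / unit:g}-{high / unit:g}{suffix}"


print(format_range(1_000_000, 5_000_000))
print(format_range(500, 1_000))
```

Ranges that lie entirely below one thousand would come out as fractions of k under this scheme, so a real implementation would probably special-case them.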

mathjazz commented 6 years ago

Comment Author: Paul Rausch <paulrausch@gmail.com>

Why not just use the Ethnologue data for number of speakers? That's what UNESCO, Wikipedia etc. use. Focusing on literate speakers is highly discriminatory against regional and minority languages as well.

mathjazz commented 6 years ago

Comment Author: Eduardo Trápani <etrapani@gmail.com>

  1. Get rid of the data completely, including from the database.

That would be my option. But since you said devs use it, I would only show it to devs. If that's not possible, let's hide it from as many people as possible.

Facts: the data is way off, and it serves no real purpose but to help devs see if a "big" language is not complete. Let's keep it to the devs and spare the rest of us the pain/anger/disbelief of seeing numbers that do not reflect reality.

If you insist on keeping those numbers, for whatever reason, then let's ask each community to provide them, with references, of course.

mathjazz commented 6 years ago

Comment Author: Eduardo Trápani <etrapani@gmail.com>

That information is not only useless for speakers/linguists (who surely have better/preferred sources for the languages they are interested in) but it could be an active impediment to the development of some languages already neglected at the national/regional level. Take Triqui as an example of an active locale in Pontoon:

Pontoon: 4,500 literate speakers

Mexico Census: 25,000 speakers (in 2010, on a seemingly upward trend) [1]

A Triqui speaker might go from "why do it?" to "hey, let's do it!" based on those numbers. We surely don't want to influence communities and collaboration by publishing data that is not really that accurate.

As a side note, they use the Latin script and, lately, up to three writing systems (depending on the intended public), so "literate speakers" doesn't mean what it would mean for other languages with one established writing system.

[1] http://site.inali.gob.mx/pdf/libro_lenguas_indigenas_nacionales_en_riesgo_de_desaparicion.pdf