davelab6 opened this issue 8 years ago
On http://www.cheapprofonts.com/Languages.php there is a chart that may be a useful reference. It lists characters and corresponding Unicode numbers needed for many languages that use the Latin alphabet, though there are some major omissions (notably Vietnamese). Certainly incomplete, but one must start somewhere.
On Mon, Jun 6, 2016 at 12:28 PM, Dave Crossland notifications@github.com wrote:
Fontconfig has a minimal character set for each language, and is well-tested.
For font work, we need both a minimal set, as well as a "nice to have" set, which is used when making subsets (i.e. include them if they are in the font).
And then separate data for digits, currency signs, rare marks, etc.
Start with CLDR and add missing data there, and fix bugs around it.
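To illustrate the two-tier idea from the quoted mail: a minimal sketch, assuming a hypothetical pair of sets for one language and reading the font's cmap with fontTools. The set contents here are placeholders, not real orthography data.

from fontTools.ttLib import TTFont

# Placeholder tiers for one hypothetical language; real data would come
# from CLDR, fontconfig or a curated database.
MINIMAL = set("abcdefghijklmnopqrstuvwxyz")
NICE_TO_HAVE = set("\u2018\u2019\u201C\u201D\u2026")  # curly quotes, ellipsis

def plan_subset(font_path):
    """Return (supported, subset) for the hypothetical language."""
    font = TTFont(font_path)
    encoded = {chr(cp) for cp in font.getBestCmap()}
    # The language counts as supported only if every minimal character
    # is encoded; nice-to-have characters are kept only if already there.
    supported = MINIMAL <= encoded
    subset = MINIMAL | (NICE_TO_HAVE & encoded)
    return supported, subset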
cc @brawer
cc @twardoch
cc @MrBrezina
@ultrasquid, did you know that Unicode CLDR collects which characters are used in what language? For each language, it has four sets of “exemplar characters”. The main set is shown here; the auxiliary set here. Admittedly, the charts could be made a little more readable, but CLDR is a nice central place for collecting this information. If anything is missing or wrong, just file a CLDR ticket. From CLDR, the exemplar character information flows into libraries such as ICU, which is built into many systems. But you can also take the data directly from the source XML files. Search for “exemplar”, for example in the French data.
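It takes very little glue to do that. Here is a rough sketch that pulls the main and auxiliary exemplar sets out of a CLDR source file such as common/main/fr.xml; the UnicodeSet handling is deliberately naive (it strips the brackets and splits on whitespace, ignoring \u escapes, ranges and {sequences}), so treat it as an illustration rather than a parser.

import xml.etree.ElementTree as ET

def exemplar_sets(cldr_xml_path):
    """Extract exemplar character sets from a CLDR main-locale XML file."""
    root = ET.parse(cldr_xml_path).getroot()
    sets = {}
    for node in root.iterfind("characters/exemplarCharacters"):
        kind = node.get("type", "main")
        # Values look like "[a à â æ b c ç ...]". A real parser must also
        # handle \u escapes, a-z ranges and {multi-character} sequences.
        body = (node.text or "").strip().strip("[]")
        sets[kind] = set(body.split())
    return sets

# e.g. exemplar_sets("common/main/fr.xml")["auxiliary"]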
My two cents. What needs to be established first is a clear definition of the set of characters used by a certain language. I know at least two approaches, and I am sure an educated linguist would come up with more, and perhaps more precise, ones:
I think that the difference between (2) and (1) is what CLDR calls “auxiliary”, which seems like a good approach.
In terms of what is useful to save for a single language, I thought of this:
<language iso-639-2="?" name="Hunzib" script="Cyrl" status="todo" opentype-tag="?">
  <characters type="required">АБВГДЕӘЖЗИЙКЛМНОПРСТУӮФХЦЧШЪЫЫЬЭӀабвгдеәжзийклмнопрстуӯфхцчшъыыьэӏ</characters>
  <characters type="recommended" note="punctuation">‹›«»…</characters>
  <shaping type="required">
    <feature opentype-tag="mark">
      <bases>АЕӘОЭаеәоэ</bases>
      <marks>̄</marks>
    </feature>
  </shaping>
</language>
Note: this is just a preliminary example; I do not know whether the data is correct. A while ago I thought some indication of the shaping would be useful, as some of the combinations are required but might not have a codepoint. But I am not sure whether this is too much. Perhaps just noting that a feature (in this case mark) needs to exist would be sufficient. I am aware this is OpenType-specific, but there is no reason this could not include other formats in the future. Perhaps the TTX format would be better for that.
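To make that concrete, here is a sketch of how a checker might consume a file in this proposed schema, assuming the element and attribute names above. The feature check is exactly the shallow red-flag level: it only verifies that the named OpenType feature tag exists in the font's GPOS table, nothing about its contents.

import xml.etree.ElementTree as ET
from fontTools.ttLib import TTFont

def check_language(xml_path, font_path):
    lang = ET.parse(xml_path).getroot()
    font = TTFont(font_path)
    encoded = {chr(cp) for cp in font.getBestCmap()}

    # Which required/recommended characters are missing from the cmap?
    missing = {}
    for chars in lang.iterfind("characters"):
        missing[chars.get("type")] = set(chars.text or "") - encoded

    # Shallow shaping check: the required feature tags merely exist in GPOS.
    gpos_tags = set()
    if "GPOS" in font:
        records = font["GPOS"].table.FeatureList.FeatureRecord
        gpos_tags = {r.FeatureTag for r in records}
    features_ok = all(
        feature.get("opentype-tag") in gpos_tags
        for feature in lang.iterfind("shaping[@type='required']/feature")
    )
    return missing, features_ok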
@ultrasquid I found at least one mistake in Czech, so I would be careful about this list.
Useful resources (reliability varies):
cc: @moyogo
CLDR is not terribly reliable, at least not for Arabic. Its list of Arabic characters lacks important characters while including some rarely used ones. See for example https://github.com/w3c/alreq/issues/49.
CLDR is not terribly reliable
Putting on my Unicode hat for a sec: Please, please, please report bugs to CLDR so we can fix them.
CLDR should be the place for information on characters used by locales. A lot of checks can be derived from the characters and character sequences in the exemplars. But in many cases that is not sufficient.
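One detail worth spelling out: exemplar entries are not always single codepoints. CLDR writes multi-character sequences in braces ({ch} in Slovak, for example), so a coverage check has to keep those together as sequences and expand them into codepoints separately, roughly like this sketch:

import re

def exemplar_entries(unicode_set_text):
    """Split a CLDR UnicodeSet like '[a b {ch} d]' into entries,
    keeping {...} sequences together."""
    body = unicode_set_text.strip().strip("[]")
    return re.findall(r"\{[^}]+\}|\S+", body)

def required_codepoints(entries):
    """Flatten entries into the set of characters a font must encode."""
    chars = set()
    for entry in entries:
        chars.update(entry.strip("{}"))
    return chars

# exemplar_entries("[a b {ch} d]") -> ['a', 'b', '{ch}', 'd']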
There’s actually more information that font producers would want to be able to refer to when testing the coverage of their fonts. Glyph shape or position variation information is out of the scope of Unicode and the CLDR, yet it is a crucial part of proper locale support. Having a character doesn’t mean a font supports the languages using that character. At the same time some of these requirements are style specific and may not apply to every style. But I digress...
In any case, it might be useful to make a fork of the CLDR character exemplar data, expand and modify it with references and push the fixes upstream.
Huerta Tipo have released comparison sites for Devanagari, Cyrillic and Greek; I think this descriptivist approach might be more helpful than a prescriptivist guide :)
@davelab6 The Cyrillic comparison is something I have developed as a fork from Huerta Tipo's projects locally. Sorry, it still isn't publicly available, as I am extending it, and fixing tech issues.
With regard to @davelab6's suggestions and @moyogo's comments. (Sorry if I am stating the obvious here.) Absolutely agreed that there is more to language support than a list of codepoints. However, part of it has to stay in the domain of type design (creating appropriate shapes) and type use (using these shapes) for the time being. We do not have tools and methodologies to distinguish the essential from the non-essential in the shapes (think structure vs. style). And if we cannot do that, we cannot say that some shapes comply with expectations and some do not. And even if we could, it would depend on more variables than just style. It also depends on whom you talk to (e.g. the Polish kreska or Bulgarian Cyrillic discussions). Moreover, the preferences keep changing, and rules of any kind are broken in amazing ways in specific contexts. So there is no way we can tackle language support completely at the moment. I think.
To digress even more and to take Central European languages as an example: there are too many (even award-winning) typefaces which include the right codepoints, even readable shapes you could say, but so badly executed that a great majority of professional Czech designers would be really disappointed if they had to use them.
So what I think we are looking for here is an automated way to diagnose fonts for language support potential based on Unicode codepoints. Nothing more. It is important to be aware of the limits. The question is how do we go about that and where do we draw the line. Personally, I think including some indication of required features is a good idea (so users get a red flag and can go: “Aha, I need something else to be there. I need to research a bit.”), also perhaps some notes. Maybe just the notes. I am not sure if describing the features is all that useful anymore. It adds too much complexity.
See, what I do not know is how to tackle things like accent positions (those which are not codified in Unicode in precomposed form), e.g. for ways of writing Yoruba, or conjuncts for Indic languages. Do we just say that particular features need to be present and leave it up to the user to verify whether the support is actually there?
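One pragmatic answer could be to stop looking at codepoints for those cases and instead shape a small probe string and inspect the result. A rough sketch with uharfbuzz, using a fully decomposed Yoruba-style probe; the probe string and the pass criteria are just examples:

import uharfbuzz as hb

def shapes_cleanly(font_path, text="e\u0323\u0301"):  # ẹ́, fully decomposed
    """True if the font shapes the probe without .notdef and either
    positions the marks or substitutes a composed glyph."""
    face = hb.Face(hb.Blob.from_file_path(font_path))
    font = hb.Font(face)
    buf = hb.Buffer()
    buf.add_str(text)
    buf.guess_segment_properties()
    hb.shape(font, buf)
    if any(info.codepoint == 0 for info in buf.glyph_infos):  # glyph 0 = .notdef
        return False
    # If the marks survive as separate glyphs, at least one should carry
    # a GPOS offset; otherwise the font is probably just overstriking.
    positioned = any(pos.x_offset or pos.y_offset for pos in buf.glyph_positions)
    composed = len(buf.glyph_infos) < len(text)
    return positioned or composed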
@graphicore here's the list of languages I'm most interested in:
Afrikaans, Albanian, Arabic, Azerbaijani, Bulgarian, Catalan, Croatian, Czech, Danish, Dutch, Estonian, Filipino, Finnish, French, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Kazakh, Kyrgyz, Latvian, Lithuanian, Macedonian, Malay, Marathi, Mongolian, Nepali, Norwegian (Bokmål), Persian, Polish, Portuguese, Portuguese (European), Romanian, Russian, Serbian, Serbian (Latin), Slovak, Slovenian, Spanish, Spanish (Latin America), Swahili, Swedish, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese
How does this relate to https://github.com/rosettatype/langs-db? Would it be better to "bridge" to Rosetta's YAML file and auto-instantiate charset objects from that?
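If bridging is the route, the glue could be quite small. A sketch, under the assumption that the database is YAML keyed by ISO code with per-orthography base and auxiliary character strings; the field names here are guesses and should be checked against the actual files:

import yaml  # PyYAML

def charsets_from_langs_db(yaml_path):
    """Instantiate {iso_code: (base_set, auxiliary_set)} from a
    langs-db style YAML file. All field names are assumptions."""
    with open(yaml_path, encoding="utf-8") as f:
        data = yaml.safe_load(f)
    charsets = {}
    for code, language in data.items():
        for ortho in language.get("orthographies", []):
            base = set(ortho.get("base", "").replace(" ", ""))
            auxiliary = set(ortho.get("auxiliary", "").replace(" ", ""))
            charsets[code] = (base, auxiliary)
    return charsets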
cc @matthiasclasen
Hmm. Does Rosetta's repo really not have its issue tracker enabled? @MrBrezina
At any rate, whichever is deemed more canonical, I'd love to merge it with fontconfig's database and make fontconfig generate from it...
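For what it's worth, fontconfig's per-language orthography files (fc-lang/*.orth) are plain lists of hex codepoints and inclusive ranges with # comments, so generating one from a character set is straightforward. A sketch; the range collapsing is only for compactness:

def to_orth(chars, comment="generated"):
    """Render a non-empty set of characters as a fontconfig-style .orth
    body: one hex codepoint or inclusive range per line."""
    cps = sorted(ord(c) for c in chars)
    lines = [f"# {comment}"]
    start = prev = cps[0]
    for cp in cps[1:] + [None]:
        if cp is not None and cp == prev + 1:
            prev = cp
            continue
        lines.append(f"{start:04x}" if start == prev else f"{start:04x}-{prev:04x}")
        if cp is not None:
            start = prev = cp
    return "\n".join(lines) + "\n"

# to_orth(set("abcdefxyz")) -> "# generated\n0061-0066\n0078-007a\n"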
@behdad I have activated it now. :) We did not consider it quite ready.
BTW, we renamed it to Hyperglot today (Langs DB was too general), and @kontur refactored the tool and tests for the new structure of the database. We plan to add more languages in the next few weeks.
Quoting @behdad