graphicore / specimenTools

Apache License 2.0

Language coverage details #19

Open thlinard opened 7 years ago

thlinard commented 7 years ago

In Charset Coverage Details, I'd like to see the old and new (2016) Google sets separated. Having Latin, Cyrillic and Greek next to Cyrillic Plus, Latin Expert, etc. is a bit confusing.

Same for Charset Coverage Details.

Language support by all available Char Sets is sometimes erroneous. Greek Core doesn't include Latin Plus, for example.

graphicore commented 7 years ago

In Charset Coverage Details, I'd like to see the old and new (2016) Google sets separated. Having Latin, Cyrillic and Greek next to Cyrillic Plus, Latin Expert, etc. is a bit confusing.

I see. The easiest option for me would be to just sort the legacy sets to the bottom, but I can also make a clean separation. I'm not sure how the GF API will handle the legacy encodings next to the novel ones, but that will be important to the users of the font specimen in the end.

Language support by all available Char Sets is sometimes erroneous. Greek Core doesn't include Latin Plus, for example

Aha, OK. We should probably discuss this at google/fonts. If Greek Core doesn't include Latin Plus, where is it taking its (standard) punctuation from? I have some similar questions on my list. The discussion of https://github.com/google/fonts/issues/624 is related.

Note that the online version uses the files of https://github.com/google/fonts/pull/642 where I included Latin Plus in Greek Core, which surely could be wrong.

thlinard commented 7 years ago

We should probably discuss this at google/fonts. If Greek Core doesn't include Latin Plus, where is it taking its (standard) punctuation from?

Hum… From Latin Core? But Latin Core seems to exist only virtually. Probably this should be clarified.

graphicore commented 7 years ago

Yeah, I'm in the process of writing something up. There are a few issues I have with this charset analysis. As a matter of fact, at the moment you posted I had just created this list:

0x0021 ! EXCLAMATION MARK
0x0022 " QUOTATION MARK
0x0026 & AMPERSAND
0x0028 ( LEFT PARENTHESIS
0x0029 ) RIGHT PARENTHESIS
0x002A * ASTERISK
0x002C , COMMA
0x002D - HYPHEN-MINUS
0x002E . FULL STOP
0x002F / SOLIDUS
0x003A : COLON
0x003B ; SEMICOLON
0x0040 @ COMMERCIAL AT
0x005B [ LEFT SQUARE BRACKET
0x005C \ REVERSE SOLIDUS
0x005D ] RIGHT SQUARE BRACKET
0x00A7 § SECTION SIGN
0x00AB « LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
0x00BB » RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
0x0301 ́ COMBINING ACUTE ACCENT
0x0308 ̈ COMBINING DIAERESIS
0x2010 ‐ HYPHEN
0x2013 – EN DASH
0x2014 — EM DASH
0x2026 … HORIZONTAL ELLIPSIS

These are the chars missing from Greek Core when asking the CLDR.
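A list like the one above can be produced with a plain set difference. The following is only a minimal sketch: `missingChars` and `formatCodePoint` are hypothetical helper names, the tiny `greekOnly` charset is made-up sample data rather than the real Greek Core Namelist, and Unicode character names are left out because they would need extra data.

```javascript
// Return the code points required by a language that a charset lacks.
// `required` is any iterable of characters (a string or a Set of strings),
// `charset` is a Set of numeric code points (hypothetical input shape).
function missingChars(required, charset) {
  const missing = new Set();
  for (const ch of required) {
    const cp = ch.codePointAt(0);
    if (!charset.has(cp)) missing.add(cp); // the Set removes duplicates
  }
  return Array.from(missing).sort((a, b) => a - b); // sorted by code point
}

// Format a code point like the list above, e.g. "0x0021 !".
function formatCodePoint(cp) {
  const hex = cp.toString(16).toUpperCase().padStart(4, "0");
  return "0x" + hex + " " + String.fromCodePoint(cp);
}

// Made-up sample: a charset that only covers two Greek letters.
const greekOnly = new Set([0x03b1, 0x03b2]);
console.log(missingChars("αβ!,;!", greekOnly).map(formatCodePoint).join("\n"));
```

Removing duplicates and sorting inside the helper mirrors the cleanup applied to the list by hand.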

But Latin Core seems to exist only virtually. Probably this should be clarified.

In https://github.com/google/fonts/issues/624 I came to the same conclusion :-) The question for me is whether to pack this into the https://github.com/google/fonts/pull/642 PR or do it with a new PR. Also, from 642 I should probably remove the inclusion of Latin Plus in Greek Core, heh?

(just updated the charlist above: removed duplicates, sorted)

thlinard commented 7 years ago

These are the chars missing from Greek Core when asking the CLDR.

And not the Arabic numerals?

Also, from 642 I should probably remove the inclusion of Latin Plus in Greek Core, heh?

Yes, probably. One set (GF Greek Pro) needs some characters from GF Latin Plus and Pro sets, as stated in the README.md, but unless Latin Pro is intended as a prerequisite for all GF sets, it's too much for Greek coverage.

graphicore commented 7 years ago

And not the Arabic numerals?

Interesting. I'm using this: https://github.com/unicode-cldr/cldr-misc-modern/blob/master/main/el/characters.json From that file I take the keys main.characters.exemplarCharacters and main.characters.punctuation. I'm also applying the JavaScript String.prototype.toUpperCase function to all chars, which should do the right thing: change the char if Unicode defines an uppercase, otherwise leave it. There are no numerals in this document though. Similarly, no numerals are defined for Arabic either: https://github.com/unicode-cldr/cldr-misc-modern/blob/master/main/ar/characters.json
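The approach described above could be sketched roughly like this. It is not the actual specimenTools code: `parseExemplars` and `requiredChars` are hypothetical names, the parser only covers single characters, backslash escapes, `a-z` ranges and `{...}` sequences (not the full UnicodeSet syntax), and the `sample` object is abridged, made-up data shaped like the two keys named above.

```javascript
// Simplified reader for exemplar strings such as "[α-γ ΐ \\- {ch}]".
// Assumption: no nested sets, no \uXXXX escapes, no property expressions.
function parseExemplars(unicodeSet) {
  const text = unicodeSet.trim();
  if (!text.startsWith("[") || !text.endsWith("]")) {
    throw new Error('expected a bracketed set like "[a b c]"');
  }
  const tokens = Array.from(text.slice(1, -1)); // split by code point
  const result = new Set();
  let previous = null; // last single character, start of a possible range
  for (let i = 0; i < tokens.length; i++) {
    let ch = tokens[i];
    if (/\s/.test(ch)) continue; // whitespace only separates items
    if (ch === "{") {
      // "{ch}" is a multi-character sequence; keep its code points.
      for (i++; i < tokens.length && tokens[i] !== "}"; i++) result.add(tokens[i]);
      previous = null;
      continue;
    }
    if (ch === "-" && previous !== null && i + 1 < tokens.length) {
      // "a-z" range: add everything after the start, up to the end.
      let end = tokens[++i];
      if (end === "\\" && i + 1 < tokens.length) end = tokens[++i];
      for (let cp = previous.codePointAt(0) + 1; cp <= end.codePointAt(0); cp++) {
        result.add(String.fromCodePoint(cp));
      }
      previous = null;
      continue;
    }
    if (ch === "\\" && i + 1 < tokens.length) ch = tokens[++i]; // escaped literal
    result.add(ch);
    previous = ch;
  }
  return result;
}

// Collect the characters of both keys plus their uppercase forms.
function requiredChars(characters) {
  const required = new Set();
  for (const key of ["exemplarCharacters", "punctuation"]) {
    for (const ch of parseExemplars(characters[key] || "[]")) {
      required.add(ch);
      // toUpperCase may return several code points for one character.
      for (const upper of ch.toUpperCase()) required.add(upper);
    }
  }
  return required;
}

// Abridged, made-up sample in the shape of main.characters.
const sample = {
  exemplarCharacters: "[α-γ ΐ ς]",
  punctuation: "[\\- , \\: \\[ \\]]",
};
console.log(Array.from(requiredChars(sample)).join(" "));
```

One thing worth noting: `toUpperCase` follows Unicode's special casing rules, so "ΐ" (U+0390) becomes "Ι" followed by U+0308 and U+0301. That is one way combining marks can end up in a required-characters list even when the exemplar data does not list them on their own.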

Good find, thanks!

The information should be somewhere, maybe in https://github.com/unicode-cldr/cldr-numbers-modern? But at first glance it seems to define number formatting rather than the numerals themselves. Do you know where to look for the numerals in the CLDR?

Also, from 642 I should probably remove the inclusion of Latin Plus in Greek Core, heh?

Yes, probably.

Will do.

One set (GF Greek Pro) needs some characters from GF Latin Plus and Pro sets

I've seen that. This needs a decision. Either we do kind of "technical" Namelist files, so that we don't repeat ourselves (if this is feasible, it would be quite a bummer to end up with one Namelist per char) or we just include these chars in GF Greek Pro. "technical" Namelist files wouldn't be available via the Fonts API, just for us to define charsets.

I wrote something yesterday for Dave to look at, it's interesting for this discussion as well, sort of:

#20

It suggests that we can support languages even if we don't support the whole GF-charset. This could have implications for how we define GF-charsets.

thlinard commented 7 years ago

Do you know where to look for the numerals in the CLDR?

It seems to be https://github.com/unicode-cldr/cldr-core/blob/master/supplemental/numberingSystems.json

graphicore commented 7 years ago

Ah, great thanks. It's linked to the locales via cldr-numbers-modern:

excerpt for ar:

      "numbers": {
        "defaultNumberingSystem": "arab",
        "otherNumberingSystems": {
          "native": "arab"
        },

for el:

      "numbers": {
        "defaultNumberingSystem": "latn",
        "otherNumberingSystems": {
          "native": "latn",
          "traditional": "grek"
        },

thlinard commented 7 years ago

Oh, they called "latn" the Arabic numerals, I suppose… And "arab" the Indic numerals used in the Arabic script.

thlinard commented 7 years ago

The list still lacks basic characters, like # % < > + = × ÷

graphicore commented 7 years ago

Oh, they called "latn" the Arabic numerals, I suppose… And "arab" the Indic numerals used in the Arabic script.

Yeah, right, but it seems to do the right thing anyways:

      "latn": {
        "_digits": "0123456789",
        "_type": "numeric"
      },

Though they make it more complicated for me sometimes, "_type": "algorithmic" … :

      "grek": {
        "_rules": "greek-upper",
        "_type": "algorithmic"
      },
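To make the lookup concrete, here is a small sketch of how the digits for a locale might be resolved from the two excerpts. `digitsForLocale` is a hypothetical helper, and the objects below only copy the fragments quoted in this thread, not the full CLDR files. Systems of type "algorithmic" have no `_digits` string, so the sketch just reports them as unresolved.

```javascript
// Fragments as quoted above (numberingSystems.json and the "numbers" key for el).
const numberingSystems = {
  latn: { _digits: "0123456789", _type: "numeric" },
  grek: { _rules: "greek-upper", _type: "algorithmic" },
};
const elNumbers = {
  defaultNumberingSystem: "latn",
  otherNumberingSystems: { native: "latn", traditional: "grek" },
};

// Collect the digit characters of every numbering system a locale names.
function digitsForLocale(numbers, systems) {
  const names = new Set([
    numbers.defaultNumberingSystem,
    ...Object.values(numbers.otherNumberingSystems || {}),
  ]);
  const digits = new Set();
  const unresolved = []; // algorithmic or unknown systems
  for (const name of names) {
    const system = systems[name];
    if (system && system._type === "numeric") {
      for (const digit of system._digits) digits.add(digit);
    } else {
      unresolved.push(name);
    }
  }
  return { digits: Array.from(digits).join(""), unresolved };
}

console.log(digitsForLocale(elNumbers, numberingSystems));
```

For el this yields the digits of "latn" and lists "grek" as unresolved, which is where the "algorithmic" type makes things more complicated.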

The list still lacks basic characters, like # % < > + = × ÷

I guess the question is whether these are needed to write the language. I'm not really deep into the concepts of CLDR.