rosettatype / hyperglot

Hyperglot: a database and tools for detecting language support in fonts
http://hyperglot.rosettatype.com
GNU General Public License v3.0

Clarify how to deal with other characters (symbols, currencies, punctuation) #60

Open kontur opened 2 years ago

kontur commented 2 years ago

More as a reminder that we are aware of these and that they are currently not included in the data.

E.g. #58 has Tamil symbols which seem like they would be required, or at least relevant, for language support, but there is no way (other than as a note) to include this information in the current data scheme.

MrBrezina commented 5 months ago

Could be an alternative orthography.

kontur commented 5 months ago

I'd rather include them as optional attributes of existing orthographies and follow the same route as for #154, meaning: define them on a script level, inherit on data access, and allow per-language/orthography overwrites if set.

alerque commented 5 months ago

Punctuation and symbols are not alternative orthography. If you put it there it will make the actual alternate orthography data harder to use.

Also you have really tricky situations with punctuation (even more so than the occasional alphabet snafu) with official vs. traditional data. For example it's pretty obvious now that ₺ should be included in a font for proper support of Turkish even though it isn't part of the alphabet. The Turkish Lira symbol is new enough that many fonts claiming to support Turkish don't have it at all.

Basic punctuation like .,:;!?/- and more are also official, as are less common things like em-dashes —. Then you get into the things that are definitely not official (even explicitly excluded) like en-dashes – and ampersands &. The Turkish language institute expressly forbids the use of ampersands in favor of spelling out ve, but of course that doesn't stop everybody from using it. It isn't as widely used as in English, but still common enough that –for example– a font claiming to support Turkish should probably include it even if on a technicality it could be excluded.

An even greyer area would be curly braces {}. Parentheses () and square brackets [] are mentioned with specific roles in the orthography, but no mention is made of braces. In the wild some people use them where parentheses belong, some where brackets belong, others randomly for other things, some not at all.

MrBrezina commented 5 months ago

I meant the alternative orthography specifically for Tamil. I will enquire with our Tamil contact about this. He suggested putting them in auxiliary, initially, but that seems a bit too much.

@alerque thank you for the input. I agree, punctuation gets wild. In #154 (sorry for the partial duplication) I have suggested something that could work: general definition per script, clarified preference (legal standard, dominant practice) in language. We are trying to find a “reasonably regular approach”. If you have suggestions or more tough cases, drop them there, please.

Currencies are probably best handled by an independent check as they, imho, should be mapped to countries/states rather than languages.

alerque commented 5 months ago

Yes, currencies are more of a country / locale issue than a language / script one. Obviously there is going to be some crossover. Isn't currency info in CLDR too? Importing those related symbols when the locale and language do map to each other might still be quite useful.

And yes, punctuation chaos like the en-dash / ampersand stuff I described is a different kind of issue.

kontur commented 5 months ago

A couple of rather unsorted thoughts on these issues:

To illustrate the last, let's imagine a "Latin" script level definition like:

base:
  punctuation: . ; : - — ? ! ' " “ ”
  currency: $ €
  numerals: 0 1 2 3 4 5 6 7 8 9
auxiliary:
  punctuation: ‘ ’ ¿ ¡ & ( ) [ ] { }
  currency: ...a ton of currency symbols here…

and for example for French the primary orthography could look something like:

base: ...
auxiliary: ...
marks: ...
punctuation: « » ‹ ›

and have a "historic" orthography that additionally has:

currency: ₣

This would be interpreted as: . ; : - — ? ! ' " “ ” $ € 0 1 2 3 4 5 6 7 8 9 (inherited) as well as « » ‹ › (orthography level) are required for base support, whereas ‘ ’ ¿ ¡ & ( ) [ ] { } (inherited) are required for auxiliary support. ₣ is only required for the historic orthography. All this would only apply if the user opts in to the checks for currencies/punctuation/numerals.
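A minimal sketch of how that interpretation could be computed, assuming the script-level and orthography-level definitions are already loaded as dictionaries. The function name `resolve_required`, the data layout, and the choice that auxiliary support includes everything required for base support are assumptions for illustration, not part of Hyperglot. The character lists are copied from the pseudo definitions above.

```python
# Hypothetical data mirroring the pseudo definitions; not Hyperglot's schema.
LATIN = {
    "base": {
        "punctuation": ". ; : - — ? ! ' \" “ ”",
        "currency": "$ €",
        "numerals": "0 1 2 3 4 5 6 7 8 9",
    },
    "auxiliary": {
        "punctuation": "‘ ’ ¿ ¡ & ( ) [ ] { }",
    },
}

FRENCH_PRIMARY = {"punctuation": "« » ‹ ›"}
FRENCH_HISTORIC = {"punctuation": "« » ‹ ›", "currency": "₣"}

CATEGORIES = ("punctuation", "currency", "numerals")


def resolve_required(script, orthography, level="base"):
    """Return {category: set of characters} for one support level.

    Script-level "base" characters are always inherited. Orthography-level
    characters are added on top and count towards base support. For
    level="auxiliary" the script-level auxiliary characters are added too
    (assumption: auxiliary support implies base support).
    """
    resolved = {}
    for category in CATEGORIES:
        chars = set(script["base"].get(category, "").split())
        if level == "auxiliary":
            chars |= set(script.get("auxiliary", {}).get(category, "").split())
        chars |= set(orthography.get(category, "").split())
        resolved[category] = chars
    return resolved
```

With this layout, `resolve_required(LATIN, FRENCH_HISTORIC)["currency"]` would contain ₣ alongside the inherited $ and €, while the primary orthography would not.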

A Turkish orthography could look something like:

base: ...
auxiliary: ...
marks: ...
currency: ₺

All Latin base currencies/punctuation/numerals are required, as well as ₺. Again, all these apply only if the user opts in.
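To make the opt-in part concrete, here is a rough sketch of a check that only reports missing characters for the categories a user explicitly asks for. `missing_characters` and `TURKISH_REQUIRED` are hypothetical names, and the font is represented simply as a set of characters (in practice that set would have to be built from the font's character map).

```python
# Hypothetical resolved requirements for Turkish, following the pseudo
# definition: inherited Latin base lists plus the Turkish Lira sign.
TURKISH_REQUIRED = {
    "punctuation": set(". ; : - — ? ! ' \" “ ”".split()),
    "currency": {"$", "€", "₺"},
    "numerals": set("0123456789"),
}


def missing_characters(font_chars, required, opt_in=()):
    """Return {category: sorted missing characters} for opted-in categories.

    Categories that are not listed in opt_in are skipped entirely, so the
    default behaviour (no opt-in) never reports anything.
    """
    report = {}
    for category in opt_in:
        missing = required.get(category, set()) - set(font_chars)
        if missing:
            report[category] = sorted(missing)
    return report


# A font with the Latin basics but no Turkish Lira sign.
font_chars = set("abc0123456789.;:-—?!'\"“”$€")
print(missing_characters(font_chars, TURKISH_REQUIRED))                # {}
print(missing_characters(font_chars, TURKISH_REQUIRED, ["currency"]))  # {'currency': ['₺']}
```

Without opting in, the font passes; only when the currency check is requested does the missing ₺ show up.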

How does this sound? We could collect a few more pseudo definitions for other scripts and languages and see if this reveals some limitations.

justinpenner commented 3 months ago

This issue of how to deal with non-grapheme characters gets more and more complicated the more I think about it. Some thoughts:

Here's roughly what I'm imagining for extending language definitions, using English as an example:

name: English
orthographies:
- autonym: English
  base: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z Æ Œ a b c d e f g h i j k l m n o p q r s t u v w x y z æ œ
  auxiliary: À Á Ç È É Ê Ë Ï Ñ Ô Ö à á ç è é ê ë ï ñ ô ö
  base_numerals: 0 1 2 3 4 5 6 7 8 9
  base_symbols: "` ~ ! @ # $ % ^ & * ( ) - _ = + [ ] { } \\ | ; : ' \" < > , . ? /"
  auxiliary_symbols: ƒ ¶ § © ® ™ ° ¦ † ‡ ¤
  marks: ◌̀ ◌́ ◌̂ ◌̃ ◌̈ ◌̧
  script: Latin
  status: primary

kontur commented 3 months ago

Thanks for the input! It is really hard to come up with a satisfying approach to this, so more opinions certainly are welcome!

  • Currency should definitely not be in a script definition, and probably not in language definitions, either. Whether a currency symbol is required is not really dependent on the language, but on the country.

Very good point. What I meant is that, on a conceptual level, there could be inheritance from the script. If you look at your example, almost all languages of the Latin script would probably have the exact same base_numerals. We could maintain a minimal, script-level baseline list of characters. Any orthography of that script would implicitly inherit those, unless explicitly overwritten by the orthography. This would serve both as a way to keep the data less repetitive and as a way to narrow down the most essential characters of each category.

Generally speaking I would aim for these categories to list only essential symbols, so no refined splitting into base/aux. For auxiliary symbols, for example, it is nearly impossible to draw the line consistently, both with regard to when a symbol is no longer "base" and to when a symbol is obscure enough not to be included in "auxiliary". I'd rather approach this by listing only very certain ones in "base" and leave it to the understanding of the user that there are near endless symbols that may somehow be relevant in a language, but are impossible to list exhaustively. Like with orthographies, HG should list what we are certain of. In your example, I would consider the base_symbols a list I'd be comfortable with requiring a font to support (for all of Latin, actually), and auxiliary_symbols I'd consider too arbitrary to include. And, for example, for French, I would opt to explicitly overwrite base_symbols to include guillemets, for German to include baseline quotes, etc.

(It's a question of notation, but we could even allow extending the default instead of merely overwriting it, e.g. define symbols: *latin_symbols « » ‹ ‘ where latin_symbols would be the inherited default. Alternatively, we could write it explicitly like this for all languages of the Latin script instead of relying on an implicit requirement; the YAML variable makes it clearer that the attribute exists and what it includes, even if for most Latin-script languages it would be only `*latin_symbols`.)
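A rough sketch of how such an "extend the default" notation could be expanded after loading the data. As far as I understand, a plain YAML alias cannot be combined with extra text inside one scalar, so this assumes the value arrives as an ordinary string and the `*name` reference is resolved by custom code. `expand_symbols` and the contents of `DEFAULTS` are hypothetical; only the name `latin_symbols` comes from the notation above.

```python
# Hypothetical script-level defaults, kept short for illustration.
DEFAULTS = {"latin_symbols": ". , : ; ! ? ( ) [ ]"}


def expand_symbols(value, defaults=DEFAULTS):
    """Expand `*name` references in a space-separated symbol list.

    A token such as `*latin_symbols` is replaced by the named default list.
    A bare `*` stays a literal asterisk. Order is kept and duplicates are
    dropped. Unknown names raise KeyError.
    """
    result = []
    for token in value.split():
        if token.startswith("*") and len(token) > 1:
            expanded = defaults[token[1:]].split()
        else:
            expanded = [token]
        for char in expanded:
            if char not in result:
                result.append(char)
    return result
```

An orthography that only inherits would then carry the value `*latin_symbols`, and one that extends the default would list its extra characters after the reference.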

  • If we add a new set of country definitions, it could include not only currency symbols, but also a list of official languages. This would open up new potential use cases for Hyperglot that I would be happy to see. For example, imagine if the Hyperglot database could be used to tell a type designer what languages are spoken in their continent or region, and generate character sets to help them design fonts to support their own local languages.

HG has steered clear of this deliberately, so far. These kinds of definitions get argumentative very quickly, particularly with regard to historical, minority and diaspora type of use cases. I do see the benefit from a user perspective, though.

The question of currency symbols remains, however. What I wonder is if there are false positives when currencies are linked to languages. A naive example would be English, where HG orthographies do not distinguish between British and American English, so an English currency list would include both $, £ (and ¢). And less obviously, English is in frequent use as an administrative language or lingua franca, e.g. you might expect it to also include ₹ for use throughout India, etc. Again, the most conservative approach would be to include only those symbols of which we have high certainty, so for English I'd argue this would be $ and £ but no other currency symbols. Someone interested in "localized" support of a language, in this case English in use in India, would be "missing" a warning about a font missing ₹, but then again, HG does not support this kind of localized check to begin with, so worrying about currency symbols in this context is moot. Would there be cases where this approach would result in a false positive? Meaning, are there orthographies in use in different locales with distinct currency symbols where including both/several would result in an unacceptable currency symbol requirement for the other locales?

MrBrezina commented 3 months ago

In #155 I have mentioned this:

The idea was to define a general set of punctuation for the script and a preference for the language.

I think this is a key decision we need to make. Do we provide:

  1. base list of symbols and language-specific extension
  2. base list of symbols and language-specific override
  3. base list of symbols and language-specific recommendations/preference
  4. combination of 1 or 2 with 3, specifically 3 for quotes in Latin*

* Anything else in the Latin script that we know varies across languages? (I am not talking about typesetting rules which are of course diverse.)
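To compare the first three options side by side, here is a small sketch of how each one would combine a script-level base list with a language-level list. The function `resolve_symbols`, the mode names, and the return shape are assumptions made for illustration only.

```python
def resolve_symbols(script_base, language_value, mode):
    """Combine a script-level base list with a language-level list.

    mode "extend":   base plus language additions are all required (option 1).
    mode "override": the language list replaces the base when it is
                     non-empty (option 2).
    mode "prefer":   the base stays required; the language list is only
                     reported as a preference (option 3).
    Returns a tuple (required, preferred) of two sets.
    """
    base = set(script_base.split())
    language = set(language_value.split())
    if mode == "extend":
        return base | language, set()
    if mode == "override":
        return (language or base), set()
    if mode == "prefer":
        return base, language
    raise ValueError(f"unknown mode: {mode}")


# Quotes as the example: a generic Latin list and German-style quotes.
latin_quotes = "' \" “ ”"
german_quotes = "„ “ ‚ ‘"
for mode in ("extend", "override", "prefer"):
    print(mode, resolve_symbols(latin_quotes, german_quotes, mode))
```

Option 4 would then amount to choosing "extend" or "override" for most categories and "prefer" for quotes.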

My feeling is similar to @kontur's: symbols belong to a script rather than to a language.

Once we decide on that, we can devise a notation and take it from there. I have some thoughts on how to decide to keep it practical, but I need to ponder.

The base list could be scraped from major corpora for each script with some frequency threshold and if particular. This way we avoid arbitrary subjective decisions.
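A minimal sketch of the corpus idea, assuming a corpus is available as plain text. It counts characters whose Unicode general category is punctuation, symbol, or number and keeps those above a relative frequency threshold. The function name, the choice of categories, and the default threshold are assumptions, not an agreed method.

```python
from collections import Counter
import unicodedata


def frequent_symbols(text, threshold=0.001):
    """Return non-letter characters that are frequent enough in text.

    Frequency is the character's count divided by the number of
    non-whitespace characters. Only characters whose Unicode general
    category starts with P (punctuation), S (symbol) or N (number) are
    considered. The result is ordered from most to least frequent.
    """
    chars = [c for c in text if not c.isspace()]
    total = len(chars)
    if total == 0:
        return []
    counts = Counter(c for c in chars if unicodedata.category(c)[0] in "PSN")
    return [c for c, n in counts.most_common() if n / total >= threshold]


sample = "Hello, world! Hello, again. 1, 2, 3"
print(frequent_symbols(sample, threshold=0.1))  # [',']
```

The threshold would need tuning per script and corpus; a value that works for running prose may be too strict for symbols such as currency signs that are rare in text but still expected in a font.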

I consider currencies a separate question, linked to country. I would be happy if we had that list, though.

If one wanted to provide an interface, ideally, one would ask two questions: (a) what languages does a user want to support and (b) what countries are they considering?
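The two questions could map onto two independent lookups whose results are merged, roughly as sketched below. Both tables are hypothetical and deliberately tiny; neither is actual Hyperglot data, and the keys are illustrative codes.

```python
# Hypothetical language-level symbols (a script-level base would be
# inherited separately) and country-level currency symbols.
LANGUAGE_SYMBOLS = {
    "eng": set(),
    "fra": {"«", "»", "‹", "›"},
    "tur": set(),
}
COUNTRY_CURRENCIES = {
    "GB": {"£"},
    "US": {"$", "¢"},
    "IN": {"₹"},
    "FR": {"€"},
    "TR": {"₺"},
}


def required_characters(languages, countries):
    """Union of language-specific symbols and country-specific currencies.

    (a) languages answers "what languages do you want to support?"
    (b) countries answers "what countries are you considering?"
    """
    required = set()
    for language in languages:
        required |= LANGUAGE_SYMBOLS[language]
    for country in countries:
        required |= COUNTRY_CURRENCIES[country]
    return required


# English as used in India: the rupee sign comes from the country,
# not from the language.
print(required_characters(["eng"], ["IN"]))  # {'₹'}
```

Keeping the two lookups separate means English itself never has to carry ₹, £, or $; the currency requirement follows from the countries the user selects.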