Clarify how to deal with other characters (symbols, currencies, punctuation)

kontur commented 2 years ago

More as a reminder that we are aware of these and that they are currently not included in the data.

E.g. #58 has Tamil symbols which seem like they would be required, or at least relevant, for language support, but there is no way (other than as a note) to include this information in the current data scheme.

MrBrezina commented 5 months ago

Could be an alternative orthography.

kontur commented 5 months ago

I'd rather include them as optional attributes of existing orthographies and follow the same route as for #154, meaning: define them on a script level, inherit on data access, and allow per-language/orthography overwrites if set.

alerque commented 5 months ago

Punctuation and symbols are not alternative orthography. If you put it there it will make the actual alternate orthography data harder to use.

Also you have really tricky situations with punctuation (even more so than the occasional alphabet snafu) with official vs. traditional data. For example it's pretty obvious now that ₺ should be included in a font for proper support of Turkish even though it isn't part of the alphabet. The Turkish Lira symbol is new enough that many fonts claiming to support Turkish don't have it at all.

Basic punctuation like .,:;!?/- and more are also official, as are less common things like em-dashes —. Then you get into the things that are definitely not official (even explicitly excluded) like en-dashes – and ampersands &. The Turkish language institute expressly forbids the use of ampersands in favor of spelling out ve, but of course that doesn't stop everybody from using it. It isn't as widely used as in English, but still common enough that –for example– a font claiming to support Turkish should probably include it even if on a technicality it could be excluded.

Even greyer areas would be curly braces {}. Parentheses () and square brackets [] are mentioned with specific roles in the orthography, but no mention is made of braces. In the wild some people use them where parentheses belong, some where brackets belong, others randomly for other things, some not at all.

MrBrezina commented 5 months ago

I meant the alternative orthography specifically for Tamil. I will enquire with our Tamil contact about this. He suggested putting them in auxiliary, initially, but it seems a bit too much.

@alerque thank you for the input. I agree, punctuation gets wild. In #154 (sorry for the partial duplication) I have suggested something that could work: general definition per script, clarified preference (legal standard, dominant practice) in language. We are trying to find a “reasonably regular approach”. If you have suggestions or more tough cases, drop them there, please.

Currencies are probably best handled by an independent check as they, imho, should be mapped to countries/states rather than languages.

alerque commented 5 months ago

Yes, currencies are more of a country / locale issue than a language / script one. Obviously there is going to be some crossover. Isn't currency info in CLDR too? Importing those related symbols when the locale and language do map to each other might still be quite useful.

And yes, punctuation chaos like the en-dash / ampersand stuff I described is a different kind of issue.

kontur commented 5 months ago

A couple of rather unsorted thoughts on these issues:

I'd consider this and #154 the same issue, essentially; same solution, different category of characters
We could make checking currencies/punctuation/numerals an optional validation argument, meaning by default we do not fail fonts lacking support unless the user opts in to that check, at least until we have a bit more user feedback. E.g. I find it reasonable to drop fonts not containing ₺ for Turkish, if indeed this is an opt-in check. The opt-in could be per category, or one for all of the categories discussed here.
We could bluntly add these characters as part of base/aux amongst other characters; in any spot where the data is presented/output we can use unicode categories to split them apart from letters into their respective categories to display separately as currencies/punctuation/numerals — this would be less neat in the database files, but it would accommodate explicit base/aux levels for everything; not sure I like the messy nature of this approach
Alternatively we could define e.g. a script level set of currencies/punctuation/numerals which is split into base and auxiliary and which is applicable for all languages of that script. And any orthography-level overwrites are treated as required for base support, meaning the script definitions for base should be concise enough to really apply to all languages of that script, whereas the orthography definitions should be very narrow, too, to really only add new base requirements that are really considered essential to that language. The auxiliary definitions on script level could be very wide, to provide very good coverage if checking on that detailed level (and opting in to the currency/punctuation/numeral check to begin with).
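The idea of listing everything together and splitting it apart by Unicode category at output time could be sketched roughly as below. This is only an illustration: the bucket names and the mapping from Unicode general categories to buckets are assumptions, not anything Hyperglot currently does, and the sketch assumes each space-separated entry is a single code point.

```python
import unicodedata

def bucket(char: str) -> str:
    """Map one code point to a hypothetical display bucket."""
    category = unicodedata.category(char)
    if category == "Sc":
        return "currency"
    if category.startswith("N"):
        return "numerals"
    if category.startswith("P"):
        return "punctuation"
    if category.startswith("L") or category.startswith("M"):
        return "letters"
    return "other"

def split_by_bucket(characters: str) -> dict[str, list[str]]:
    """Group a space-separated character list by bucket."""
    buckets: dict[str, list[str]] = {}
    for char in characters.split():
        buckets.setdefault(bucket(char), []).append(char)
    return buckets

# A mixed list as it might appear in a single "base" attribute.
mixed_base = "a b ç ğ 0 1 . ; ? ₺ $ &"
print(split_by_bucket(mixed_base))
```

Note that characters such as + or ^ belong to Unicode's symbol categories (Sm, Sk) rather than punctuation, so they would land in "other" here; any real split would need to decide what to do with those.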

To illustrate the last option (script-level sets), let's imagine a "Latin" script-level definition like:

base:
  punctuation: . ; : - — ? ! ' " “ ”
  currency: $ €
  numerals: 0 1 2 3 4 5 6 7 8 9
auxiliary:
  punctuation: ‘ ’ ¿ ¡ & ( ) [ ] { }
  currency: ...a ton of currency symbols here…

and for example for French the primary orthography could look something like:

base: ...
auxiliary: ...
marks: ...
punctuation: « » ‹ ›

and have a "historic" orthography that additionally has:

currency: ₣

This would be interpreted as: . ; : - — ? ! ' " “ ” $ € 0 1 2 3 4 5 6 7 8 9 (inherited) as well as « » ‹ › (orthography level) are required for base support, whereas ‘ ’ ¿ ¡ & ( ) [ ] { } (inherited) are required for auxiliary support. ₣ is only required for the historic orthography. All this would only apply if the user opts in to the checks for currencies/punctuation/numerals.

A Turkish orthography could look something like:

base: ...
auxiliary: ...
marks: ...
currency: ₺

All Latin base currencies/punctuation/numerals are required, as well as ₺. Again, all these apply only if the user opts in.
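A minimal sketch of how this inheritance could be resolved on data access, using the pseudo definitions from above. The dictionary layout and the function name `required` are hypothetical and only meant to test the idea, not a description of Hyperglot's schema or API.

```python
# Data copied from the pseudo definitions above. "\u2014" is the em dash
# from the Latin punctuation list.
LATIN = {
    "base": {
        "punctuation": ". ; : - \u2014 ? ! ' \" “ ”",
        "currency": "$ €",
        "numerals": "0 1 2 3 4 5 6 7 8 9",
    },
    "auxiliary": {
        "punctuation": "‘ ’ ¿ ¡ & ( ) [ ] { }",
    },
}
FRENCH = {"punctuation": "« » ‹ ›"}
TURKISH = {"currency": "₺"}

def required(script, orthography, level="base", categories=()):
    """Characters required at `level` for the opted-in `categories`.

    With no categories opted in, nothing extra is required.
    """
    chars = []
    for category in categories:
        # Inherited from the script-level definition.
        chars += script.get(level, {}).get(category, "").split()
        if level == "base":
            # Orthography-level entries always count towards base support.
            chars += orthography.get(category, "").split()
    return chars

print(required(LATIN, TURKISH, "base", ("currency",)))
print(required(LATIN, FRENCH, "auxiliary", ("punctuation",)))
```

Here the orthography can only add to the inherited base set; whether it should also be able to remove inherited characters is left open.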

How does this sound? We could collect a few more pseudo definitions for other scripts and languages and see if this reveals some limitations.

justinpenner commented 3 months ago

This issue of how to deal with non-grapheme characters gets more and more complicated the more I think about it. Some thoughts:

Currency should definitely not be in a script definition, and probably not in language definitions, either. Whether a currency symbol is required is not really dependent on the language, but on the country. David mentioned this in #156.
If we add a new set of country definitions, it could include not only currency symbols, but also a list of official languages. This would open up new potential use cases for Hyperglot that I would be happy to see. For example, imagine if the Hyperglot database could be used to tell a type designer what languages are spoken in their continent or region, and generate character sets to help them design fonts to support their own local languages.
Punctuation and other typographic symbols should be combined into one "symbols" category, in my opinion. It should have a broad definition to include most non-graphemes except numerals and currency symbols. Otherwise, if we have two categories: "punctuation" ! ( ) - [ ] { } ; : ' " , . ? / and "other typographic symbols" ~ @ # $ % ^ & * _ = + \ | < > ƒ ¶ § © ® ™ ° ¦ † ‡ ¤, it will have much overlap and lead to disagreements about which bucket each character belongs in.
A "symbols" category could still be broken into "required" and "auxiliary", with the caveat that it would lead to some debate over where to draw the line for each language. I would start by looking at keyboard layouts, and put those symbols in the "required" bucket.

Here's roughly what I'm imagining for extending language definitions, using English as an example:

name: English
orthographies:
- autonym: English
  base: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z Æ Œ a b c d e f g h i j k l m n o p q r s t u v w x y z æ œ
  auxiliary: À Á Ç È É Ê Ë Ï Ñ Ô Ö à á ç è é ê ë ï ñ ô ö
  base_numerals: 0 1 2 3 4 5 6 7 8 9
  base_symbols: "` ~ ! @ # $ % ^ & * ( ) - _ = + [ ] { } \\ | ; : ' \" < > , . ? /"
  auxiliary_symbols: ƒ ¶ § © ® ™ ° ¦ † ‡ ¤
  marks: ◌̀ ◌́ ◌̂ ◌̃ ◌̈ ◌̧
  script: Latin
  status: primary
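As a rough sketch of how such attributes might feed an opt-in check, assuming the attribute names from the example above (`base`, `base_numerals`, `base_symbols`). The function `missing`, its `opt_in` parameter, and the font character set are hypothetical, made up for illustration.

```python
ENGLISH = {
    "base": "A B C D E F G H I J K L M N O P Q R S T U V W X Y Z Æ Œ "
            "a b c d e f g h i j k l m n o p q r s t u v w x y z æ œ",
    "base_numerals": "0 1 2 3 4 5 6 7 8 9",
    "base_symbols": "` ~ ! @ # $ % ^ & * ( ) - _ = + [ ] { } \\ | ; : ' \" < > , . ? /",
}

def missing(orthography, font_chars, opt_in=()):
    """Required characters absent from the font.

    Only `base` is checked by default; each name in `opt_in`
    (e.g. "numerals", "symbols") adds the matching base_* attribute.
    """
    fields = ["base"] + [f"base_{name}" for name in opt_in]
    needed = []
    for field in fields:
        needed += orthography.get(field, "").split()
    return [char for char in needed if char not in font_chars]

# A made-up font covering the letters, digits and a little punctuation.
font_chars = set(ENGLISH["base"].split()) | set("0123456789") | set(".,;:!?")

print(missing(ENGLISH, font_chars))                       # default check
print(missing(ENGLISH, font_chars, opt_in=("symbols",)))  # opted-in check
```

With the default arguments the made-up font passes; only the opted-in symbols check reports gaps.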

kontur commented 3 months ago

Thanks for the input! It is really hard to come up with a satisfying approach to this, so more opinions certainly are welcome!

Currency should definitely not be in a script definition, and probably not in language definitions, either. Whether a currency symbol is required is not really dependent on the language, but on the country.

Very good point. What I meant by this is that on a conceptual level there could be inheritance from the script. If you look at your example, almost all languages of the Latin script would probably have the exact same base_numerals. What I meant is that we could maintain a script-based, very baseline list of characters. Any orthography of said script would implicitly inherit those, unless explicitly overwritten by the orthography. This would serve both as a way to keep the data somewhat less repetitive and as a way to really narrow down on the most essential characters of each category.

Generally speaking I would aim for these categories to list only essential symbols, so no refined splitting into base/aux. For example, for auxiliary symbols it is nearly impossible to draw the line consistently, both with regard to when a symbol is not "base" any more and when a symbol is obscure enough to not be included in "auxiliary". I'd rather approach this by listing only very certain ones in "base" and leave it to the understanding of the user that there are near-endless symbols that may somehow be relevant in a language, but are impossible to list exhaustively. Like with orthographies, HG should list what we are certain of. In your example, I would consider the base_symbols a list I'd be comfortable with requiring a font to support (for all of Latin, actually), and auxiliary_symbols I'd consider too arbitrary to include. And, for example, for French, I would opt to explicitly overwrite base_symbols to include guillemets, for German to include baseline quotes, etc.

(It's a question of notation, but we could even allow extending the default, instead of mere overwriting, e.g. define symbols: *latin_symbols « » ‹ ‘ where latin_symbols would be the inherited default; or writing it explicitly like this for all languages of Latin, instead of doing some implicit requirement, with the yaml variable making it more clear that the attribute exists and what it includes, even if for most Latin languages it would be only `*latin_symbols`)
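One way such an "extend or overwrite" notation could be resolved is sketched below. It treats the attribute value as a plain string and expands `*name` tokens by hand, on the assumption that a YAML alias cannot simply be followed by more characters in the same value. `DEFAULTS`, its contents, and `resolve_symbols` are hypothetical.

```python
# Hypothetical named defaults; the contents are placeholders.
DEFAULTS = {"latin_symbols": ". , ; : ! ? ' \""}

def resolve_symbols(value: str, defaults: dict[str, str]) -> list[str]:
    """Expand "*name" references and keep literal characters as they are.

    A value without any reference acts as a plain overwrite; a value that
    starts with a reference extends the inherited default.
    """
    resolved: list[str] = []
    for token in value.split():
        # A lone "*" is a literal asterisk, not a reference.
        if len(token) > 1 and token.startswith("*"):
            expanded = defaults[token[1:]].split()
        else:
            expanded = [token]
        for char in expanded:
            if char not in resolved:
                resolved.append(char)
    return resolved

print(resolve_symbols("*latin_symbols « »", DEFAULTS))  # extend
print(resolve_symbols("« »", DEFAULTS))                 # overwrite
```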

If we add a new set of country definitions, it could include not only currency symbols, but also a list of official languages. This would open up new potential use cases for Hyperglot that I would be happy to see. For example, imagine if the Hyperglot database could be used to tell a type designer what languages are spoken in their continent or region, and generate character sets to help them design fonts to support their own local languages.

HG has steered clear of this deliberately, so far. These kinds of definitions get argumentative very quickly, particularly with regard to historical, minority and diaspora type of use cases. I do see the benefit from a user perspective, though.

The question of currency symbols remains, however. What I wonder is whether linking currencies to languages would produce false positives. A naive example would be English, where HG orthographies do not distinguish between British and American English, so an English currency list would include both $ and £ (and ¢). Less obviously, English is in frequent use as an administrative language or lingua franca, e.g. you might expect it to also include ₹ for use throughout India, etc. Again, the most conservative approach would be to include only those symbols of which we have high certainty, so for English I'd argue this would be $ and £ but no other currency symbols. Someone interested in "localized" support of a language, in this case English in use in India, would be "missing" a warning about a font missing ₹, but then again, HG does not support this kind of localized check to begin with, so worrying about currency symbols in this context is moot. Would there be cases where this approach would result in a false positive? Meaning, are there orthographies in use in different locales with distinct currency symbols where including both/several would result in an unacceptable currency symbol requirement for the other locales?

MrBrezina commented 3 months ago

In #155 I have mentioned this:

The idea was to define a general set of punctuation for the script and preference for the language.

I think this is a key decision we need to make. Do we provide:

1. base list of symbols and language-specific extension
2. base list of symbols and language-specific override
3. base list of symbols and language-specific recommendations/preference
4. combination of 1 or 2 with 3, specifically 3 for quotes in Latin*

* Anything else in the Latin script that we know varies across languages? (I am not talking about typesetting rules which are of course diverse.)

My feeling is similar to @kontur's that symbols are of a script rather than of a language.

Once we decide on that, we can devise a notation and take it from there. I have some thoughts on how to decide to keep it practical, but I need to ponder.

The base list could be scraped from major corpora for each script with some frequency threshold and if particular. This way we avoid arbitrary subjective decisions.
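A toy sketch of what such a frequency threshold could look like. The function name, the choice to count only Unicode punctuation and symbol categories, and the relative-share threshold are all assumptions; a real run would use large corpora rather than one sample sentence.

```python
import unicodedata
from collections import Counter

def frequent_symbols(corpus: str, threshold: float) -> list[str]:
    """Punctuation and symbol characters whose share of all such
    characters in the corpus is at least `threshold`."""
    counts = Counter(
        char for char in corpus
        if unicodedata.category(char)[0] in ("P", "S")
    )
    total = sum(counts.values())
    if total == 0:
        return []
    return sorted(char for char, n in counts.items() if n / total >= threshold)

# Made-up sample text standing in for a corpus.
sample = "Hello, world! Price: 10 $ (approx.). OK? Yes, fine."
print(frequent_symbols(sample, 0.15))
```

Where to set the threshold remains a judgment call, but it would at least be a single documented decision per script rather than one per character.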

I consider currencies a separate question, linked to country. I would be happy if we had that list, though.

If one wanted to provide an interface, ideally, one would ask two questions: (a) what languages does a user want to support and (b) what countries are they considering?
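Those two questions could be combined along these lines. The tables are illustrative only: the language entries are truncated to a few letters for brevity, the country table is a hand-written stand-in for a maintained source, and `character_set` is a hypothetical name.

```python
# Illustrative data only; language entries are deliberately truncated.
LANGUAGE_CHARS = {
    "tur": "ç ğ ı ö ş ü",
    "eng": "a b c",
}
# Currency symbols keyed by country code.
COUNTRY_CURRENCY = {
    "TR": "₺",
    "IN": "₹",
    "GB": "£",
    "US": "$",
}

def character_set(languages, countries):
    """Union of language characters (a) and country currency symbols (b)."""
    chars = set()
    for language in languages:
        chars.update(LANGUAGE_CHARS[language].split())
    for country in countries:
        chars.update(COUNTRY_CURRENCY[country].split())
    return chars

# English as used in India and the UK: ₹ and £ are added, $ is not.
print(sorted(character_set(["eng"], ["IN", "GB"])))
```

Keeping the two tables separate sidesteps the question of which currency symbols "belong" to English, since the currency requirement comes only from the countries the user selects.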

rosettatype / hyperglot

Clarify how to deal with other characters (symbols, currencies, punctuation) #60