rosettatype / hyperglot

Hyperglot: a database and tools for detecting language support in fonts
http://hyperglot.rosettatype.com
GNU General Public License v3.0

Hawaiian okina correction #79

Closed · justinpenner closed this issue 2 years ago

justinpenner commented 2 years ago

The Hawaiian okina consonant is represented by U+02BB MODIFIER LETTER TURNED COMMA. Previously, hyperglot.yaml listed it as U+2019 RIGHT SINGLE QUOTATION MARK (the right apostrophe).

Source: https://en.wikipedia.org/wiki/%CA%BBOkina
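For reference, the corrected entry would look roughly like this (a sketch following the schema in README_database.md; the character list is illustrative, not copied verbatim from the database):

```yaml
# hyperglot.yaml (sketch, not the verbatim entry)
haw:
  name: Hawaiian
  orthographies:
    - script: Latin
      # okina encoded as U+02BB MODIFIER LETTER TURNED COMMA,
      # replacing the U+2019 RIGHT SINGLE QUOTATION MARK listed before
      base: a e i o u ā ē ī ō ū h k l m n p w ʻ
```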

lianghai commented 2 years ago

Is there a way to specify alternative code points for conceptually the same grapheme?

Yes, U+02BB ʻ MODIFIER LETTER TURNED COMMA is considered to be the “proper” encoding of okina, but it’s apparently encoded in various non-idealistic code points, including U+2018 ‘ LEFT SINGLE QUOTATION MARK (and the U+2019 ’ RIGHT SINGLE QUOTATION MARK currently in the data file). A font that doesn’t support all of those will only support Hawaiian texts in an idealistic dream.
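To make the three encodings concrete, a quick check with Python’s standard unicodedata module prints their canonical names:

```python
import unicodedata

# The “proper” encoding plus two stand-ins seen in real-world Hawaiian text.
for char in ("\u02BB", "\u2018", "\u2019"):
    print(f"U+{ord(char):04X} {char} {unicodedata.name(char)}")

# U+02BB ʻ MODIFIER LETTER TURNED COMMA
# U+2018 ‘ LEFT SINGLE QUOTATION MARK
# U+2019 ’ RIGHT SINGLE QUOTATION MARK
```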

justinpenner commented 2 years ago

@lianghai That's an interesting question. Some existing Hawaiian texts probably have left/right/vertical quotation marks encoded instead of the proper turned comma characters.

Modern Hawaiian keyboard layouts use U+02BB (turned comma), so I think at best these homoglyphs can be considered deprecated, which might qualify them as auxiliary characters, not base characters (see README_database.md).

But I would hesitate to even include them as auxiliary characters. How many other languages have homoglyphs like this from the pre-Unicode/pre-international-keyboards era? There must be a lot, and I don’t know if it would be worthwhile to add them all to the Hyperglot database, especially since most of these would already be included in the most basic character sets, having originated from a time when character sets were limited or keyboard layouts from another more dominant language were used.

I think it would be good to make a decision on this topic for Hyperglot as a whole, so does anyone else have thoughts to add? To summarize the question: When homoglyphs with alternate codepoints were historically used prior to a language having Unicode and/or keyboard support, should these be included in auxiliary?

MrBrezina commented 2 years ago

We usually include legacy/non-ideal code points in auxiliary; see, for example, Romanian. Theoretically, this could be dealt with by creating an alternative (historical, secondary) orthography. However, I feel both Romanian and Hawaiian are best served by including the “alternative code points” in auxiliary. As to the question of which code points should be included, I would say those that represent firmly or officially established practice.
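Concretely, following the Romanian precedent, the Hawaiian entry might end up looking something like this (only a sketch; which stand-ins to list is exactly the open question, and the auxiliary line here is hypothetical):

```yaml
haw:
  orthographies:
    - script: Latin
      base: a e i o u ā ē ī ō ū h k l m n p w ʻ   # U+02BB, the standardized encoding
      auxiliary: ‘ ’                               # U+2018/U+2019 stand-ins (hypothetical)
```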

@justinpenner Thank you for the contribution, and sorry for taking so long to get to it (busy with other matters). Do you want to create a pull request?

lianghai commented 2 years ago

@justinpenner:

> How many other languages have homoglyphs like this from the pre-Unicode/pre-international-keyboards era? There must be a lot, …

It’s not merely a historical problem.

> … and I don’t know if it would be worthwhile to add them all to the Hyperglot database, …

One of the core functions of Hyperglot is checking whether a given character set covers what a language needs. If you don’t consider that reality “worthwhile” to capture, it’s just an idealistic database that only tells you what idealists think a language needs.
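For context, the kind of coverage check at stake can be sketched in a few lines with fontTools (a simplification, not Hyperglot’s actual implementation):

```python
from fontTools.ttLib import TTFont

def font_codepoints(path: str) -> set[int]:
    """Collect the code points a font maps in its preferred cmap subtable."""
    return set(TTFont(path).getBestCmap().keys())

# An “idealistic” requirement lists only U+02BB; a “realistic” one also
# has to account for the stand-ins actually found in texts.
idealistic = {0x02BB}
realistic = {0x02BB, 0x2018, 0x2019}

cmap = font_codepoints("SomeFont.otf")  # hypothetical font path
print("idealistic pass:", idealistic <= cmap)
print("realistic pass:", realistic <= cmap)
```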

> … especially since most of these would already be included in the most basic character sets, …

Are you saying that, in order to make sense of the Hawaiian data, I need to somehow know that smart quotes may also be needed for the particular issue of encoding okina? Okay, so the alternative encodings of okina happen to be covered by the generally available basic glyph set; then what about other cases, like Skolt Sami? It’s a general problem (the “proper” Unicode encoding is so obscure that people use other code points all the time), and it’s only a special case that okina’s alternative encodings happen to be characters already used by English.

> … having originated from a time when character sets were limited or keyboard layouts from another more dominant language were used.

This is a major and common misunderstanding of the problem. No, it’s not a historical problem. The alternative characters will always be there; there will always be users producing new texts with them. The “proper” Unicode character simply isn’t necessary to average users, so they don’t care.

Do English users all care about the typographically “proper” smart quotes? No, many don’t, and that’s not merely a result of previously limited character sets. It’s a result of average users not really caring about certain requirements that matter to professionals.

Is the differentiation between okina and a left quote helpful? Yes. With the differentiation, we get to process them separately. But is it practical to expect average users to be able to make a distinction between two characters that usually look identical? Do users actually care enough to make the effort to ensure such an invisible differentiation?

And without users being able to ensure the differentiation, can text processing even take advantage of the supposedly helpful differentiation between okina and a left quote?
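One small example of what the differentiation would buy, if it could be relied on: Python’s regex engine treats U+02BB as a word character (it is a letter, General Category Lm) but splits on U+2018 (punctuation, category Pi):

```python
import re

# U+02BB is a letter, so it stays inside the word:
print(re.findall(r"\w+", "Hawai\u02BBi"))  # ['Hawaiʻi']
# U+2018 is punctuation, so word matching breaks the word apart:
print(re.findall(r"\w+", "Hawai\u2018i"))  # ['Hawai', 'i']
```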

It’s not just “an interesting question”. It’s a question that Unicode experts today keep asking themselves so that the Unicode Standard isn’t just an idealistic dream.

> I think it would be good to make a decision on this topic for Hyperglot as a whole, …

Yes. There should be a more explicitly defined strategy for this common problem. Another example: Tamil ஸ்ரீ srī and its requirement of “ஶ” per the “proper” encoding.
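For the record, the two encodings of that cluster differ only in the first code point, which is easy to verify with unicodedata:

```python
import unicodedata

common = "\u0BB8\u0BCD\u0BB0\u0BC0"  # ஸ்ரீ with TAMIL LETTER SA (common in practice)
proper = "\u0BB6\u0BCD\u0BB0\u0BC0"  # ஶ்ரீ with TAMIL LETTER SHA (the “proper” encoding)
for text in (common, proper):
    print(text, [unicodedata.name(c) for c in text])
```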

> To summarize the question: When homoglyphs with alternate codepoints were historically used prior to a language having Unicode and/or keyboard support, should these be included in auxiliary?

That’s not the correct question, because it’s not a historical problem.

@MrBrezina:

> Theoretically, this can be dealt with by creating an alternative historical orthography.

It’s not historical. See my explanation above.

alerque commented 2 years ago

> But I would hesitate to even include them as auxiliary characters. How many other languages have homoglyphs like this from the pre-Unicode/pre-international-keyboards era?

Lots, sadly. If you'll recall, we recently ran into something similar with the Turkish character sets. The original data had some homoglyphs that are relics of old encodings (or, for this language family, homoglyphs from Cyrillic-based locales that look like Turkish Latin glyphs and are sometimes incorrectly used by translators from those regions) that we moved out to alternatives, leaving the primary character set as only the set blessed by Unicode and the Türk Dil Kurumu.
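One well-known pair of this kind, chosen purely for illustration (not necessarily one of the pairs that was in the Turkish data), is the Latin and Cyrillic schwa used across Turkic orthographies: two distinct code points that are visually identical:

```python
import unicodedata

for char in ("\u0259", "\u04D9"):
    print(f"U+{ord(char):04X} {char} {unicodedata.name(char)}")

# U+0259 ə LATIN SMALL LETTER SCHWA
# U+04D9 ә CYRILLIC SMALL LETTER SCHWA
```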

justinpenner commented 2 years ago

@lianghai makes some really good points, and I feel like I'm probably the least knowledgeable person in the room here. My framing of this as a "historical" problem was indeed clouded by the idealist notion that everyone is using correct keyboard layouts and typography when it's available on their devices. But I'm sure there are many people who use alternate homoglyphs even if their devices are capable of typing the "correct" characters. There might be many bilingual Hawaiian/English speakers who would rather not switch their keyboard settings constantly and might use whatever looks similar to an okina on their US English keyboard layout.

So it is important to acknowledge these homoglyphs in some languages like Hawaiian, and I think auxiliary is probably the best place for them in the current database structure. As with the Turkish character set mentioned by @alerque, the Hawaiian alphabet, keyboard, and code points are now standardized, with the okina as U+02BB TURNED COMMA. It seems Hyperglot has already established that standardized characters generally go into base, and non-standardized characters from past or present can go in auxiliary.

So in this case I would recommend that we keep the turned comma in base, and I would not add anything to auxiliary until someone has time to scan some Hawaiian corpora to find out exactly which alternate characters are commonly used. I think it's practical to say that base sets should be higher priority than auxiliary, and auxiliary will involve more depth of research. I think we can merge this PR as-is and the auxiliary characters could be added in a future PR.

kontur commented 2 years ago

@lianghai has good observations about real-world use, but I think it is misleading to deduce that the database should infer standardized orthography from this. We're not determining whether a text is in a certain language, but whether a font can write a certain language.

In my view, a note about such glyphs and their commonly used stand-ins would be most appropriate. They are not alternative orthographies in any real sense (even if not marked as historical, which would also be possible). Any inclusion in font validation checks (even as auxiliary, which you need to opt in to) muddies the support result. Should they all be required? Should any beyond the standardized one be required? What is the threshold for which ones get listed, and thus required? Should a font pass that has a stand-in but not the standardized character? Any way you slice it, I don't see this leading to better results.
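To make the trade-off concrete, the two competing policies could be sketched like this (hypothetical pseudologic, not Hyperglot's actual behaviour):

```python
OKINA = 0x02BB
STAND_INS = {0x2018, 0x2019}

def passes_strict(cmap: set[int]) -> bool:
    # Only the standardized code point counts as support.
    return OKINA in cmap

def passes_lenient(cmap: set[int]) -> bool:
    # Any commonly used encoding counts, standardized or not.
    return OKINA in cmap or bool(STAND_INS & cmap)
```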

Hyperglot provides certainty that the glyphs listed in the orthography are needed without question.

lianghai commented 2 years ago

> … but I think it is misleading to deduce that the database should infer standardized orthography from this.

No one was suggesting those alternative encodings have anything to do with a “standardized orthography” in any sense.

I simply asked a question in order to understand whether Hyperglot is intended to be idealistic or practical when it comes to such encoding issues. It’s a matter of whether the database represents real-world texts or an idealistic encoding recommendation.

MrBrezina commented 2 years ago

Can you unpack what you mean by the terms “idealistic”, “an idealistic dream”, and “practical”? Then perhaps we can help you answer your question about whether Hyperglot is or is not what you think. I will set up a separate issue in a sec to discuss clarifying the handling of alternative characters; it is a relevant topic to address.

MrBrezina commented 2 years ago

@kontur merged. I will double-check @lianghai’s non-idealistic code points with a native speaker and, if they are relevant, include them in a note for now.