rosettatype / hyperglot

Hyperglot: a database and tools for detecting language support in fonts
http://hyperglot.rosettatype.com
GNU General Public License v3.0

Notation for inline auxiliary for any kind of character set #177

Open MrBrezina opened 2 weeks ago

MrBrezina commented 2 weeks ago

There are situations where we want to list auxiliary characters for other kinds of things (e.g. punctuation, numerals). Perhaps it would be better to have an inline notation for optional/auxiliary characters that could be used in any list of characters.

Instead of:

base: a b c
auxiliary: x y z

we could have the following (or use a different escape character):

base: a b c \x \y \z
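
A minimal sketch of how such an inline notation could be read back into two lists, assuming the value has already been loaded as a plain string and that tokens are separated by whitespace. The function name `split_inline_auxiliary` and the choice of a backslash as the escape character are illustrative assumptions, not part of Hyperglot:

```python
def split_inline_auxiliary(value, escape="\\"):
    """Split a space-separated character list into (base, auxiliary).

    Hypothetical helper: a token prefixed with `escape` is auxiliary,
    everything else is base. A token that is only the escape character
    itself is kept as a literal base character.
    """
    base, auxiliary = [], []
    for token in value.split():
        if token.startswith(escape) and len(token) > len(escape):
            auxiliary.append(token[len(escape):])
        else:
            base.append(token)
    return base, auxiliary


base, auxiliary = split_inline_auxiliary(r"a b c \x \y \z")
print(base)       # ['a', 'b', 'c']
print(auxiliary)  # ['x', 'y', 'z']
```

One caveat with a backslash: YAML only leaves it untouched in plain and single-quoted scalars. In a double-quoted scalar it starts an escape sequence, so a different marker may be safer.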

kontur commented 2 weeks ago

I'm torn on this one. In a way, base vs. auxiliary is a very binary distinction, and as you mention, it could be extended to more than just the characters of base. However, adding more implicit notation seems like it would be less clear and less simple to author.

That said, this would be pretty neat. What if any auxiliary chars were wrapped in parentheses? I think that doesn't interfere with YAML parsing, but it would need custom parsing of all the YAML strings:

name: English
orthographies:
- autonym: English
  characters: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z Æ Œ a b c d e f g h i j k l m n o p q r s t u v w x y z æ œ (À Á Ç È É Ê Ë Ï Ñ Ô Ö à á ç è é ê ë ï ñ ô ö)
  currency: $ ¢ £ € (¥)
  marks: (◌̀ ◌́ ◌̂ ◌̃ ◌̈ ◌̧)
  numerals: 0 1 2 3 4 5 6 7 8 9
  punctuation: '. , ; : ? ! “ ” ‘ ’ ' ( ) (% & ¿ ¡)

Food for thought.
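
The custom parsing this would need can be sketched roughly as below. It assumes whitespace-separated tokens, and one disambiguation rule that is purely an assumption: a `(` or `)` standing alone as its own token is a literal character (as in the punctuation line above), while a parenthesis attached to another character opens or closes an auxiliary group. The name `split_parenthesised` is hypothetical:

```python
def split_parenthesised(value):
    """Split a character list into (base, auxiliary) lists.

    Assumed rule: "(x" opens an auxiliary group and "x)" closes it;
    a parenthesis that is a token of its own is a literal character.
    """
    base, auxiliary = [], []
    in_group = False
    for token in value.split():
        # A multi-character token starting with "(" opens a group.
        if len(token) > 1 and token.startswith("("):
            in_group = True
            token = token[1:]
        # Inside a group, a multi-character token ending with ")" closes it.
        closes = in_group and len(token) > 1 and token.endswith(")")
        if closes:
            token = token[:-1]
        (auxiliary if in_group else base).append(token)
        if closes:
            in_group = False
    return base, auxiliary


print(split_parenthesised("$ ¢ £ € (¥)"))
# (['$', '¢', '£', '€'], ['¥'])
print(split_parenthesised(". , ; : ? ! ( ) (% & ¿ ¡)"))
# (['.', ',', ';', ':', '?', '!', '(', ')'], ['%', '&', '¿', '¡'])
```

The second call uses a simplified version of the punctuation line without the quote characters. It shows that literal parentheses can coexist with the grouping, but only as long as they are kept as separate tokens.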

Also, I wonder if "extended" makes more sense as a term in the docs and CLI parameters. Like checking for basic language support vs checking for extended language support.

kontur commented 2 weeks ago

Also, for now we're not set on what kind of requirement the currency/numerals/punctuation are for. We talked about them either as opt-in or auxiliary level requirements, but having this more nuanced notation might open the door to also having some core currency/numerals/punctuation as base level required.

MrBrezina commented 2 weeks ago

So far, I have actually managed without it; see #155. We could use this notation to distinguish the Standard and Alternative notions as described here: https://en.wikipedia.org/wiki/Quotation_mark

moyogo commented 2 weeks ago

Why is putting them in auxiliary not an option?

MrBrezina commented 1 week ago

We would not be able to say whether it is punctuation, currency, numeral or character.

moyogo commented 1 week ago

Unicode character categories would help but there may be a few exceptions where a character category doesn't match its use in a language orthography system I guess.

kontur commented 1 week ago

I suppose this very much has pros and cons, regardless of what such an implementation would look like. Both options are conceptually nice, having and not having a dedicated auxiliary attribute. Without it, we avoid over-crowding the attributes and leaving the reader to guess what exactly "auxiliary" means. Highlighting such characters syntactically (inline) could be more readable overall and, in the case at hand, their categories would be obvious. However, having the dedicated attribute is a clear signal of the different levels and that the database does indeed make this distinction.

> Unicode character categories would help but there may be a few exceptions where a character category doesn't match its use in a language orthography system I guess.

Yes, I was thinking about this as well, and had the same reservation. Firstly, it's just less distinct, but secondly, I too saw cases where e.g. modifier characters or apostrophe-like symbols may be auxiliary, and it is unclear whether they are punctuation or character.
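
To illustrate the reservation, here is a rough sketch that guesses the kind of a single code point from its Unicode general category, using Python's standard `unicodedata` module. The function `guess_kind` and its mapping from categories to the attribute names used above are assumptions for illustration, not Hyperglot behaviour:

```python
import unicodedata


def guess_kind(char):
    """Guess an attribute name from the Unicode general category.

    Hypothetical mapping for a single code point; it knows nothing
    about how a given orthography actually uses the character.
    """
    category = unicodedata.category(char)
    if category == "Sc":
        return "currency"
    if category.startswith("N"):
        return "numerals"
    if category.startswith("P"):
        return "punctuation"
    if category.startswith("M"):
        return "marks"
    if category.startswith("L"):
        return "characters"
    return "other"


print(guess_kind("€"))       # currency
print(guess_kind("7"))       # numerals
print(guess_kind("\u0301"))  # marks (combining acute accent)
print(guess_kind("\u2019"))  # punctuation (right single quotation mark, Pf)
print(guess_kind("\u02bc"))  # characters (modifier letter apostrophe, Lm)
```

The last two lines show the problem: two apostrophe-like symbols land in different buckets purely by category, regardless of the role either plays in a particular orthography.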

What do we think of the (...) notation proposed above? I think it would be "typographically" easy to comprehend and doesn't add more programmery syntax to editing the database files.