rosettatype / hyperglot

Hyperglot: a database and tools for detecting language support in fonts
http://hyperglot.rosettatype.com
GNU General Public License v3.0

Add Squamish (squ) + research documentation #172

Closed by justinpenner 4 weeks ago

justinpenner commented 1 month ago

Here's a preliminary entry for the Squamish language. This is my first language submission to Hyperglot, and I thought it would be helpful, for myself and others, to document my process in researching it.

For some background, I don't speak Squamish, but I live in the region this indigenous language is from. Prior to my research, I already had some familiarity with the language due to it being used prominently in place names and signage. I also frequently do graphic design work for the Squamish Nation and other local clients, which often involves typesetting in this language.

Research

I was able to find several sources: Wikipedia, FirstVoices iOS keyboard app, Typotheque's book Indigenous North American Type, and a Squamish–English dictionary from the indigenous collection at my local public library.

There were a number of differences in the orthographies documented by each of these sources, but each source gave me a fuller picture which helped me to decide what to do about the inconsistencies. The character sets I found were as follows:

23 common among all sources:

7 a e h i k l m n p s t u w x y á é í ú ◌̓ ◌̱ ḵ

6 additional characters in Wikipedia:

' o z ʔ ʼ ’

3 additional characters in Typotheque:

o z ’

7 additional characters in Squamish–English dictionary:

! , - . ? c ’

7 additional characters in FirstVoices:

! ' , - . ? c

From the above, I made the following decisions:

Result (squ.yaml)

name: Squamish
orthographies:
- autonym: Sḵwx̱wú7mesh sníchim
  base: A Á C E É H I Í K Ḵ L M N P S T U Ú W X Y a á c e é h i í k ḵ l m n p s t u ú w x y 7 ʼ ’ ''
  marks: ◌̓ ◌̱ ◌́
  script: Latin
  status: primary
  design_requirements: Curly comma accents and right quotes are preferred (used on consonants), to differentiate from acute accents (used on vowels).
source:
- Wikipedia
- Indigenous North American Type (2023), Typotheque
- Squamish-English Dictionary (2011), Peter Jacobs and Damara Jacobs
- FirstVoices iOS keyboard
speakers: 1
speakers_date: 2014
status: living
validity: preliminary
note: Research documented at https://github.com/rosettatype/hyperglot/issues/172

Have I missed anything or made any errors in relation to Hyperglot or the Squamish language? I’ll leave this issue open for a bit in case anyone has feedback, then I will submit a pull request.

kontur commented 1 month ago

Super, thanks @justinpenner, this is very valuable! Both to have your approach documented, and to include such a very local language.

All in all this looks already very good. You can also open a PR and we refine in the PR; it's often easier to comment or amend code in the PR interface.

A few pointers:

justinpenner commented 1 month ago

@kontur thanks, I've made a couple edits and submitted a PR (#173). Sadly the speaker count was indeed only 1 person in 2014, but happily, I found a new source (Canada's 2021 census) stating there are now 25 native speakers! I updated Wikipedia, too.

Can we already include digraphs and trigraphs in base, though? There are already some languages (bkm, lam, xav, tlh, esu) that have them listed, and I haven't found any problems caused by this.

kontur commented 1 month ago

Can we already include digraphs and trigraphs in base, though? There are already some languages (bkm, lam, xav, tlh, esu) that have them listed, and I haven't found any problems caused by this.

Sorry, yes, include away! I misremembered: we have already changed things so that the input no longer "vanishes" those on saving!

justinpenner commented 1 month ago

Great, I'll add them to the PR. This language has a lot of them. I agree with comments in #116 that they're not too useful or interesting for a type designer, but they're part of the orthography in this case, which is what we're documenting, and I already have the research.

moyogo commented 1 month ago

It would be useful to have the graphemes using combining marks, like m̓ n̓ l̓ x̱, no? Designers may not be aware those are used and should be handled to support Squamish.

justinpenner commented 1 month ago

@moyogo Yes, I will add those to the PR as well. Earlier I thought that the Hyperglot database was only cataloguing individual characters, but apparently base+mark pairs and multigraphs are allowed, and should be included.
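To illustrate why these base+mark pairs are worth listing, here is a minimal sketch using only Python's standard `unicodedata` module (it is not Hyperglot's own code, and the helper name `needs_mark` is hypothetical). It checks which graphemes still contain a combining mark after NFC normalization, meaning a font cannot cover them with a single precomposed codepoint and has to handle the mark itself.

```python
import unicodedata

# Graphemes typed as base letter + combining mark (decomposed input).
# U+0313 is COMBINING COMMA ABOVE, U+0331 is COMBINING MACRON BELOW.
graphemes = ["m\u0313", "n\u0313", "l\u0313", "x\u0331", "k\u0331"]


def needs_mark(grapheme: str) -> bool:
    """Return True if the grapheme still contains a combining mark after NFC."""
    composed = unicodedata.normalize("NFC", grapheme)
    return any(unicodedata.combining(ch) != 0 for ch in composed)


for g in graphemes:
    # k + U+0331 composes to the single codepoint U+1E35; the others do not compose.
    print(g, needs_mark(g))
```

Under these assumptions only the k with line below collapses to one codepoint; m̓ n̓ l̓ x̱ remain two-codepoint sequences that depend on mark positioning in the font.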

justinpenner commented 1 month ago

The pull request #173 now includes base+mark pairs and multigraphs:

  base: A AA AW AW̓ AY AY̓ Á CH CHʼ E EY EY̓ EW EW̓ É H I II IW IW̓ Í K Kʼ KW KWʼ Ḵ Ḵʼ ḴW ḴWʼ L L̓ LH M M̓ N N̓ P Pʼ S SH T Tʼ TLʼ TS TSʼ U UU UY UY̓ Ú W W̓ XW X̱ X̱W Y Y̓ a aa aw aw̓ ay ay̓ á ch chʼ e ey ey̓ ew ew̓ é h i ii iw iw̓ í k kʼ kw kwʼ ḵ ḵʼ ḵw ḵwʼ l l̓ lh m m̓ n n̓ p pʼ s sh t tʼ tlʼ ts tsʼ u uu uy uy̓ ú w w̓ xw x̱ x̱w y y̓ 7 ʼ ’ ''
  marks: ◌̓ ◌̱
  punctuation: . , - ? !

These are all listed in a pronunciation guide section at the beginning of the Squamish–English dictionary, which seems to be more complete than the orthography listed on Wikipedia.

I used ʼ U+02BC MODIFIER LETTER APOSTROPHE rather than ’ U+2019 RIGHT SINGLE QUOTATION MARK in the digraphs. It isn't standardized, so either is acceptable in everyday use of the language, but the modifier letter apostrophe is more semantically correct, I think.
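The semantic difference between the two characters can be seen in their Unicode properties. The following is a small sketch using Python's standard `unicodedata` module: U+02BC is classified as a letter (category `Lm`), while U+2019 is classified as punctuation (category `Pf`), so only the former is treated as part of a word by letter-based checks such as `str.isalpha()`.

```python
import unicodedata

# Compare the two apostrophe-like characters discussed above.
for ch in ("\u02bc", "\u2019"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# A digraph built with the modifier letter is all letters;
# the same digraph built with the quotation mark is not.
with_modifier = "k\u02bc"
with_quote = "k\u2019"
print(with_modifier.isalpha(), with_quote.isalpha())
```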

MrBrezina commented 1 month ago

Justin, have you considered mixed-case digraphs such as Aa Aw Ay Ch…?

justinpenner commented 1 month ago

Justin, have you considered mixed-case digraphs such as Aa Aw Ay Ch…?

I did think of it, but is there any usefulness of including them? I think mixed case digraphs would only be useful to include when they have their own unique codepoint like Dz U+01F2 LATIN CAPITAL LETTER D WITH SMALL LETTER Z. Otherwise mixed case digraphs aren't adding anything semantically unique, nor are they adding any new codepoints.
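The point about codepoints can be checked with a short sketch in plain Python (the function name `titlecase_adds_codepoints` is hypothetical and not part of Hyperglot). It titlecases only the first character of a grapheme and tests whether the result contains any codepoint that the all-lowercase and all-uppercase forms do not already provide.

```python
def titlecase_adds_codepoints(lower: str) -> bool:
    """Return True if the titlecase form needs a codepoint that is
    in neither the lowercase nor the uppercase form of the grapheme."""
    upper = lower.upper()
    # Titlecase only the first character, leaving the rest unchanged.
    title = lower[0].title() + lower[1:]
    known = set(lower) | set(upper)
    return not set(title) <= known


# Ordinary digraphs: "Aw" and "Chʼ" reuse codepoints from "aw"/"AW" and "chʼ"/"CHʼ".
print(titlecase_adds_codepoints("aw"))
print(titlecase_adds_codepoints("ch\u02bc"))

# The single-codepoint digraph U+01F3 titlecases to U+01F2, a codepoint of its own.
print(titlecase_adds_codepoints("\u01f3"))
```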

MrBrezina commented 1 month ago

I do not have the answer at the moment. I would expect them to be there for completeness' sake.

It would make most sense to only include one case of everything (incl. single characters), but then there is German and ß. I will ponder.

moyogo commented 1 month ago

Besides exceptions like eszett, adding uppercase or titlecase is redundant data. It can be useful for verbosity, but it can also just be automatically derived from Unicode data. Then only exceptions need to be added.

For example base="a b c", special_casing={"c": "X"}. For caseless orthographies, there could be a flag caseless=true.
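A rough sketch of how such a derivation could work, in plain Python. The field names `special_casing` and `caseless` come from the suggestion above and are not existing Hyperglot attributes; the function `derive_uppercase` is likewise hypothetical.

```python
from typing import Optional


def derive_uppercase(
    base: str,
    special_casing: Optional[dict] = None,
    caseless: bool = False,
) -> list:
    """Derive uppercase graphemes from a space-separated lowercase list.

    Entries in special_casing override the default Unicode mapping;
    caseless orthographies yield no uppercase forms at all.
    """
    if caseless:
        return []
    overrides = special_casing or {}
    return [overrides.get(g, g.upper()) for g in base.split()]


# The toy example from the comment above: "c" is mapped to "X" by exception.
print(derive_uppercase("a b c", special_casing={"c": "X"}))

# Eszett: the default Unicode uppercase mapping is "SS",
# so an orthography wanting U+1E9E would list it as an exception.
print(derive_uppercase("\u00df"))
print(derive_uppercase("\u00df", special_casing={"\u00df": "\u1e9e"}))
```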

MrBrezina commented 1 month ago

Agreed, I was thinking of something along those lines. It will make the database smaller too.

@kontur is working on better inheritance, so we could bundle that with it.

kontur commented 1 month ago

Regarding the double-capital digraphs, I think it helps to think of it not as a "caps lock" typing thing, but in terms of how a proper noun (omg, my terminology fails me here... a city, a name, etc.) might need it. Official orthographies seem to list them that way; e.g. in Hungarian and Czech I believe only the first letter is capitalized. Of course, there could be orthographic differences, so I am not categorically opposing double uppercase. From a font validation point of view these don't matter, so I am leaning towards leniency on how those are noted, but of course it would be nice to have consistency.

Regarding the overall uppercase: Yes, it is redundant, but I think we explicitly included them at some point, since the yaml files as such should resemble a "full" orthography, not just codepoints that render the full orthography once capitalized, and font checking should consider the uppercase as well. "Size" doesn't matter, I'd say. We may implement a convenience automation that adds missing upper/lower case or warns about it, if consistency is an issue. Technically it would be trivial to have only lowercase and expand the yamls with uppercase variants when parsed.
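As a sketch of what such a consistency check might look like (plain Python, assuming NFC-normalized input; `missing_case_counterparts` is a hypothetical helper, not an existing Hyperglot function), the following reports graphemes whose uppercase or lowercase counterpart is absent from a `base` list.

```python
def missing_case_counterparts(base: str) -> list:
    """List upper/lowercase counterparts absent from a space-separated base list."""
    graphemes = base.split()
    present = set(graphemes)
    missing = []
    for g in graphemes:
        for variant in (g.upper(), g.lower()):
            # Caseless entries such as "7" map to themselves and are skipped.
            if variant != g and variant not in present and variant not in missing:
                missing.append(variant)
    return missing


# "l" is listed without "L"; everything else is paired or caseless.
print(missing_case_counterparts("A a \u1e34 \u1e35 l 7"))
```

Note that a check this simple would also flag both all-caps and all-lowercase forms for any mixed-case entry, so the mixed-case question discussed above would need its own rule.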

kontur commented 4 weeks ago

Thank you @justinpenner for the contribution and for clarifying your approach. It helps us improve the instructions for new language additions and hopefully serves as a good reference for future contributors.

If more discussion regarding uppercase/digraphs is needed, a new issue is better suited.

jcrippen commented 2 weeks ago

FYI, the Tlingit entry tli.yaml uses the same U+0331 on K/k and X/x as well as on G/g so the notes there may be helpful. The use of U+0331 with these letters is relatively common among Northwest Coast language orthographies (Haida, Coast Tsimshian, Nisg̱aʼa, Gitksan, Kwakʼwala, Sechelt, etc.).