Add Squamish (squ) + research documentation

justinpenner commented 1 month ago

Here's a preliminary entry for the Squamish language. This is my first language submission to Hyperglot, and I thought it would be helpful, for myself and others, to document my process in researching it.

For some background, I don't speak Squamish, but I live in the region this indigenous language is from. Prior to my research, I already had some familiarity with the language due to it being used prominently in place names and signage. I also frequently do graphic design work for the Squamish Nation and other local clients, which often involves typesetting in this language.

Research

I was able to find several sources: Wikipedia, FirstVoices iOS keyboard app, Typotheque's book Indigenous North American Type, and a Squamish–English dictionary from the indigenous collection at my local public library.

There were a number of differences in the orthographies documented by each of these sources, but each source gave me a fuller picture which helped me to decide what to do about the inconsistencies. The character sets I found were as follows:

23 common among all sources:

7 a e h i k l m n p s t u w x y á é í ú ◌̓ ◌̱ ḵ

6 additional characters in Wikipedia:

' o z ʔ ʼ ’

3 additional characters in Typotheque:

o z ’

7 additional characters in Squamish–English dictionary:

! , - . ? c ’

7 additional characters in FirstVoices:

! ' , - . ? c
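
The comparison above can be reproduced with plain set operations. This is only a sketch: the variable and function names are made up for illustration, and it assumes that ◌̓ is U+0313, ◌̱ is U+0331, and ḵ is the precomposed U+1E35.

```python
# Character sets transcribed from the lists above.
# Assumption: the two marks are U+0313 and U+0331, and ḵ is precomposed U+1E35.
COMMON = set(
    "7 a e h i k l m n p s t u w x y "
    "\u00e1 \u00e9 \u00ed \u00fa \u0313 \u0331 \u1e35".split()
)

# Characters each source lists beyond the common set.
ADDITIONAL = {
    "Wikipedia": set("' o z \u0294 \u02bc \u2019".split()),
    "Typotheque": set("o z \u2019".split()),
    "Dictionary": set("! , - . ? c \u2019".split()),
    "FirstVoices": set("! ' , - . ? c".split()),
}

# Full character set per source, and the characters shared by every source.
sources = {name: COMMON | extra for name, extra in ADDITIONAL.items()}
common_to_all = set.intersection(*sources.values())


def sources_containing(char):
    """Return the (sorted) names of the sources that list `char`."""
    return sorted(name for name, chars in sources.items() if char in chars)


print(len(common_to_all), "characters are common to all sources")
print("c is listed by:", sources_containing("c"))
print("o is listed by:", sources_containing("o"))
```

Looking up a single character this way makes it easy to see which sources back each of the decisions that follow.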

From the above, I made the following decisions:

Keep all 23 common characters.
Keep c because it is very commonly used in Squamish, therefore its omission from two sources appears to be an oversight.
Keep Wikipedia's ʼ U+02BC MODIFIER LETTER APOSTROPHE in addition to the common ’ U+2019 RIGHT SINGLE QUOTATION MARK. The right quotation mark is more commonly used due to keyboard settings, but the modifier letter apostrophe is more semantically correct, and some Apple systems map both to the same key, inserting one or the other based on context. Also keep ' U+0027 APOSTROPHE, as it is used instead of the right quote in some keyboard layouts.
Omit o z as I have yet to see any loanwords or uncommon orthographical preferences that make use of these letters, so it seems like Wikipedia may have included them erroneously, and Wikipedia was likely Typotheque's source for including them. These letters cannot even be typed in the FirstVoices keyboard, which is further evidence that they are not required. If new evidence arises in the future, they could be added to auxiliary or even an alternate orthography.
Omit ʔ as no evidence was found for its usage at all. The Squamish glottal stop was standardized as 7 in the typewriter era.
Omit punctuation . , - ? ! as punctuation is not in scope for Hyperglot (yet).
Add uppercase forms for all letters, as they are used in Squamish similarly to English.
Added a design_requirement based on a preference I've observed (but not confirmed). I suspect this preference is linked to the norms seen in fonts that support IPA, and it therefore might be a preference among many indigenous North American languages.
Added a note linking back to this research.
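
To make the "semantically correct" point about the three apostrophe-like characters concrete, here is a small sketch using Python's standard unicodedata module. The helper name `describe` is just for illustration. The general category shows that U+02BC is classed as a letter, while the other two are punctuation.

```python
import unicodedata

# The three apostrophe-like characters discussed in the decisions above.
APOSTROPHE_CANDIDATES = ["\u02bc", "\u2019", "\u0027"]


def describe(char):
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(char):04X}", unicodedata.name(char), unicodedata.category(char))


descriptions = [describe(c) for c in APOSTROPHE_CANDIDATES]
for code, name, category in descriptions:
    # Categories starting with "L" are letters, those starting with "P" are punctuation.
    print(code, name, category)
```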

Result (squ.yaml)

name: Squamish
orthographies:
- autonym: Sḵwx̱wú7mesh sníchim
  base: A Á C E É H I Í K Ḵ L M N P S T U Ú W X Y a á c e é h i í k ḵ l m n p s t u ú w x y 7 ʼ ’ ''
  marks: ◌̓ ◌̱ ◌́
  script: Latin
  status: primary
  design_requirements: Curly comma accents and right quotes are preferred (used on consonants), to differentiate from acute accents (used on vowels).
source:
- Wikipedia
- Indigenous North American Type (2023), Typotheque
- Squamish-English Dictionary (2011), Peter Jacobs and Damara Jacobs
- FirstVoices iOS keyboard
speakers: 1
speakers_date: 2014
status: living
validity: preliminary
note: Research documented at https://github.com/rosettatype/hyperglot/issues/172

Have I missed anything or made any errors in relation to Hyperglot or the Squamish language? I’ll leave this issue open for a bit in case anyone has feedback, then I will submit a pull request.

kontur commented 1 month ago

Super, thanks @justinpenner, this is very valuable! Both to have your approach documented, and to include such a very local language.

All in all this already looks very good. You can also open a PR and we can refine it there; it's often easier to comment on or amend code in the PR interface.

A few pointers:

I think your deductions regarding the different corpus data make sense; if something seems like it might be used, it can always be listed in auxiliary, but even then only if there is evidence of actual use.
apostrophe vs right quote vs single quote vs comma accent is exactly #82: for some languages there is a clear letter in use, for others there is a canonical encoding but many are used, and for others any of a number of alternatives goes. My preference would be to use the most canonical (if one exists) in the base and add "alternates" to auxiliary as composed letters (for marks) or standalone letters (for characters) and make a note about this like you have done.
The hyperglot-save command will automatically extract marks used in the base (and auxiliary) and add them. This also means marks that are listed but not encountered in base will be implicitly required. When there are alternative ways of adding the "accent" with different marks, it is good to list the alternates in the marks explicitly like you have done, so they will be required when those are commonly encountered in writing the language. Note, though, that those marks will be required for a font to support the language. If you feel some of those base + mark combinations are less-used alternative spellings, you could also omit the mark from the marks and add the combinations with that mark as auxiliary, which will make those marks required only when checking with --support=auxiliary. Super tiny difference, though, and in this case I think having all the mark variations required by listing them in marks explicitly makes sense.
There is an unresolved issue #116 about digraphs... for now maybe you could add them as a note, so once we somehow make their input possible we can add them easily. (now hyperglot-save will split digraphs into their letters, which isn't ideal in terms of preserving and representing the orthography, even if it technically covers the required characters)
Feel free to add punctuation: . , - ? ! — it's good to have some data for when we add the attribute (working on that); is the language explicitly not using : ;, for example? Is - really used, e.g. for hyphenating words?
Interesting to have a numeral in the base!
note: Research documented at https://github.com/rosettatype/hyperglot/issues/172 👍 (you can make note: a list like the sources, if there are several).
I'd maybe add a note about the speaker count being 1 — I first suspected this was an input error until I looked up the language.
When making a PR add yourself to the contributors :)
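
For illustration, the idea of extracting the marks used in base could be sketched roughly as below. This is not Hyperglot's actual implementation; `extract_marks` is a hypothetical helper that decomposes each grapheme with NFD and collects the characters in a Unicode mark category.

```python
import unicodedata


def extract_marks(base):
    """Collect the combining marks used in a space-separated base string.

    Hypothetical helper, not the hyperglot-save code: each grapheme is
    decomposed (NFD) and every character in a mark category (M*) is kept.
    """
    marks = set()
    for grapheme in base.split():
        for ch in unicodedata.normalize("NFD", grapheme):
            if unicodedata.category(ch).startswith("M"):
                marks.add(ch)
    return marks


# Sample graphemes: á (precomposed), ḵ (precomposed), m + U+0313, x + U+0331, 7
sample = "\u00e1 \u1e35 m\u0313 x\u0331 7"
marks = extract_marks(sample)
print(sorted(f"U+{ord(m):04X}" for m in marks))
```

A mark that only occurs in an alternative spelling would not be found this way, which is why listing it explicitly in marks (or putting the combinations in auxiliary) matters.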

justinpenner commented 1 month ago

@kontur thanks, I've made a couple edits and submitted a PR (#173). Sadly the speaker count was indeed only 1 person in 2014, but happily, I found a new source (Canada's 2021 census) stating there are now 25 native speakers! I updated Wikipedia, too.

Can we already include digraphs and trigraphs in base, though? There are already some languages (bkm, lam, xav, tlh, esu) that have them listed, and I haven't found any problems caused by this.

kontur commented 1 month ago

Can we already include digraphs and trigraphs in base, though? There are already some languages (bkm, lam, xav, tlh, esu) that have them listed, and I haven't found any problems caused by this.

Sorry, yes, include away! I misremembered: we have already changed it so that saving no longer makes those "vanish"!

justinpenner commented 1 month ago

Great, I'll add them to the PR. This language has a lot of them. I agree with comments in #116 that they're not too useful or interesting for a type designer, but they're part of the orthography in this case, which is what we're documenting, and I already have the research.

moyogo commented 1 month ago

It would be useful to have the graphemes using combining marks, like m̓ n̓ l̓ x̱, no? Designers may not be aware those are used and should be handled to support Squamish.

justinpenner commented 1 month ago

@moyogo Yes, I will add those to the PR as well. Earlier I thought that the Hyperglot database was only cataloguing individual characters, but apparently base+mark pairs and multigraphs are allowed, and should be included.

justinpenner commented 1 month ago

The pull request #173 now includes base+mark pairs and multigraphs:

  base: A AA AW AW̓ AY AY̓ Á CH CHʼ E EY EY̓ EW EW̓ É H I II IW IW̓ Í K Kʼ KW KWʼ Ḵ Ḵʼ ḴW ḴWʼ L L̓ LH M M̓ N N̓ P Pʼ S SH T Tʼ TLʼ TS TSʼ U UU UY UY̓ Ú W W̓ XW X̱ X̱W Y Y̓ a aa aw aw̓ ay ay̓ á ch chʼ e ey ey̓ ew ew̓ é h i ii iw iw̓ í k kʼ kw kwʼ ḵ ḵʼ ḵw ḵwʼ l l̓ lh m m̓ n n̓ p pʼ s sh t tʼ tlʼ ts tsʼ u uu uy uy̓ ú w w̓ xw x̱ x̱w y y̓ 7 ʼ ’ ''
  marks: ◌̓ ◌̱
  punctuation: . , - ? !

These are all listed in a pronunciation guide section at the beginning of the Squamish–English dictionary, which seems to be more complete than the orthography listed on Wikipedia.

I used ʼ U+02BC MODIFIER LETTER APOSTROPHE rather than ’ U+2019 RIGHT SINGLE QUOTATION MARK in the digraphs. It isn't standardized so either is acceptable in everyday use of the language, but apostrophe modifier is more semantically correct, I think.
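
As a rough sketch of why listing these graphemes helps designers, the check below flags graphemes that a font cannot encode. The function `unsupported_graphemes` and the `font_cmap` set are hypothetical stand-ins for a real font's character map, and the check only looks at code point coverage, not at mark positioning.

```python
def unsupported_graphemes(base, supported):
    """Return the graphemes in a space-separated base string that use
    at least one code point missing from `supported` (a set of ints)."""
    return [g for g in base.split() if any(ord(ch) not in supported for ch in g)]


# Hypothetical font: a-z, the digit 7 and U+02BC, but no combining marks.
font_cmap = set(range(ord("a"), ord("z") + 1)) | {ord("7"), 0x02BC}

# A few graphemes from the list above: k kʼ kw m m̓ x̱ 7
sample_base = "k k\u02bc kw m m\u0313 x\u0331 7"
unsupported = unsupported_graphemes(sample_base, font_cmap)
print(unsupported)
```

The multigraphs pass because they reuse code points that are already covered, while the base+mark pairs fail until U+0313 and U+0331 are added.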

MrBrezina commented 1 month ago

Justin, have you considered mixed-case digraphs such as Aa Aw Ay Ch…?

justinpenner commented 1 month ago

Justin, have you considered mixed-case digraphs such as Aa Aw Ay Ch…?

I did think of it, but is there any use in including them? I think mixed-case digraphs would only be useful to include when they have their own unique codepoint, like ǲ U+01F2 LATIN CAPITAL LETTER D WITH SMALL LETTER Z. Otherwise mixed-case digraphs aren't adding anything semantically unique, nor are they adding any new codepoints.
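
A quick sketch of that distinction, using only Python's standard library (the variable names are illustrative): ǲ is a single titlecase code point, whereas a mixed-case Squamish digraph such as Aw is built entirely from code points that the all-lowercase and all-uppercase forms already contribute.

```python
import unicodedata

# ǲ is one code point with its own name and the titlecase category "Lt".
dz_title = "\u01f2"
info = {
    "name": unicodedata.name(dz_title),
    "category": unicodedata.category(dz_title),
}

# A mixed-case Squamish digraph is just two ordinary letters.
digraph = "aw"
mixed = digraph.capitalize()  # "Aw"

# Code points in the mixed-case form not already present in "aw" or "AW".
new_codepoints = set(mixed) - set(digraph) - set(digraph.upper())

print(info)
print(mixed, "adds", len(new_codepoints), "new code points")
```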

MrBrezina commented 1 month ago

I do not have the answer at the moment. I would expect them there for completeness' sake.

It would make most sense to only include one case of everything (incl. single characters), but then there is German and ß. I will ponder.

moyogo commented 1 month ago

Besides exceptions like eszett, adding uppercase or titlecase is redundant data. It can be useful for verbosity, but it can also just be automatically derived from Unicode data. Then only exceptions need to be added.

For example base="a b c", special_casing={"c": "X"}. For caseless orthographies, there could be a flag caseless=true.
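
A minimal sketch of how such a scheme might behave, assuming the proposed `special_casing` and `caseless` attributes (neither exists in Hyperglot as far as this thread shows; the function name `derive_uppercase` is also hypothetical). Python's built-in str.upper() stands in for "derived from Unicode data".

```python
def derive_uppercase(base, special_casing=None, caseless=False):
    """Derive uppercase graphemes from a space-separated lowercase base.

    Hypothetical sketch of the proposal above: default Unicode case mapping
    via str.upper(), overridden per grapheme by `special_casing`.
    """
    if caseless:
        return []
    special_casing = special_casing or {}
    upper = []
    for grapheme in base.split():
        mapped = special_casing.get(grapheme, grapheme.upper())
        # Skip caseless graphemes (e.g. "7") and avoid duplicates.
        if mapped != grapheme and mapped not in upper:
            upper.append(mapped)
    return upper


# The example from the comment above.
print(derive_uppercase("a b c", special_casing={"c": "X"}))

# German eszett: the default mapping gives "SS"; an exception can override it.
print(derive_uppercase("a \u00df"))
print(derive_uppercase("a \u00df", special_casing={"\u00df": "\u1e9e"}))
```

Only the exceptions would need to be stored in the data; everything else falls out of the default mapping.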

MrBrezina commented 1 month ago

Agreed, I was thinking of something along those lines. It will make the database smaller too.

@kontur is working on better inheritance, so we could bundle that with it.

kontur commented 1 month ago

Regarding the double capital digraphs, I think it helps to think of it not as a "caps lock" typing thing, but how an "own word" (omg, my terminology fails me here... city, name, etc) might need it. Official orthographies seem to list them as such, e.g. in Hungarian and Czech I believe it is the first letter only that is capitalized. Of course, there could be orthographic differences, so I am not categorically opposing double uppercase. From a font validation point of view these don't matter, so I am leaning towards leniency on how those are noted, but of course it would be nice to have consistency.

Regarding the overall uppercase: Yes, it is redundant, but I think we explicitly included them at some point, since the yaml files as such should resemble a "full" orthography, not just codepoints which by capitalizing render the full orthography, and font checking should consider the uppercase as well. "Size" doesn't matter, I'd say. We may implement convenience automation that adds or warns about missing upper/lower case, if consistency is an issue. Technically it would be trivial to have only lowercase and expand the yamls with uppercase variants when parsed.

kontur commented 4 weeks ago

Thank you @justinpenner for the contribution and for clarifying your approach. It helps us improve the instructions for new language additions and hopefully serves as a good reference for future contributors.

If more discussion regarding uppercase/digraphs is needed a new issue is better suited.

jcrippen commented 2 weeks ago

FYI, the Tlingit entry tli.yaml uses the same U+0331 on K/k and X/x as well as on G/g so the notes there may be helpful. The use of U+0331 with these letters is relatively common among Northwest Coast language orthographies (Haida, Coast Tsimshian, Nisg̱aʼa, Gitksan, Kwakʼwala, Sechelt, etc.).
